Custom Annotation Models Part 2: Classification
This is Part 2 of a 3-part series on building your own Annotation Models.
- We were introduced to the `AnnotationModel` class
- We built a simple model, subclassing `AnnotationModel`, to count words in a text file
- We tested the model with dependencies using the `dorsal.testing` module
- We registered the model to our Annotation Model Pipeline
- Finally, we removed the model (because it's not very useful to keep)
In this chapter, we will build an Annotation Model which labels PDF documents based on their content.
SensitiveDocumentScanner
We are going to build a simple PDF document classification model called SensitiveDocumentScanner.
This model adds labels to documents if any "sensitive" key words and phrases (like "confidential" or "internal use only") are found in the text.
It outputs an `open/classification` annotation record - a suitable format for storing labels for documents.
Here is how a schema-compliant annotation record might look for a document processed by the SensitiveDocumentScanner:
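For example (this mirrors the `record` object in the test output later in this guide):

```json
{
  "labels": [
    {
      "label": "Internal"
    }
  ],
  "vocabulary": [
    "Confidential",
    "Internal"
  ]
}
```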
- This is a "positive" example (meaning one or more labels were applied)
- We include the optional `vocabulary` field, as it makes the annotation self-contained and understandable.
The logic that SensitiveDocumentScanner captures is very simple:
- Define a set of key words and phrases and their associated labels (e.g. "internal use only" -> Internal)
- Scan the text of a PDF document, looking for any of these terms
- If any are found, apply the relevant label to the document
This approach is quite general and can be adapted to many use-cases.
For simplicity, this guide uses a basic string check, but it could easily be adapted to use more sophisticated approaches.
Here is the complete code for the model:
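Below is a sketch of the full model. The helper import paths, the `main` entry-point name, and `self.path` are assumptions (Part 1 covers the `AnnotationModel` interface); `extract_pdf_text` and `build_classification_record` are the helpers discussed in the walkthrough below:

```python
from dorsal import AnnotationModel

# NOTE: helper import paths are assumptions; adjust for your dorsal version.
from dorsal import build_classification_record, extract_pdf_text

# Map each sensitive term (lower-cased) to the label it triggers.
SENSITIVE_TERMS = {
    "confidential": "Confidential",
    "internal use only": "Internal",
}

# The full set of labels this model can apply.
VOCABULARY = sorted(set(SENSITIVE_TERMS.values()))


class SensitiveDocumentScanner(AnnotationModel):
    id = "github:dorsalhub/annotation-model-examples"
    version = "1.0.0"

    def main(self):  # entry-point name assumed; see Part 1
        # One string per PDF page.
        pages = extract_pdf_text(self.path)  # `self.path` is assumed

        # Case-insensitive substring scan of every page.
        found = set()
        for page_text in pages:
            lowered = page_text.lower()
            for term, label in SENSITIVE_TERMS.items():
                if term in lowered:
                    found.add(label)

        # Build an open/classification record. Including the vocabulary
        # keeps even a "no labels" result meaningful.
        return build_classification_record(
            labels=sorted(found),
            vocabulary=VOCABULARY,
        )
```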
Schema Focus: open/classification
Unlike tagging, which simply applies a label to a file, classification tasks usually involve well-defined constraints.
The `open/classification` schema represents both the result (the applied labels) and the context (e.g. possible labels, confidence in the result).
By treating classification as a structured record rather than a loose collection of keywords, we ensure downstream systems can reliably interpret the data - whether it comes from a simple script, a human-in-the-loop workflow, or a machine learning model.
We can see these advantages in action when we apply the schema to our SensitiveDocumentScanner:
- Vocabulary: `open/classification` supports the `vocabulary` field - a list of all possible labels for the classification task. This removes ambiguity, makes positive results self-contained, and makes even a negative result meaningful:
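A negative record might look like this (whether "no labels" is represented by an empty `labels` array or an omitted field is an assumption; check the schema definition):

```json
{
  "labels": [],
  "vocabulary": [
    "Confidential",
    "Internal"
  ]
}
```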
This record tells us: "We checked for 'Confidential' and 'Internal' markers, and they are NOT present."
- Confidence Score: `open/classification` labels support confidence scoring (how certain the model is) via the `score` and `score_explanation` fields. Our string-matching approach in `SensitiveDocumentScanner` either finds a keyword or it doesn't: in this context a "confidence score" is difficult to define. However, if we were instead to use a pre-trained machine learning model for our inference, including a probabilistic `score` adds value for downstream consumers of the annotation. This standardization allows downstream systems to support any classification result, regardless of task or model complexity, without requiring code changes. For example:
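The label below is hypothetical: `score` and `score_explanation` are the schema fields named above, but the values are invented for illustration:

```json
{
  "labels": [
    {
      "label": "Internal",
      "score": 0.92,
      "score_explanation": "Classifier softmax probability"
    }
  ],
  "vocabulary": [
    "Confidential",
    "Internal"
  ]
}
```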
- Attributes: The schema supports model-specific metadata via the `attributes` object, allowing additional context within the existing validation structure. While `SensitiveDocumentScanner` applies a label to the whole file, we could easily extend it to indicate where the sensitive content was found:
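For instance, a hypothetical extension where `attributes` records the pages on which a term matched (the `pages` key, and placing `attributes` on the label rather than the record, are both assumptions):

```json
{
  "labels": [
    {
      "label": "Internal",
      "attributes": {
        "pages": [1, 2]
      }
    }
  ],
  "vocabulary": [
    "Confidential",
    "Internal"
  ]
}
```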
The main method
- We use `extract_pdf_text` to retrieve the raw text content from the document as a list of strings, where each string is the text of a single page.
- We then use a simple loop to go over each page, checking for any of our defined "sensitive" key words or phrases.
- Finally, we call `build_classification_record` to create an `open/classification` schema-compliant record containing the result.
- Note that, even in the case where no labels were found, we may wish to return a record including the `vocabulary` (the full list of labels) to show that the document was processed by the model, and no labels were found.
File type flexibility
The SensitiveDocumentScanner example is designed to label the text content of PDF documents.
We could easily open this model up to other document types by adding calls to helpers that extract text from other file types.
At run-time, attributes such as `media_type`, `extension` and `name` are available on your annotation model.
You could make use of these to perform conditional checks and use different text extraction logic depending on the document type:
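A minimal sketch of that branching, carrying over the assumptions from the full model above (`main` entry point, `self.path`); `read_text_file` is a hypothetical helper named only to illustrate the pattern:

```python
def main(self):
    # `media_type` is available on the model at run-time.
    if self.media_type == "application/pdf":
        pages = extract_pdf_text(self.path)
    elif self.media_type == "text/plain":
        pages = [read_text_file(self.path)]  # hypothetical helper
    else:
        return None  # assumed: returning None skips unsupported files
    # ... scan `pages` for sensitive terms as before ...
```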
Class variables
The `AnnotationModel` class has some optional class variables:
- `id`: A short identifier for the model, used downstream. If not set, the class name (e.g. "SensitiveDocumentScanner") is used.
- `version`: The version of the model, e.g. "1.0.0" (default: `None`)
- `variant`: An additional identifier for the model, for when more than one variant exists, e.g. "nano" or "gpu-large" (default: `None`)
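For example, the values that produce the `source` block in the test output further below:

```python
class SensitiveDocumentScanner(AnnotationModel):
    id = "github:dorsalhub/annotation-model-examples"
    version = "1.0.0"
    variant = None  # e.g. "nano" or "gpu-large" when variants exist
```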
Annotation Model id field
If you are publishing public annotations, the best way to document your annotation models is to use the `id` class variable.
For complex annotation models, where additional documentation may be valuable, a clear structure like `github:my-organization/my-annotation-model` is the most self-documenting.
Additionally, if you structure the `id` class variable to include either a `github:` or `gitlab:` prefix, then when the annotation is viewed on DorsalHub, a clickable link is created.
Testing the Model
- Let's test the model using the `run_model` function from `dorsal.testing`.
- This time, we will create a PDF document dependency.
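A sketch of the test call; the exact `run_model` signature and how the testing module creates a PDF dependency are assumptions (Part 1 covers the `dorsal.testing` helpers), and the test file is assumed to contain the phrase "internal use only":

```python
from dorsal.testing import run_model

from sensitivedocumentscanner import SensitiveDocumentScanner

# Run the model against a test PDF (call signature assumed).
result = run_model(SensitiveDocumentScanner, "./test/documents/expenses-Q3.pdf")
print(result)
```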
The result will look something like this:
{
"name": "SensitiveDocumentScanner",
"source": {
"type": "Model",
"model": "github:dorsalhub/annotation-model-examples",
"version": "1.0.0",
"variant": null
},
"record": {
"labels": [
{
"label": "Internal"
}
],
"vocabulary": [
"Confidential",
"Internal"
]
},
"schema_id": "open/classification",
"schema_version": "1.0.0",
"time_taken": 0.019788757024798542,
"error": null
}
Adding it to the Pipeline
- Now that we've tested the model, let's use the `register_model` function from `dorsal.api` to add it to our pipeline.
- Make sure that `SensitiveDocumentScanner` is defined outside of `__main__` in an importable path (e.g. in `sensitivedocumentscanner.py`)
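A sketch (the exact `register_model` signature is an assumption):

```python
from dorsal.api import register_model

from sensitivedocumentscanner import SensitiveDocumentScanner

# Register the model; its import path is written to the project config.
register_model(SensitiveDocumentScanner)
```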
Model must be importable
If you have been following this tutorial so far in Jupyter, make sure to move your SensitiveDocumentScanner class to a `.py` file before registering it to the pipeline.
Registering a model copies its import path to your project config file, so it must be defined outside of `__main__`.
e.g. `from sensitivedocumentscanner import SensitiveDocumentScanner`, where `sensitivedocumentscanner.py` is in the same directory as your main script/notebook.
You can confirm the model exists in the pipeline by calling `show_model_pipeline`, or by running `dorsal config pipeline show` in the CLI.
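For example (assuming `show_model_pipeline` is importable from `dorsal.api`, alongside `register_model`):

```python
from dorsal.api import show_model_pipeline

show_model_pipeline()
```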
Testing the pipeline
Let's try it out!
from dorsal import LocalFile
lf = LocalFile("./test/documents/expenses-Q3.pdf", overwrite_cache=False)
print(lf.to_json())
This prints the full File Record, including our new model's annotation:
📄 Scanning metadata for expenses-Q3.pdf
╭────────────────────────── File Record: expenses-Q3.pdf ────────────────────────────╮
│ │
│ Hashes │
│ SHA-256: 51d72e1186ae0b1e82ba0a4e77e10051231d8c4367f973a002e7afce79a7c094 │
│ BLAKE3: ed39870c204c1e4d82b1aab87190169611fe8f046c24f06062bd6be222035e68 │
│ │
│ File Info │
│ Full Path: /dev/test/documents/expenses-Q3.pdf │
│ Modified: 2025-03-29 10:27:33 │
│ Name: expenses-Q3.pdf │
│ Size: 107 KiB │
│ Media Type: application/pdf │
│ │
│ Tags │
│ No tags found. │
│ │
│ Pdf Info │
│ title: Expenses Quarter 3 2024 │
│ producer: PDFlib+PDI 9.1.2p1 (PHP5/Linux-x86_64) │
│ version: 1.7 │
│ page_count: 2 │
│ creation_date: 2024-10-06T17:17:45+02:00 │
│ modified_date: 2024-10-11T10:32:03+01:00 │
│ │
│ Classification Info │
│ labels: │
│ label: Internal │
│ vocabulary: Confidential, Internal │
│ │
│ │
╰────────────────────────────────────────────────────────────────────────────────────╯
Cleanup
Let's remove SensitiveDocumentScanner from our pipeline using the CLI command dorsal config pipeline remove:
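For example (the positional model-name argument is an assumption; check the CLI help for exact usage):

```shell
dorsal config pipeline remove SensitiveDocumentScanner
```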
In the final part of this series, we'll build a more complex Annotation Model for entity extraction.
➡️ Continue to 6. Custom Annotation Models Part 3: Entity Extraction