Annotation Models
An Annotation Model is the fundamental unit of logic in the Annotation Model Pipeline.
While the Pipeline handles the flow of data, the Model is responsible for the actual work: analyzing a specific file and producing a structured Annotation.
Build your own
This document serves as the technical specification for Annotation Models.
If you want to learn how to write one from scratch, see the Python Guide: Custom Annotation Models.
The Model Interface
Internally, an Annotation Model is a Python class inheriting from dorsal.AnnotationModel. It adheres to a strict input/output contract managed by the ModelRunner.
1. The Input Contract (Attributes)
When the pipeline instantiates your model, it automatically populates specific instance attributes before calling your logic. These attributes allow your model to make decisions based on the file's basic properties without having to re-calculate them.
| Attribute | Type | Description |
|---|---|---|
| self.file_path | str | The absolute path to the file on disk. |
| self.media_type | str | The IANA Media Type (e.g., application/pdf, text/plain) identified by the Base model. |
| self.extension | str | The file extension (lowercase, e.g., .docx). |
| self.size | int | The file size in bytes. |
| self.hash | str | The SHA-256 hash of the file. |
| self.name | str | The filename (e.g., report.pdf). |
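To make the attribute values concrete, here is a minimal stdlib-only sketch that derives comparable values for a file. It is not how the pipeline populates them: describe_file is a hypothetical helper, and mimetypes.guess_type only guesses from the extension, whereas the real media type is identified by the Base model.

```python
import hashlib
import mimetypes
from pathlib import Path


def describe_file(path: str) -> dict:
    """Hypothetical helper: compute values analogous to the model attributes."""
    p = Path(path).resolve()
    data = p.read_bytes()
    # Extension-based guess only; the pipeline's Base model does the real detection.
    media_type, _ = mimetypes.guess_type(p.name)
    return {
        "file_path": str(p),                       # absolute path
        "media_type": media_type,                  # may be None if unknown
        "extension": p.suffix.lower(),             # e.g. ".docx"
        "size": len(data),                         # bytes
        "hash": hashlib.sha256(data).hexdigest(),  # SHA-256 hex digest
        "name": p.name,                            # e.g. "report.pdf"
    }
```

Because the pipeline already supplies these values, a model can branch on self.size or self.media_type without reading the file a second time.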
2. The Output Contract (Return Values)
The model's main() method must return one of two types:
- dict: A dictionary containing the extracted metadata.
  - This dictionary must validate against the schema ID configured for this model in the pipeline.
  - If validation fails, the error is logged, and the annotation is discarded.
- None: Indicates that the model ran successfully but found no relevant data, or encountered a handled error.
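The two outcomes can be pictured with a small sketch of how a runner might treat the value returned by main(). Everything here is a hypothetical stand-in: accept_result is not the ModelRunner's API, and require_title is a toy validator used in place of a real schema check.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("annotation_sketch")


def accept_result(
    result: Optional[dict], validate: Callable[[dict], None]
) -> Optional[dict]:
    """Hypothetical stand-in for the runner's handling of main()'s return value."""
    if result is None:
        # The model found nothing relevant, or it reported a handled error.
        return None
    try:
        validate(result)  # assumed to raise ValueError on a schema mismatch
    except ValueError as exc:
        # Mirror the documented behavior: log the error, discard the annotation.
        logger.error("Validation failed, discarding annotation: %s", exc)
        return None
    return result


def require_title(record: dict) -> None:
    """Toy validator standing in for a real schema check."""
    if not isinstance(record.get("title"), str):
        raise ValueError("'title' must be a string")
```

With these definitions, accept_result({"title": "Q3 report"}, require_title) returns the dictionary unchanged, while a record with a non-string title is logged and dropped.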
3. Error Handling
If a model encounters an issue (e.g., a corrupt PDF or a missing dependency), it should not raise an exception, because an unhandled exception halts the pipeline. Instead, it should record the problem with the set_error method and return None.
def main(self):
    try:
        # ... logic ...
    except Exception as e:
        self.set_error(f"Processing failed: {e}")
        return None
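A fuller version of this pattern is sketched below so that it can run without dorsal installed. StubModel is a hypothetical stand-in for dorsal.AnnotationModel, and the assumption that set_error simply stores a message is mine, not the library's documented behavior.

```python
import json
from typing import Optional


class StubModel:
    """Hypothetical stand-in for dorsal.AnnotationModel so the sketch runs alone."""

    def __init__(self, file_path: str):
        self.file_path = file_path
        self.error: Optional[str] = None

    def set_error(self, message: str) -> None:
        # Assumption: the real method records a message for the runner to report.
        self.error = message


class JsonKeysModel(StubModel):
    """Example model: list the top-level keys of a JSON object file."""

    def main(self) -> Optional[dict]:
        try:
            with open(self.file_path, "r", encoding="utf-8") as fh:
                payload = json.load(fh)
        except (OSError, ValueError) as exc:
            # Report the failure instead of raising, then signal "no annotation".
            self.set_error(f"Processing failed: {exc}")
            return None
        if not isinstance(payload, dict):
            return None  # ran fine, but nothing relevant to annotate
        return {"keys": sorted(payload)}
```

Catching the narrowest exceptions the logic can actually raise keeps genuine programming errors visible during development, while still protecting the pipeline from bad input files.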
Model Identity
To ensure annotations are unique and traceable, every model class defines three identity fields. These are stored in the source field of the final Annotation.
| Field | Description | Example |
|---|---|---|
| id | A global identifier for the model. | github:dorsalhub/pdf-model |
| version | The semantic version of the model logic. | 1.2.0 |
| variant | (Optional) The specific engine or configuration used. | layoutlm-v3 |
Updates vs. New Records
DorsalHub uses the combination of all three fields (id, version, variant) to decide if an annotation is new or an update.
- Same Identity = Update: If you run the exact same model (same id, version, and variant) on the same file twice, the second run will overwrite the first.
- Different Identity = New Record: If you change any of the three fields (e.g. bumping the version or changing the variant), the new result is saved as a separate record alongside the old one.
- User Scoped: Your annotations are yours. Even if another user runs the exact same model on the same file, it will not overwrite your data.
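These rules amount to treating the user, the file, and the three identity fields as a composite key. The sketch below illustrates that idea with a hypothetical in-memory AnnotationStore; it says nothing about how DorsalHub actually stores records.

```python
from typing import Dict, Optional, Tuple

# (user, file hash, model id, version, variant)
Key = Tuple[str, str, str, str, Optional[str]]


class AnnotationStore:
    """Hypothetical in-memory store illustrating the update rule."""

    def __init__(self) -> None:
        self.records: Dict[Key, dict] = {}

    def save(
        self,
        user: str,
        file_hash: str,
        model_id: str,
        version: str,
        variant: Optional[str],
        annotation: dict,
    ) -> str:
        key = (user, file_hash, model_id, version, variant)
        # Same key: overwrite in place. Any differing field: a separate record.
        outcome = "update" if key in self.records else "new"
        self.records[key] = annotation
        return outcome
```

Saving twice with identical arguments returns "new" and then "update"; changing the version, the variant, or the user returns "new" again and leaves the earlier record in place.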
Discoverability Links
If your model's id field starts with the prefix github: or gitlab:, followed by a path in organization/repo form, DorsalHub will automatically generate a clickable link to the source code when displaying the annotation.
- Good: github:dorsalhub/annotation-model-examples
- Valid (but no link): company.internal.models.classifier
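A minimal sketch of this mapping follows. The function source_link, the accepted character set, and the exact URL layout are assumptions made for illustration; the source only states that a link is generated for ids in this form.

```python
import re
from typing import Optional

# Assumed shape: "<host prefix>:<organization>/<repo>".
_ID_PATTERN = re.compile(r"^(github|gitlab):([\w.-]+)/([\w.-]+)$")
_HOSTS = {"github": "https://github.com", "gitlab": "https://gitlab.com"}


def source_link(model_id: str) -> Optional[str]:
    """Return a repository URL for linkable ids, or None for any other id."""
    match = _ID_PATTERN.match(model_id)
    if match is None:
        return None  # still a valid id, just not linkable
    host, org, repo = match.groups()
    return f"{_HOSTS[host]}/{org}/{repo}"
```

Under these assumptions, the "Good" example above resolves to a repository URL, and the dotted internal id returns None.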
Dependencies
Dependencies are configuration rules defined in dorsal.toml that control when a specific model should run.
If a model has no dependencies, it runs on every file scan. Dependencies allow you to restrict execution to specific file types, sizes, or names.
1. Media Type
Restricts execution based on the file's IANA Media Type. This is the preferred method for file-type filtering.
2. File Extension
Restricts execution to specific file extensions. Useful for formats where Media Type detection is ambiguous.
3. File Size
Prevents heavy models from running on files that are too large (or too small). Accepts an integer number of bytes or a string with a unit (KB, MB, GB).
4. File Name
Restricts execution based on a Regex pattern match against the filename.
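The four checks can be approximated with plain Python predicates, sketched below. None of these functions belong to dorsal: the names, the min_size and max_size parameters, and the choice of 1 KB = 1024 bytes are all assumptions, since the source does not specify them.

```python
import re
from typing import Optional, Sequence, Union

# Assumption: binary multiples (1 KB = 1024 bytes); the source does not say which.
_UNITS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3}


def parse_size(value: Union[int, str]) -> int:
    """Convert an integer byte count or a string such as '10MB' to bytes."""
    if isinstance(value, int):
        return value
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(KB|MB|GB)\s*", value, re.IGNORECASE)
    if match is None:
        raise ValueError(f"Unrecognized size: {value!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit.upper()])


def media_type_ok(media_type: str, include: Sequence[str]) -> bool:
    """1. Media Type: the detected type must be in the allowed list."""
    return media_type in include


def extension_ok(extension: str, allowed: Sequence[str]) -> bool:
    """2. File Extension: case-insensitive membership test."""
    return extension.lower() in {ext.lower() for ext in allowed}


def size_ok(
    size: int,
    min_size: Optional[Union[int, str]] = None,
    max_size: Optional[Union[int, str]] = None,
) -> bool:
    """3. File Size: the size in bytes must fall within the optional bounds."""
    if min_size is not None and size < parse_size(min_size):
        return False
    if max_size is not None and size > parse_size(max_size):
        return False
    return True


def name_ok(name: str, pattern: str) -> bool:
    """4. File Name: the regex must match somewhere in the filename."""
    return re.search(pattern, name) is not None
```

For instance, size_ok(500, max_size="1KB") passes while size_ok(2000, max_size="1KB") fails, and name_ok("report_2024.pdf", r"^report_\d{4}") matches only names that start with "report_" and a four-digit year.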
Strict Mode
By default, dependencies are silent (silent = true). If a dependency is not met, the model is skipped, and the pipeline continues.
If you set silent = false, the dependency becomes strict. If a strict dependency is not met, the pipeline halts with an error.
# This pipeline will halt with an error if it encounters a non-PDF file
dependencies = [
    { type = "media_type", include = ["application/pdf"], silent = false }
]
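The difference between silent and strict dependencies can be sketched as a small decision function. This is an illustration only: should_run, the dictionary layout with a "check" callable, and DependencyNotMetError are hypothetical and do not reflect the pipeline's internals.

```python
from typing import Callable, Dict, List


class DependencyNotMetError(RuntimeError):
    """Hypothetical error type; the pipeline's real exception is not documented here."""


def should_run(dependencies: List[Dict], file_info: Dict) -> bool:
    """Return True if every dependency passes, False if a silent one fails.

    Each dependency is a dict with a 'type' label, a 'check' callable that
    receives file_info, and an optional 'silent' flag (default True).
    """
    for dep in dependencies:
        check: Callable[[Dict], bool] = dep["check"]
        if check(file_info):
            continue
        if dep.get("silent", True):
            return False  # skip this model; the pipeline carries on
        # Strict dependency: stop everything with an error.
        raise DependencyNotMetError(f"Strict dependency not met: {dep['type']}")
    return True
```

A silent dependency that fails simply yields False for that model, whereas the same dependency with silent set to False raises and stops the run.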
Using Dependencies
The easiest way to manage dependencies is with the dorsal.file.dependencies module.
Use it to build dependency constraints when you are testing a model or adding a model to the pipeline. For example, to restrict a model to PDF files:
from dorsal.file.dependencies import make_media_type_dependency
dependency = make_media_type_dependency(include=["application/pdf"])
print(dependency.model_dump_json(indent=2))
This will print the following:
{
  "type": "media_type",
  "checker": [
    "dorsal.file.configs.model_runner",
    "check_media_type_dependency"
  ],
  "silent": true,
  "pattern": null,
  "include": [
    "application/pdf"
  ],
  "exclude": null
}
- type: There are four possible types: extension, file_name, file_size and media_type
- checker: This is the importable path for the function which validates the dependency
- silent: When false, if the dependency is not met, the pipeline will stop and raise an error. Default is true
- pattern: file_name and media_type dependencies support regular expressions for string matching
- include: Positive filter for media_type: the file's media type must match one of the values in the array
- exclude: Negative filter for media_type: the file's media type must not match any of the values in the array
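The checker field is a two-element pair of module path and function name. One plausible way to turn such a pair into a callable is with importlib, as sketched here. resolve_checker is a hypothetical helper, not part of dorsal, and it is demonstrated with a standard-library function so that it runs without dorsal installed.

```python
import importlib
from typing import Callable, Sequence


def resolve_checker(checker: Sequence[str]) -> Callable:
    """Import the module named by checker[0] and return the attribute checker[1]."""
    module_name, function_name = checker
    module = importlib.import_module(module_name)
    return getattr(module, function_name)


# Stand-in pair from the standard library, in the same [module, function] shape.
basename = resolve_checker(["os.path", "basename"])
print(basename("/tmp/report.pdf"))  # report.pdf
```

Storing the checker as an importable path rather than a function object keeps the dependency serializable, which is consistent with it appearing in the JSON output above.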