
Annotation Models

An Annotation Model is the fundamental unit of logic in the Annotation Model Pipeline.

While the Pipeline handles the flow of data, the Model is responsible for the actual work: analyzing a specific file and producing a structured Annotation.

Build your own

This document serves as the technical specification for Annotation Models.

If you want to learn how to write one from scratch, see the Python Guide: Custom Annotation Models.

The Model Interface

Internally, an Annotation Model is a Python class inheriting from dorsal.AnnotationModel. It adheres to a strict input/output contract managed by the ModelRunner.

1. The Input Contract (Attributes)

When the pipeline instantiates your model, it automatically populates specific instance attributes before calling your logic. These attributes allow your model to make decisions based on the file's basic properties without having to re-calculate them.

| Attribute | Type | Description |
|---|---|---|
| self.file_path | str | The absolute path to the file on disk. |
| self.media_type | str | The IANA Media Type (e.g., application/pdf, text/plain) identified by the Base model. |
| self.extension | str | The file extension (lowercase, e.g., .docx). |
| self.size | int | The file size in bytes. |
| self.hash | str | The SHA-256 hash of the file. |
| self.name | str | The filename (e.g., report.pdf). |

2. The Output Contract (Return Values)

The model's main() method must return one of two types:

  1. dict: A dictionary containing the extracted metadata.
    • This dictionary must validate against the schema ID configured for this model in the pipeline.
    • If validation fails, the error is logged, and the annotation is discarded.
  2. None: Indicates that the model ran successfully but found no relevant data, or encountered a handled error.
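As an illustration of the two contracts together, here is a minimal sketch of a model that reads the pre-populated attributes and returns either a dict or None. The AnnotationModel class below is a local stand-in so the example runs without the library installed; in a real project you would inherit from dorsal.AnnotationModel and let the pipeline populate the attributes. The TextStatsModel name, its size cutoff, and the line_count key are hypothetical.

```python
class AnnotationModel:
    """Local stand-in for dorsal.AnnotationModel (assumption: the real
    pipeline populates these attributes itself, not via this constructor)."""

    def __init__(self, file_path, media_type, extension, size, hash, name):
        self.file_path = file_path
        self.media_type = media_type
        self.extension = extension
        self.size = size
        self.hash = hash
        self.name = name


class TextStatsModel(AnnotationModel):
    """Hypothetical model: counts lines in small plain-text files."""

    MAX_BYTES = 1_000_000  # illustrative cutoff, not a library default

    def main(self):
        # Decide from the pre-populated attributes; nothing is recomputed.
        if self.media_type != "text/plain" or self.size > self.MAX_BYTES:
            return None  # ran successfully, but found nothing relevant
        with open(self.file_path, encoding="utf-8", errors="replace") as fh:
            line_count = sum(1 for _ in fh)
        # Made-up key; a real dict must validate against the configured schema.
        return {"line_count": line_count}
```

Returning None for an unsupported media type keeps the model safe to run even when no dependency rules filter its input.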

3. Error Handling

If a model encounters an issue (e.g., a corrupt PDF or a missing dependency), it should not raise an exception, because an unhandled exception halts the pipeline. Instead, it should record the problem with the set_error method and return None.

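def main(self):
    try:
        # ... extraction logic that builds the annotation dict ...
        annotation = {}
        return annotation
    except Exception as e:
        # Record the failure instead of raising, so the pipeline keeps running
        self.set_error(f"Processing failed: {e}")
        return None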

Model Identity

To ensure annotations are unique and traceable, every model class defines three identity fields. These are stored in the source field of the final Annotation.

| Field | Description | Example |
|---|---|---|
| id | A global identifier for the model. | github:dorsalhub/pdf-model |
| version | The semantic version of the model logic. | 1.2.0 |
| variant | (Optional) The specific engine or configuration used. | layoutlm-v3 |

Updates vs. New Records

DorsalHub uses the combination of all three fields (id, version, variant) to decide if an annotation is new or an update.

  • Same Identity = Update: If you run the exact same model (same id, version, and variant) on the same file twice, the second run will overwrite the first.
  • Different Identity = New Record: If you change any of the three fields (e.g. bumping the version or changing the variant), the new result is saved as a separate record alongside the old one.
  • User Scoped: Your annotations are yours. Even if another user runs the exact same model on the same file, it will not overwrite your data.
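The rules above behave like a lookup keyed on the user, the file, and the three identity fields. The following is a rough sketch of that behaviour using an in-memory dict; it is not DorsalHub's storage code, and RecordKey and save_annotation are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple, Optional


class RecordKey(NamedTuple):
    """Hypothetical composite key: who ran which model on which file."""
    user: str
    file_hash: str
    model_id: str
    version: str
    variant: Optional[str]


def save_annotation(store: dict, key: RecordKey, annotation: dict) -> str:
    """Store the annotation and report whether it replaced an existing record."""
    outcome = "update" if key in store else "new"
    store[key] = annotation
    return outcome


store = {}
base = RecordKey("alice", "abc123", "github:dorsalhub/pdf-model", "1.2.0", "layoutlm-v3")

first = save_annotation(store, base, {"pages": 3})
# Same user, file, and identity: overwrites the first result.
second = save_annotation(store, base, {"pages": 4})
# Bumped version: stored alongside the old record.
third = save_annotation(store, base._replace(version="1.3.0"), {"pages": 4})
# Different user: never touches alice's records.
fourth = save_annotation(store, base._replace(user="bob"), {"pages": 9})

print(first, second, third, fourth)  # new update new new
```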

If you set your model's id field with a prefix of either github: or gitlab: followed by an organization/repo format, DorsalHub will automatically generate a clickable link to the source code when displaying the annotation.

  • Good: github:dorsalhub/annotation-model-examples
  • Valid (but no link): company.internal.models.classifier
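How DorsalHub builds the link is not specified here, so the following is only a sketch of the idea: recognize the prefix and the organization/repo shape, then map them to a URL. The source_link function, the host URLs, and the allowed characters in the regular expression are all assumptions.

```python
import re
from typing import Optional

# Assumed hosts for the two supported prefixes.
_HOSTS = {"github": "https://github.com", "gitlab": "https://gitlab.com"}
# Assumed shape: <prefix>:<organization>/<repo>
_ID_PATTERN = re.compile(r"^(github|gitlab):([\w.-]+)/([\w.-]+)$")


def source_link(model_id: str) -> Optional[str]:
    """Return a source URL for linkable ids, or None for any other id."""
    match = _ID_PATTERN.match(model_id)
    if match is None:
        return None  # still a valid id, it just gets no link
    prefix, org, repo = match.groups()
    return f"{_HOSTS[prefix]}/{org}/{repo}"


print(source_link("github:dorsalhub/annotation-model-examples"))
print(source_link("company.internal.models.classifier"))
```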

Dependencies

Dependencies are configuration rules defined in dorsal.toml that control when a specific model should run.

If a model has no dependencies, it runs on every file scan. Dependencies allow you to restrict execution to specific file types, sizes, or names.

1. Media Type

Restricts execution based on the file's IANA Media Type. This is the preferred method for file-type filtering.

dependencies = [
    { type = "media_type", include = ["application/pdf", "image"] }
]

2. File Extension

Restricts execution to specific file extensions. Useful for formats where Media Type detection is ambiguous.

dependencies = [
    { type = "file_extension", extensions = [".epub", ".mobi"] }
]

3. File Size

Prevents heavy models from running on files that are too large (or too small). Sizes can be given as an integer number of bytes or as a string with a unit suffix (KB, MB, GB), such as "50MB".

dependencies = [
    { type = "file_size", max_size = "50MB" }
]
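To show what a size limit such as "50MB" implies, here is a small sketch that normalizes both accepted forms to bytes and compares them with a file's size. It is not the library's parser: to_bytes and within_max are hypothetical helpers, and treating 1 KB as 1024 bytes and the limit as inclusive are assumptions (the library may use 1000-based units or a strict comparison).

```python
import re

# Assumption: binary multiples. The library may use decimal ones instead.
_UNITS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3}


def to_bytes(size) -> int:
    """Convert an int (bytes) or a string like '50MB' to a byte count."""
    if isinstance(size, int):
        return size
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(KB|MB|GB)\s*", size, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"Unrecognized size: {size!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit.upper()])


def within_max(file_size: int, max_size) -> bool:
    """True if the file is no larger than the limit (inclusive, by assumption)."""
    return file_size <= to_bytes(max_size)
```

With these assumptions, a file is compared against 50 * 1024 * 1024 bytes when max_size is "50MB".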

4. File Name

Restricts execution based on a regular expression matched against the filename.

dependencies = [
    { type = "file_name", pattern = "^INVOICE_.*$" }
]
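Before adding a pattern to the configuration, it can help to try it against a few sample filenames with Python's re module. This sketch assumes the pattern is applied to the bare filename (not the full path) and is case-sensitive; name_matches is a hypothetical helper, not part of the library.

```python
import re

pattern = re.compile(r"^INVOICE_.*$")


def name_matches(filename: str) -> bool:
    """True if the filename satisfies the pattern."""
    return pattern.search(filename) is not None


for candidate in ["INVOICE_2024.pdf", "invoice_2024.pdf", "old_INVOICE_1.pdf"]:
    print(candidate, name_matches(candidate))
```

Only the first name matches: the second differs in case, and the third does not start with INVOICE_ because of the ^ anchor.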

Strict Mode

By default, dependencies are silent (silent = true). If a dependency is not met, the model is skipped, and the pipeline continues.

If you set silent = false, the dependency becomes strict. If a strict dependency is not met, the pipeline halts with an error.

# This pipeline halts with an error if it encounters a non-PDF file
dependencies = [
    { type = "media_type", include = ["application/pdf"], silent = false }
]
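The difference between silent and strict dependencies can be sketched as a small decision function. This is an illustration of the behaviour described above, not the ModelRunner's actual code: should_run, DependencyNotMetError, and the dict-with-a-check-function representation are hypothetical.

```python
class DependencyNotMetError(Exception):
    """Hypothetical error type; the library's own exception may differ."""


def should_run(dependencies, file_info) -> bool:
    """Return True to run the model, False to skip it silently.

    Raises DependencyNotMetError if any strict (silent=False) dependency fails.
    """
    skip = False
    for dep in dependencies:
        if dep["check"](file_info):
            continue  # dependency met
        if not dep.get("silent", True):
            raise DependencyNotMetError(f"Strict dependency not met: {dep['type']}")
        skip = True  # silent failure: skip the model, keep the pipeline going
    return not skip


def is_pdf(file_info) -> bool:
    return file_info["media_type"] == "application/pdf"


silent_dep = {"type": "media_type", "check": is_pdf}  # silent defaults to True
strict_dep = {"type": "media_type", "check": is_pdf, "silent": False}

print(should_run([silent_dep], {"media_type": "text/plain"}))  # False
```

Calling should_run([strict_dep], {"media_type": "text/plain"}) raises the error instead of returning False, which mirrors a strict dependency stopping the pipeline.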

Using Dependencies

The easiest way to manage dependencies is with the dorsal.file.dependencies module.

Use dependencies to add constraints when you are testing a model or adding one to the pipeline.

Creating a Media Type Dependency
from dorsal.file.dependencies import make_media_type_dependency

dependency = make_media_type_dependency(include=["application/pdf"])
print(dependency.model_dump_json(indent=2))

This will print the following:

{
  "type": "media_type",
  "checker": [
    "dorsal.file.configs.model_runner",
    "check_media_type_dependency"
  ],
  "silent": true,
  "pattern": null,
  "include": [
    "application/pdf"
  ],
  "exclude": null
}
  • type: There are four possible types: extension, file_name, file_size and media_type
  • checker: This is the importable path for the function which validates the dependency
  • silent: When false, if the dependency is not met, the pipeline will stop and raise an error. Default is true
  • pattern: file_name and media_type dependencies support regular expressions for string matching
  • include: Positive filter for media_type: the file's media type must match one of the values in the array
  • exclude: Negative filter for media_type: the file's media type must not match any of the values in the array