Annotation Model Pipeline

The Annotation Model Pipeline is a defined sequence of Annotation Models that run every time you tell Dorsal to extract metadata from a file, for example by running dorsal file scan or creating a new LocalFile instance:

CLI:

dorsal file scan "C:\examples\PDFSPEC.pdf"

Python:

from dorsal import LocalFile

# Raw string: keeps the backslashes in the Windows path literal.
lf = LocalFile(r"C:\examples\PDFSPEC.pdf")

When a file is scanned, Dorsal runs its configurable Annotation Model Pipeline on that file.

This pipeline's output is collected and assembled into a File Record object: a highly structured, validated and typed object containing metadata for that file.

The graph below shows the high-level flow of this pipeline.

graph LR
    %% === Main Graph Flow ===
    Start(Start Scan) --> Model_Base(dorsal/base<br><b>Annotation Model</b>);

    Model_Base --- sub_Start;

    %% === Subgraph (LR flow) ===
    subgraph PipelineLoop [For each Annotation Model in the Pipeline...]
        direction LR

        %% Invisible start/end nodes
        sub_Start((  )) --> CheckDeps{"Dependency Check"};

        CheckDeps -- All Satisfied / None --> RunModel(Execute<br>Annotation Model);

        %% The "Skip" path
        CheckDeps -- Dependency Not Met --- sub_End( ); 

        %% The "Run" path
        RunModel --- sub_End(( ));

    end

    %% Link the last node inside the subgraph to the merge step
    sub_End --> MergeResults(Merge All Results)
    MergeResults --> End(End Scan);

Figure: Annotation Model Pipeline Execution Flow Graph

Pipeline Execution

Let's follow the execution of the graph from Start Scan to End Scan to understand what's happening:

The Base Model (dorsal/base)
```
graph LR
    Start(Start Scan) --> Model_Base(dorsal/base<br><b>Annotation Model</b>);
```
Every scan, regardless of file type, begins by executing the dorsal/base Annotation Model.

This mandatory step is responsible for generating the file/base annotation. This annotation contains the most fundamental file information, such as:
- name
- size
- media_type (e.g., application/pdf or image/jpeg)
- extension

Later models depend on this output: the media_type and extension it provides are often what the Dependency Check step tests.
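
To make these fields concrete, here is a rough standard-library approximation of what a base record might contain. It is not Dorsal's implementation: base_fields is a hypothetical helper, and mimetypes guesses the media type from the file name alone, whereas a real scanner may inspect the file's contents.

```python
import mimetypes
from pathlib import Path


def base_fields(path: str) -> dict:
    """Hypothetical helper: approximate the file/base fields with the stdlib."""
    p = Path(path)
    # guess_type() looks only at the file name, not at the file's bytes.
    media_type, _encoding = mimetypes.guess_type(p.name)
    return {
        "name": p.name,
        "size": p.stat().st_size,  # size in bytes
        "media_type": media_type,  # e.g. "application/pdf", or None if unknown
        "extension": p.suffix,     # includes the leading dot, e.g. ".pdf"
    }
```

Whether Dorsal stores the extension with or without the leading dot is an assumption here; check a real File Record before relying on it.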

The Pipeline Loop
```
graph LR
    subgraph PipelineLoop [For each Annotation Model in the Pipeline...]
        direction LR

        %% Invisible start/end nodes
        sub_Start((  )) --> CheckDeps{"Dependency Check"};

        CheckDeps -- All Satisfied / None --> RunModel(Execute<br>Annotation Model);

        %% The "Skip" path
        CheckDeps -- Dependency Not Met --- sub_End( ); 

        %% The "Run" path
        RunModel --- sub_End(( ));

    end
```
After the dorsal/base Annotation Model succeeds, the pipeline iterates through every other Annotation Model defined in its configuration. For each model, it follows the same path:
- Dependency Check:
  
  Before a model is executed, the pipeline checks if it has any dependencies. A dependency is a rule that determines if a model should run.
  
  Dependencies include:
  - Media Type: Does the file's media_type (from the file/base record) match a rule? (e.g., run dorsal/ebook only for application/epub+zip).
  - File Extension: Does the file's extension match a list? (e.g., .epub, .mobi).
  
  If all dependencies are satisfied, or if the model has no dependencies, the pipeline proceeds to execution.
  
  If any dependency is not satisfied, the model is skipped, and the pipeline moves to the next model.
- Execute Annotation Model:
  
  If the dependencies are met, the pipeline executes the model's main() method. This is where the model performs its task. For example, the EbookAnnotationModel extracts metadata such as title, author, and publisher from an ebook file.
  
  If the model runs successfully, it returns a dictionary of data. If it fails, it can return None and set an error message.
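
The check-then-run loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Dorsal's code: models are plain dictionaries with a hypothetical "main" callable, the "extension" rule type name is assumed (the documented config only shows "media_type"), and matching is simple list membership.

```python
def dependencies_met(dependencies: list, base: dict) -> bool:
    """Return True when every rule passes, or when there are no rules."""
    for rule in dependencies:
        # Pick the base-record field that this rule type tests.
        if rule["type"] == "media_type":
            value = base.get("media_type")
        elif rule["type"] == "extension":  # assumed rule type name
            value = base.get("extension")
        else:
            return False  # unknown rule types count as unmet in this sketch
        if value not in rule["include"]:
            return False
    return True


def run_loop(models: list, base: dict) -> dict:
    """Run each model whose dependencies are met; skip the rest."""
    outputs = {}
    for model in models:
        if not dependencies_met(model.get("dependencies", []), base):
            continue  # the "skip" path
        result = model["main"]()  # the "run" path
        if result is not None:  # None signals that the model failed
            outputs[model["id"]] = result
    return outputs
```

With a base record whose media_type is application/pdf, a model restricted to application/epub+zip is skipped, a model with no dependencies always runs, and a model that returns None contributes nothing to the outputs.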

Merge All Results
```
graph LR
    MergeResults(Merge All Results) --> End(End Scan);
```
After the loop is complete, the pipeline enters its final stage. The Merge All Results step collects the outputs from all models that ran successfully (including dorsal/base).

It then assembles them into the final annotations object in the File Record.
- The dictionary returned by each model becomes the record field of its corresponding annotation.
- The model's class properties (like id, version, and variant) are used to create the source field.

For example, the output from EbookAnnotationModel (which has id: "dorsal/ebook") is placed inside the File Record at annotations.file/ebook (as defined by its schema_id in the config).

Once all results are merged, the scan is complete, and the final File Record is returned.
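
The merge step can be pictured with the following sketch. The record and source field names and the schema_id key come from the description above; the input shape, the merge_results helper, and the exact layout of source are assumptions made for illustration.

```python
def merge_results(outputs: list) -> dict:
    """Assemble per-model outputs into an annotations mapping keyed by schema_id."""
    annotations = {}
    for out in outputs:
        annotations[out["schema_id"]] = {
            "record": out["record"],  # the dictionary returned by the model
            "source": {               # built from the model's class properties
                "id": out["id"],
                "version": out["version"],
                "variant": out.get("variant"),
            },
        }
    return annotations


# Hypothetical output from one successful model run.
annotations = merge_results([
    {
        "schema_id": "file/ebook",
        "id": "dorsal/ebook",
        "version": "1.0.0",  # placeholder version
        "record": {"title": "Example Title"},
    },
])
```

Here annotations["file/ebook"]["record"] holds the model's dictionary, and annotations["file/ebook"]["source"]["id"] is "dorsal/ebook".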

Pipeline Configuration

The Annotation Model Pipeline is fully configurable.

The pipeline is defined by the [[model_pipeline]] array in the Dorsal config file.

The order of models in this array determines their execution order.

Each annotation model in the pipeline is represented as a TOML table. For example:

# --- PDF Annotation (Runs on PDF files) ---
[[model_pipeline]]
annotation_model = ["dorsal.file.annotation_models", "PDFAnnotationModel"]
schema_id = "file/pdf"
validation_model = ["dorsal.file.validators.pdf", "PDFValidationModel"]

# Dependencies: This model only runs on PDF files.
dependencies = [
    { type = "media_type", include = ["application/pdf"] },
]

Pipeline Config Fields

annotation_model: The Python class to execute.
schema_id: The key to use when saving the output to the annotations object. Also the name of the DorsalHub schema which validates the annotation content.
dependencies: (Optional) Rules that determine if the model should run.
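
The annotation_model value is a two-item list: a module path and a class name. One common way to turn such a pair into a class is a dynamic import, sketched below. That Dorsal resolves the pair exactly this way is an assumption, and resolve_class is a hypothetical helper.

```python
import importlib


def resolve_class(pair: list) -> type:
    """Import pair[0] as a module and return the attribute named pair[1]."""
    module_path, class_name = pair
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Demonstrated with a standard-library class so the snippet runs without Dorsal.
cls = resolve_class(["collections", "OrderedDict"])
```

With Dorsal installed, the same helper could be given the pair from the config, for example ["dorsal.file.annotation_models", "PDFAnnotationModel"].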

Adding a Custom Model

To add your own custom annotation model, use the dorsal.api.register_model function.

For a step-by-step tutorial, including adding the model to an annotation pipeline, see Python Guide: Building Annotation Models.

Manual Pipeline Edits Replace the Default

If you decide not to use the register_model function and would rather edit the config file directly, note that manually defining a [[model_pipeline]] in your dorsal.toml replaces the entire default pipeline; it does not extend it.

This is because the config in memory (the one Dorsal actually uses) is a combination of your global config file (which contains things like the default pipeline, and your auth information) and the project-level config file (which contains most other configurables you might wish to set). By setting [[model_pipeline]] in both, you are forcing Dorsal to choose between two pipelines, and the project-level config always has higher precedence.

To extend the pipeline rather than replace it, first copy the full default pipeline into your project-level dorsal.toml, and then append your custom model to that list.
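
The difference between replacing and extending can be modelled with plain dictionaries. This is a simplified picture of the precedence rule described above, assuming a key-by-key override at the top level; the pipeline entries are stand-ins, not the real default pipeline, and effective_config is a hypothetical helper.

```python
def effective_config(global_cfg: dict, project_cfg: dict) -> dict:
    """Top-level keys in the project config override those in the global config."""
    return {**global_cfg, **project_cfg}


# Stand-in for the default pipeline held in the global config.
global_cfg = {"model_pipeline": [{"schema_id": "file/pdf"}, {"schema_id": "file/ebook"}]}
custom = {"schema_id": "example/custom"}  # hypothetical custom model entry

# Defining only the custom model replaces the whole default pipeline.
replaced = effective_config(global_cfg, {"model_pipeline": [custom]})

# Copying the defaults first and then appending the custom model extends it.
extended = effective_config(
    global_cfg, {"model_pipeline": [*global_cfg["model_pipeline"], custom]}
)
```

In this model, replaced contains one entry and extended contains three, with the custom model running last.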

You can find the full default pipeline in the Configuration Reference or in your global config file (e.g., /home/user/.dorsal/dorsal.toml on Linux, or C:\Users\user\.dorsal\dorsal.toml on Windows).