Annotation Model Pipeline
The Annotation Model Pipeline is a defined sequence of Annotation Models that run every time you tell Dorsal to extract metadata from a file, for example by running dorsal file scan or creating a new LocalFile instance.
When a file is scanned, under the hood Dorsal is executing its configurable Annotation Model Pipeline for that file.
This pipeline's output is collected and assembled into a File Record object: a highly structured, validated and typed object containing metadata for that file.
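For orientation, the snippet below sketches how a scan might be triggered from Python and how a couple of fields could be read back from the resulting File Record. The import path and the attribute names used for access are assumptions made for illustration; the CLI equivalent is dorsal file scan.

# Hypothetical usage sketch: triggering the pipeline and reading the result.
# The import path, the file_record attribute, and the annotation access
# pattern are assumptions, not the confirmed Dorsal API.
from dorsal.file import LocalFile  # import path assumed

local_file = LocalFile("report.pdf")            # creating a LocalFile runs the pipeline
file_record = local_file.file_record            # assumed accessor for the File Record

base = file_record.annotations["file/base"]     # output of the dorsal/base model
print(base.record["media_type"])                # e.g. "application/pdf"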
The graph below shows the high-level flow of this pipeline.
graph LR
%% === Main Graph Flow ===
Start(Start Scan) --> Model_Base(dorsal/base<br><b>Annotation Model</b>);
Model_Base --- sub_Start;
%% === Subgraph (LR flow) ===
subgraph PipelineLoop [For each Annotation Model in the Pipeline...]
direction LR
%% Invisible start/end nodes
sub_Start(( )) --> CheckDeps{"Dependency Check"};
CheckDeps -- All Satisfied / None --> RunModel(Execute<br>Annotation Model);
%% The "Skip" path
CheckDeps -- Dependency Not Met --- sub_End(( ));
%% The "Run" path
RunModel --- sub_End(( ));
end
%% === Link from the subgraph exit to the merge step ===
sub_End --> MergeResults(Merge All Results)
MergeResults --> End(End Scan);
Figure: Annotation Model Pipeline Execution Flow Graph
Pipeline Execution
Let's follow the execution of the graph from Start Scan to End Scan to understand what's happening:
1. The Base Model (dorsal/base)

Every scan, regardless of file type, begins by executing the dorsal/base Annotation Model. This mandatory step is responsible for generating the file/base annotation, which contains the most fundamental file information, such as:

- name
- size
- media_type (e.g., application/pdf or image/jpeg)
- extension

The output of this annotation model is useful, as the media_type and extension it provides are often used by subsequent models in the Dependency Check step; a sketch of this base annotation follows below.
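The exact field set shown here follows the list above, but the value formats are assumptions for illustration rather than a definitive schema.

# Hypothetical example of a dorsal/base result for a PDF, shown as the plain
# dictionary an annotation model returns. Field names follow the list above;
# the concrete values and units are assumptions.
base_record = {
    "name": "quarterly-report.pdf",
    "size": 1482331,                    # bytes (assumed unit)
    "media_type": "application/pdf",
    "extension": ".pdf",
}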
2. The Pipeline Loop

After the file/base annotation model succeeds, the pipeline iterates through every other Annotation Model defined in its configuration. For each model, it follows the same path:
- Dependency Check: Before a model is executed, the pipeline checks whether it has any dependencies. A dependency is a rule that determines if a model should run. Dependencies include:

  - Media Type: Does the file's media_type (from the file/base record) match a rule? (e.g., run dorsal/ebook only for application/epub+zip)
  - File Extension: Does the file's extension match a list? (e.g., .epub, .mobi)

  If all dependencies are satisfied, or if the model has no dependencies, the pipeline proceeds to execution. If any dependency is not satisfied, the model is skipped, and the pipeline moves on to the next model. A sketch of this check-then-execute loop appears after this step.
- Execute Annotation Model: If dependencies are met, the pipeline executes the model's main() method. This is where the model performs its task; for example, the EbookAnnotationModel extracts metadata like title, author, and publisher from an ebook file. If the model runs successfully, it returns a dictionary of data. If it fails, it can return None and set an error message.
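The following simplified Python sketch illustrates that loop. The data structures and attribute names (dependencies, schema_id, main()) mirror the concepts and config fields described in this document, but this is an illustration rather than Dorsal's actual internals.

# Simplified sketch of the check-then-execute loop described above.
# Illustrative only: the data structures and attribute names are assumptions
# modelled on the configuration fields, not Dorsal's real implementation.

def dependency_satisfied(dep: dict, base_record: dict) -> bool:
    # A dependency is a rule such as {"type": "media_type", "include": [...]}.
    if dep["type"] == "media_type":
        return base_record["media_type"] in dep["include"]
    if dep["type"] == "extension":  # type key for extension rules is assumed
        return base_record["extension"] in dep["include"]
    return True

def run_pipeline(models, base_record: dict) -> dict:
    results = {"file/base": base_record}
    for model in models:
        # Dependency Check: skip the model if any rule is not met.
        if not all(dependency_satisfied(d, base_record) for d in model.dependencies):
            continue
        # Execute Annotation Model: main() returns a dict, or None on failure.
        output = model.main()
        if output is not None:
            results[model.schema_id] = output
    return results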
3. Merge All Results

After the loop is complete, the pipeline enters its final stage. The Merge All Results step collects the outputs from all models that ran successfully (including dorsal/base) and assembles them into the final annotations object in the File Record:

- The dictionary returned by each model becomes the record field of its corresponding annotation.
- The model's class properties (like id, version, and variant) are used to create the source field.

For example, the output from EbookAnnotationModel (which has id: "dorsal/ebook") is placed inside the File Record at annotations.file/ebook (as defined by its schema_id in the config). Once all results are merged, the scan is complete, and the final File Record is returned; the merged structure is sketched below.
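The values, version numbers, and exact nesting in this sketch are assumptions based on the description above, not the real File Record schema.

# Hypothetical shape of the merged annotations object for an ebook file.
# Keys come from each model's schema_id; the concrete values, version
# numbers, and exact nesting are assumptions.
annotations = {
    "file/base": {
        "record": {
            "name": "novel.epub",
            "size": 734003,
            "media_type": "application/epub+zip",
            "extension": ".epub",
        },
        "source": {"id": "dorsal/base", "version": "1.0.0", "variant": None},
    },
    "file/ebook": {
        "record": {
            "title": "Example Novel",
            "author": "A. Writer",
            "publisher": "Example Press",
        },
        "source": {"id": "dorsal/ebook", "version": "1.0.0", "variant": None},
    },
}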
Pipeline Configuration
The Annotation Model Pipeline is fully configurable.
The pipeline is defined by the [[model_pipeline]] array in the Dorsal config file.
The order of models in this array determines their execution order.
Each annotation model in the pipeline is represented as a TOML table. For example:
# --- PDF Annotation (Runs on PDF files) ---
[[model_pipeline]]
annotation_model = ["dorsal.file.annotation_models", "PDFAnnotationModel"]
schema_id = "file/pdf"
validation_model = ["dorsal.file.validators.pdf", "PDFValidationModel"]
# Dependencies: This model only runs on PDF files.
dependencies = [
{ type = "media_type", include = ["application/pdf"] },
]
Pipeline Config Fields
- annotation_model: The Python class to execute.
- schema_id: The key to use when saving the output to the annotations object. Also the name of the DorsalHub schema which validates the annotation content.
- dependencies: (Optional) Rules that determine if the model should run.
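Note that the annotation_model value is a two-element [module path, class name] pair. Purely for intuition, one plausible way such a pair could be resolved into a Python class is sketched below; this is illustrative and not necessarily how Dorsal loads models internally.

# Illustrative only: one way a ["module.path", "ClassName"] pair from the
# config could be resolved into a class. Not necessarily Dorsal's actual
# loading mechanism.
import importlib

def load_annotation_model(entry):
    module_path, class_name = entry
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

# e.g. the PDF entry from the example config above:
pdf_model_cls = load_annotation_model(["dorsal.file.annotation_models", "PDFAnnotationModel"])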
Adding a Custom Model
If you want to add your own custom annotation model, you should use the dorsal.api.register_model function.
For a step-by-step tutorial, including adding the model to an annotation pipeline, see Python Guide: Building Annotation Models.
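As a rough sketch of the idea, the snippet below defines a toy annotation model and registers it. The required class attributes, how the file under scan is exposed to the model, and the register_model signature are all assumptions here; follow the guide linked above for the real API.

# Rough sketch of a custom annotation model and its registration.
# The required class attributes, how the scanned file is exposed to the
# model, and the register_model signature are assumptions; see the guide
# linked above for the real API.
from dorsal.api import register_model  # function named in these docs; usage assumed

class WordCountAnnotationModel:  # hypothetical example model
    id = "example/word-count"
    version = "0.1.0"

    def main(self):
        # Count words in a plain-text file and return the annotation record.
        with open(self.file_path, encoding="utf-8") as handle:  # file_path attribute assumed
            text = handle.read()
        return {"word_count": len(text.split())}

register_model(WordCountAnnotationModel)  # extra registration arguments, if any, are assumptions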
Manual Pipeline Edits Replace the Default
If you decide not to use the register_model function and you'd rather edit the config file directly, note that manually defining a [[model_pipeline]] in your dorsal.toml will replace the entire default pipeline; it won't extend it.
This is because the config in memory (the one Dorsal actually uses) is a combination of your global config file (which contains things like the default pipeline, and your auth information) and the project-level config file (which contains most other configurables you might wish to set). By setting [[model_pipeline]] in both, you are forcing Dorsal to choose between two pipelines, and the project-level config always has higher precedence.
To extend the pipeline rather than replace it, first copy the full default pipeline into your project-level dorsal.toml, then append your custom model to that list.
You can find the full default pipeline in the Configuration Reference or in your global config file (e.g., /home/user/.dorsal/dorsal.toml on Linux, or C:\Users\user\.dorsal\dorsal.toml on Windows).