Custom Annotation Models Part 1: Hello, Word!
The next three chapters are a guide to building your own custom Annotation Models, going from a toy example to a production-grade entity extractor.
- Part 1 (this chapter): Introduces the Annotation Model. We build a toy model to learn the mechanics of working with the `AnnotationModel` class, as well as testing and pipeline integration.
- Part 2: We will build a text classifier for PDF documents. We also ensure our model's output complies with a validation schema.
- Part 3: We will build an entity extraction model, using Document Question Answering and an open-source LayoutLM model.
What is an Annotation Model?
Each File Record generated using dorsal (e.g. via the CLI or using the LocalFile class) is formed of one or more Annotations.
An Annotation is a structured sub-record generated by a specific Annotation Model.
An Annotation Model is simply a Python class which defines two things:
- Rules to extract or derive metadata from a file
- The shape of the output record
Dorsal comes with a number of built-in Annotation Models. These form the default Annotation Model Pipeline.
In this guide, we will build our own Annotation Model and add it to the pipeline.
Prerequisites & Setup
To follow along with this tutorial, you will need:
- A Python Environment: Any IDE (VS Code, PyCharm, etc.) or a Jupyter notebook.
- A Terminal: your terminal session should be open in the same project directory as your Python environment.
- Dorsal: Instructions to create a virtual environment and install Dorsal are below.
1. Prepare the environment
Open a terminal and paste the following to create a directory and install Dorsal via the dorsalhub package on PyPI.
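A typical setup might look like the following sketch. The folder name matches the one used later in this guide, and the package name `dorsalhub` is the PyPI package mentioned above; adjust the activation command for your shell.

```shell
# Create a project folder and enter it
mkdir annotation-model-guide
cd annotation-model-guide

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate

# Install Dorsal from PyPI
pip install dorsalhub
```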
Note: Keep this terminal open to run `dorsal` commands.
2. Launch your IDE
Open another terminal window/tab (or use your IDE's integrated terminal) to launch your coding environment.
- Launch VS Code.
- Open the `annotation-model-guide` folder (File > Open Folder).
- Open a Python file.
- Ensure your Python Interpreter (bottom right corner) is set to the `venv` you just created.
Annotation Model Structure
- All Annotation Models are classes which inherit from `AnnotationModel`. This base class provides methods to handle logging and error reporting, and at run-time it makes data available to your model from "upstream" nodes in the pipeline.
- All Annotation Models require a `main` method.

Here is the simplest possible (valid) Annotation Model:
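To keep this sketch self-contained, a stand-in base class is defined inline; in a real project you would import `AnnotationModel` from Dorsal instead (see the Dorsal docs for the exact import path).

```python
# Stand-in for Dorsal's AnnotationModel base class, so this sketch runs on
# its own. In a real project, import the real base class from dorsal.
class AnnotationModel:
    def set_error(self, message):
        self.error = message


class DoNothing(AnnotationModel):
    def main(self):
        # Returning None means "no annotation produced"; the pipeline
        # would record nothing for this model.
        return None
```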
Annotation Model which does nothing. This model, if placed in a pipeline, would silently do nothing and be ignored.
For the model to do anything useful, its `main` method must return something.
HelloWord
For the rest of this guide, we will build a simple text summary model called `HelloWord`.
This model's job is to count the words that appear in a text.
This simple model will allow us to demonstrate many of the core features of Annotation Models.
The logic at the core of `HelloWord` is very simple:
- Given a text, split it into individual words.
- Count the words, ignoring any Stop Words. These are high-frequency words, often excluded in NLP tasks (e.g. "in", "the", "and").
- Return a dictionary of the top 5 most frequent words in the text, ranked 1 through 5.
Counting the words in a text.
Here is a Python script that represents the logic we want to capture:

```python
from collections import Counter

# 1. Define the words we want to ignore. This is a simplified list.
ENGLISH_STOPWORDS = {"the", "and", "is", "i", "my", "you", "if", "but", "a"}

# 2. An extract from a 1910 English translation of 'The Art of War' by Sun Tzu
example_text = """
If you know the enemy and know yourself, you need not fear the result of a hundred battles.
If you know yourself but not the enemy, for every victory gained you will also suffer a defeat.
"""

# 3. Normalize and split into words
words = [w.strip(".,’").lower() for w in example_text.split()]

# 4. Filter out stopwords
significant_words = [w for w in words if w not in ENGLISH_STOPWORDS]

# 5. Rank the top 5 words in the text
data = {
    i + 1: v[0]
    for i, v in enumerate(Counter(significant_words).most_common(5))
}

print(data)
```
If you run this script, it prints `{1: 'know', 2: 'enemy', 3: 'yourself', 4: 'not', 5: 'need'}`, where rank 1 for 'know' tells us that the word 'know' occurs more times in the text than any other.
Now, let's translate this simple logic into an Annotation Model.
The `HelloWord` model counts the top-k words (5 by default) in any text file:
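A sketch of `HelloWord` is below. To keep it self-contained, stand-ins replace the real `AnnotationModel` base class and the `build_generic_record` helper from `dorsal.file.helpers`; in a real project you would import both from Dorsal.

```python
from collections import Counter

# --- Stand-ins so this sketch is self-contained. In a real project, import
# AnnotationModel from Dorsal and build_generic_record from dorsal.file.helpers.
class AnnotationModel:
    def set_error(self, message):
        self.error = message


def build_generic_record(data, description=None):
    return {"data": data, "description": description}


ENGLISH_STOPWORDS = {"the", "and", "is", "i", "my", "you", "if", "but", "a"}


class HelloWord(AnnotationModel):
    """Ranks the top-k most frequent significant words in a text file."""

    top_k = 5  # number of ranked words to return

    def main(self):
        text = self._get_text()
        if text is None:
            return None  # error message already set by _get_text

        words = [w.strip(".,").lower() for w in text.split()]
        significant = [w for w in words if w not in ENGLISH_STOPWORDS]
        if not significant:
            self.set_error("No significant words found in text")
            return None

        data = {
            i + 1: word
            for i, (word, _count) in enumerate(
                Counter(significant).most_common(self.top_k)
            )
        }
        return build_generic_record(
            data=data, description=f"Top {self.top_k} most common words"
        )

    def _get_text(self):
        try:
            with open(self.file_path, encoding="utf-8") as f:
                return f.read()
        except OSError as exc:
            self.set_error(f"Failed to read file: {exc}")
            return None
```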
The main method
Let's look more closely at its `main` method. This is the entry point for all Annotation Models.
The goal of any Annotation Model's `main` method is to return a fully formed Annotation Record: a schema-validated dictionary which describes something about the file.
- The `HelloWord` model returns a dictionary which conforms to the `open/generic` validation schema. (This is a permissive schema, allowing flat dictionaries with arbitrary key names, so it is suitable for this toy example.)
- If the `main` method cannot return an Annotation Record for any reason (e.g. it failed to read the file), it should return `None`.
- The `HelloWord` model has two possible fail conditions:
  - Text extraction fails (`_get_text` returns `None`).
  - There are no significant words in the text.
- Whenever the `main` method returns `None`, we should set an error message using the `set_error` method. This helps the pipeline keep track of what went wrong.
Helper methods and functions
To keep `main` readable, it's often a good idea to move any self-contained logic to helper methods and functions.
- The `_get_text` method accesses the file via the `file_path` attribute.
- It attempts to retrieve the content of the file as a single long string. If it fails, it sets an error and returns `None`.
- The `build_generic_record` function is included in the `dorsal.file.helpers` module.
- `build_generic_record` helps make a well-formatted `open/generic` dictionary, which is what we want to return from our `main`.
- We must provide it with a flat `data` dictionary and an (optional) `description` field to satisfy the validation requirements of the `open/generic` schema.
Available File Attributes
When the pipeline runs your model, it automatically populates several instance-level (i.e. on `self`) attributes before calling `main()`. You can access these in your code:

- `self.file_path` (str): The absolute path to the file on disk.
- `self.media_type` (str): The detected Media Type of the file (e.g. `application/pdf`, `text/plain`).
- `self.hash` (str): The SHA-256 hash of the file.
- `self.size` (int): The file size in bytes.
- `self.name` (str): The filename (e.g. `document.txt`).
- `self.extension` (str): The file extension (e.g. `.txt`).
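For instance, a model's `main` can branch on these attributes. This is a sketch with a stand-in base class, since Dorsal's real one isn't imported here:

```python
# Stand-in for Dorsal's AnnotationModel base class (illustration only).
class AnnotationModel:
    def set_error(self, message):
        self.error = message


class TextOnly(AnnotationModel):
    def main(self):
        # media_type, size and name are populated by the pipeline before
        # main() is called; here they are read straight off self.
        if not self.media_type.startswith("text/"):
            self.set_error(f"Unsupported media type: {self.media_type}")
            return None
        return {"data": {"bytes": self.size, "name": self.name}}
```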
Summary
- An Annotation Model is a Python class inheriting from `AnnotationModel`. Its job is to create a structured metadata dictionary, an Annotation Record, for a file.
  - `HelloWord` uses simple logic to create an annotation record.
  - The annotation record it outputs is a dictionary, ranking words by how frequently they appear in a text.
- The `main` method is the model's entry and exit point. Most of the time you'll read a file via the `file_path` attribute and do something in your `main` to infer/generate the returned annotation record.
  - `HelloWord` uses a `_get_text` helper method to read the file content safely.
  - `HelloWord` sets a coherent error message and returns `None` if reading fails.
- The return value from the `main` method must be a dictionary. To be validated, it must conform to a known validation schema.
  - `HelloWord` creates an Annotation Record which validates against the `open/generic` annotation schema.
  - The `build_generic_record` helper makes it easy to build records which match the schema.
Testing the Model
- Before we add the model to our pipeline, we should test it to make sure it works.
- Import the `run_model` function from `dorsal.testing`.
- `run_model` simulates running our model in a pipeline. We can point it at a text file on our system and it will give us a fully-formed, schema-compliant annotation.
- Let's use it to find out what the most common words are in a 1904 English translation of A Contribution to the Critique of Political Economy by Karl Marx:
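To illustrate the shape of the call, here is a self-contained sketch with a stand-in `run_model` and a trivial stand-in model; in your project you would instead import the real function (`from dorsal.testing import run_model`) and pass your `HelloWord` class.

```python
import os
import tempfile
import time

# Stand-in for dorsal.testing.run_model (illustration only): it instantiates
# the model class, points it at the file, calls main(), and wraps the result.
# The real function also validates the record against the named schema.
def run_model(model_cls, file_path, schema_id):
    model = model_cls()
    model.file_path = file_path  # normally populated by the pipeline
    start = time.perf_counter()
    record = model.main()
    return {
        "name": model_cls.__name__,
        "record": record,
        "schema_id": schema_id,
        "time_taken": time.perf_counter() - start,
        "error": getattr(model, "error", None),
    }


# A trivial model standing in for HelloWord, so the sketch runs on its own.
class SizeModel:
    def main(self):
        return {"data": {"bytes": os.path.getsize(self.file_path)}}


# Point the stand-in at a throwaway text file.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("hello word")

result = run_model(SizeModel, file_path=path, schema_id="open/generic")
print(result["name"], result["record"])  # SizeModel {'data': {'bytes': 10}}
os.remove(path)
```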
- Notice how we provide our Annotation Model class as an argument to the function (we don't need to instantiate it ourselves).
- We also provide the `file_path`, pointing to the file we want to test on, and a `schema_id` naming the schema we want to validate against.
```json
{
  "name": "HelloWord",
  "source": {
    "type": "Model",
    "model": "HelloWord",
    "version": null,
    "variant": null
  },
  "record": {
    "data": {
      "1": "money",
      "2": "gold",
      "3": "value",
      "4": "exchange",
      "5": "commodities"
    },
    "description": "Top 5 most common words"
  },
  "schema_id": "open/generic",
  "schema_version": "1.0.0",
  "time_taken": 0.03382979903835803,
  "error": null
}
```
The result returned by `run_model` is a `RunModelResult` instance, which has the following attributes:

- `name`: The name of the model.
- `source`: An object which provides more info about the model. These are class variables you can set in your Annotation Model.
- `record`: The annotation record itself.
- `schema_id`: The name of the schema which validated the record.
- `time_taken`: The execution time for the model.
- `error`: If there was a problem, this field will contain the error message set by the model's `set_error` method.
Dependencies
`run_model` also allows us to simulate pipeline dependencies.
Dependencies are configuration values which tell the pipeline when to run the model.
Since our `HelloWord` model only works with text files, let's create a dependency which prevents it from running when any other kind of file is processed by the pipeline:
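The matching behaviour can be sketched like this. The real `make_media_type_dependency` comes from Dorsal; the stand-in below only illustrates how full or partial media types are matched:

```python
# Stand-in illustrating the matching behaviour of a media-type dependency.
# It shows how a partial media type like "text" matches "text/plain",
# "text/xml", etc., while full types must match exactly.
def make_media_type_dependency(include):
    def is_met(media_type):
        return any(
            media_type == prefix or media_type.startswith(prefix + "/")
            for prefix in include
        )
    return is_met


dependency = make_media_type_dependency(include=["text"])
print(dependency("text/plain"))       # True: matches the "text" prefix
print(dependency("application/pdf"))  # False: the model would be skipped
```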
- Using `make_media_type_dependency` we can provide full or partial media types, e.g. `application`, `application/pdf`, `text`, `text/plain`, etc.
- In this example, `make_media_type_dependency(include=["text"])` means the model will ignore all documents where the media type is not `text` (e.g. `text/plain`, `text/xml`, etc.).
- The example above, which runs the model on a PDF document, is correctly skipped:
```json
{
  "name": "HelloWord",
  "source": {
    "type": "Model",
    "model": "HelloWord",
    "version": null,
    "variant": null
  },
  "record": null,
  "schema_id": "open/generic",
  "time_taken": null,
  "error": "Skipped: Dependency not met: media_type"
}
```
See Reference: Annotation Models for more information on dependencies.
Adding it to the Pipeline
Once you're satisfied your model works, you can add it to the Annotation Model Pipeline.
Before we modify it, let's look at the default pipeline. We can do this in Python by calling `show_model_pipeline`, or in the Dorsal CLI by running `dorsal config pipeline show`:
```text
╭───────────────────────────────────────────── Annotation Model Pipeline ────────────────────────────────────────────────╮
│ Idx  Status   Model Name                     Module                         Schema ID       Dependencies               │
│ 0    Default  FileCoreAnnotationModel        dorsal.file.annotation_models  file/base       None                       │
│ 1    Active   MediaInfoAnnotationModel       dorsal.file.annotation_models  file/mediainfo  media_type                 │
│ 2    Active   PDFAnnotationModel             dorsal.file.annotation_models  file/pdf        media_type                 │
│ 3    Active   EbookAnnotationModel           dorsal.file.annotation_models  file/ebook      media_type                 │
│ 4    Active   OfficeDocumentAnnotationModel  dorsal.file.annotation_models  file/office     media_type                 │
╰────────────────────────────────────────────────── Total Models: 5 ─────────────────────────────────────────────────────╯
```
To add our model to the pipeline, we use the `register_model` function from `dorsal.api`:
Model must be importable
If you have been following this tutorial so far in Jupyter, make sure to move your `HelloWord` class to a `.py` file before registering it to the pipeline.
Registering a model copies its import path to your project config file, so it must be defined outside of `__main__`.
e.g. `from helloword import HelloWord`, where `helloword.py` is in the same directory as your main script/notebook.
The `register_model` function doesn't produce an output, but if you have logging enabled in your environment you will see a confirmation message.
Then if you call `show_model_pipeline` again (or run `dorsal config pipeline show` in the CLI), you'll see `HelloWord` has been added:
```text
╭───────────────────────────────────────────── Annotation Model Pipeline ────────────────────────────────────────────────╮
│ Idx  Status   Model Name                     Module                         Schema ID       Dependencies               │
│ 0    Default  FileCoreAnnotationModel        dorsal.file.annotation_models  file/base       None                       │
│ 1    Active   MediaInfoAnnotationModel       dorsal.file.annotation_models  file/mediainfo  media_type                 │
│ 2    Active   PDFAnnotationModel             dorsal.file.annotation_models  file/pdf        media_type                 │
│ 3    Active   EbookAnnotationModel           dorsal.file.annotation_models  file/ebook      media_type                 │
│ 4    Active   OfficeDocumentAnnotationModel  dorsal.file.annotation_models  file/office     media_type                 │
│ 5    Active   HelloWord                      helloword                      open/generic    media_type                 │
╰────────────────────────────────────────────────── Total Models: 6 ─────────────────────────────────────────────────────╯
```
Testing the pipeline
Now that it's integrated, every relevant file you process with Dorsal (e.g. `dorsal file scan` in the CLI, or `LocalFile` to extract metadata in Python) will include an annotation record generated and added by our model:
```python
from dorsal import LocalFile

lf = LocalFile("./test/books/pg46423.txt", overwrite_cache=True)
print(lf.to_json())
```
This prints the full File Record, including our new model's annotation:
```text
📄 Scanning metadata for pg46423.txt
╭─────────────────────── File Record: pg46423.txt (from cache) ────────────────────────╮
│                                                                                      │
│ Hashes                                                                               │
│   SHA-256: b8325d65df9d5570a23f09c83efe6ef1a61031b178622284093293898ec96168          │
│   BLAKE3:  7a262c6ede2fde86442708ce7235517cd6e31c7ccbbbb9a2c119f51b67c4e059          │
│                                                                                      │
│ File Info                                                                            │
│   Full Path:  /dev/test/books/pg46423.txt                                            │
│   Modified:   2025-11-18 13:53:58                                                    │
│   Name:       pg46423.txt                                                            │
│   Size:       528 KiB                                                                │
│   Media Type: text/plain                                                             │
│                                                                                      │
│ Tags                                                                                 │
│   No tags found.                                                                     │
│                                                                                      │
│ Generic Info                                                                         │
│   file_hash:   b8325d65df9d5570a23f09c83efe6ef1a61031b178622284093293898ec96168      │
│   description: Top 5 most common words                                               │
│   data:                                                                              │
│     1: money                                                                         │
│     2: gold                                                                          │
│     3: value                                                                         │
│     4: exchange                                                                      │
│     5: commodities                                                                   │
│                                                                                      │
│                                                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────╯
```
Overwrite Cache
In the example above, notice how we pass the argument `--overwrite-cache` (CLI) / `overwrite_cache=True` (Python).
This is because the local Dorsal File Record Cache already contains an entry from earlier.
To avoid simply retrieving the cached record, the `--overwrite-cache`/`overwrite_cache` argument does two things:

- Forces a full run of the Annotation Model Pipeline on the file, using our new model
- Writes back the updated record to the cache

That way, when we retrieve it again without passing `--overwrite-cache`/`overwrite_cache`, we get the most up-to-date result.
Cleanup
Since `HelloWord` is just a toy model, you probably don't want it running on your files forever.
You can remove it from your pipeline using the CLI command `dorsal config pipeline remove`.
Alternatively, call the `remove_model_by_name` Python function.
Note: these commands remove the entry for the new model from your project-level `dorsal.toml` config file, but do not modify the content of the Python file containing your model class.
Summary
You have now successfully:
- Created a custom Annotation Model.
- Tested it in isolation using `run_model`.
- Registered it to the pipeline.
- Tested it in the pipeline.
- Cleaned up your environment.
In the next part of this series, we'll build a classification model.
➡️ Continue to 5. Custom Annotation Models Part 2: Classification