
Custom Annotation Models Part 1: Hello, Word!

The next three chapters are a guide to building your own custom Annotation Models, going from a toy example to a production-grade entity extractor.

  • Part 1 (This chapter): Introduces the Annotation Model. We build a toy model to learn the mechanics of working with the AnnotationModel class, as well as testing and pipeline integration.

  • Part 2: We will build a text classifier for PDF documents. We also ensure our model's output complies with a validation schema.

  • Part 3: We will build an entity extraction model, using Document Question Answering and an open-source LayoutLM model.

What is an Annotation Model?

Each File Record generated using dorsal (e.g. via the CLI or using the LocalFile class) is formed of one or more Annotations.

An Annotation is a structured sub-record generated by a specific Annotation Model.

An Annotation Model is simply a Python class which defines two things:

  1. Rules to extract or derive metadata from a file
  2. The shape of the output record

Dorsal comes with a number of built-in Annotation Models. These form the default Annotation Model Pipeline.

In this guide, we will build our own Annotation Model and add it to the pipeline.

Prerequisites & Setup

To follow along with this tutorial, you will need:

  • A Python Environment: Any IDE (VS Code, PyCharm, etc.) or a Jupyter notebook.
  • A Terminal: your terminal session should be open in the same project directory as your Python environment.
  • dorsal: Instructions to create a virtual environment and install Dorsal are below:

1. Prepare the environment

Open a terminal and run the commands for your platform to create a project directory, set up a virtual environment, and install Dorsal via the dorsalhub package on PyPI.

macOS/Linux:

mkdir annotation-model-guide
cd annotation-model-guide
python3 -m venv venv
source venv/bin/activate
pip install dorsalhub

Windows (PowerShell):

mkdir annotation-model-guide
cd annotation-model-guide
python -m venv venv
.\venv\Scripts\Activate.ps1
pip install dorsalhub

Windows (Command Prompt):

mkdir annotation-model-guide
cd annotation-model-guide
python -m venv venv
.\venv\Scripts\activate.bat
pip install dorsalhub

Note: Keep this terminal open to run dorsal commands.

2. Launch your IDE

Open another terminal window/tab (or use your IDE's integrated terminal) to launch your coding environment.

Jupyter:

cd annotation-model-guide
# Reactivate the venv in this new window
source venv/bin/activate  # or .\venv\Scripts\activate on Windows
pip install jupyterlab
jupyter lab

VS Code:

  1. Launch VS Code.
  2. Open the annotation-model-guide folder (File > Open Folder).
  3. Open a Python file.
  4. Ensure your Python Interpreter (bottom right corner) is set to the venv you just created.

Annotation Model Structure

  • All Annotation Models are classes which inherit from AnnotationModel.

    This base class provides methods to handle logging and error reporting, and at run-time it makes data available to your model from "upstream" nodes in the pipeline.

  • All Annotation Models require a main method.

    Here is the simplest possible (valid) Annotation Model:

    Annotation Model which does nothing.
    from dorsal import AnnotationModel
    
    class DoNothing(AnnotationModel):
        """This model does nothing."""
    
        def main(self):
            return None
    

    This model, if placed in a pipeline, would silently do nothing and be ignored.

    In order to do something, the main method of the model must return something.

HelloWord

For the rest of this guide, we will build a simple text summary model called HelloWord.

This model's job is to count the words that appear in a text.

This simple model will allow us to demonstrate many of the core features of Annotation Models.

The logic at the core of HelloWord is very simple:

  1. Given a text, split it into individual words
  2. Count the words, ignoring any Stop Words: high-frequency words often excluded in NLP tasks (e.g. "in", "the", "and")
  3. Return a dictionary of the top 5 most frequent words in the text, ranked 1 through 5.

Counting the words in a text.

Here is a python script that represents the logic we want to capture:

Script to rank word frequency in a text
from collections import Counter

# Define the words we want to ignore. This is a simplified list.
ENGLISH_STOPWORDS = {"the", "and", "is", "i", "my", "you", "if", "but", "a"}

# Example text: an extract from a 1910 English translation of 'The Art of War' by Sun Tzu
example_text = """
If you know the enemy and know yourself, you need not fear the result of a hundred battles.
If you know yourself but not the enemy, for every victory gained you will also suffer a defeat.
"""

# 1. Normalize and split into words
words = [w.strip(".,’").lower() for w in example_text.split()]

# 2. Filter out stopwords
significant_words = [w for w in words if w not in ENGLISH_STOPWORDS]

# 3. Rank the top 5 words in the text
data = {
    i+1: v[0] 
    for i, v in enumerate(Counter(significant_words).most_common(5))
}

print(data)

If you run this script, it will print the following:

{1: 'know', 2: 'enemy', 3: 'yourself', 4: 'not', 5: 'need'}

Where rank 1 for 'know' tells us that the word 'know' occurs more times in the text than any other.
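A subtlety worth noting: 'enemy', 'yourself', and 'not' all appear twice, so ranks 2 through 4 are tied. Counter.most_common breaks ties by the order in which each element was first counted, which is why the ranking above is deterministic. A quick standalone demonstration:

```python
from collections import Counter

# Elements with equal counts are returned in the order first encountered.
counts = Counter(["know", "enemy", "know", "yourself", "not",
                  "enemy", "yourself", "know", "not", "need"])

print(counts.most_common(4))
# [('know', 3), ('enemy', 2), ('yourself', 2), ('not', 2)]
```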

Now, let's translate this simple logic into an Annotation Model.

The HelloWord model counts the top-k words (5 by default) in any text file:

HelloWord Annotation Model class
from collections import Counter
from dorsal import AnnotationModel
from dorsal.file.helpers import build_generic_record

ENGLISH_STOPWORDS = {
    "the", "and", "is", "in", "it", "of", "to", "a", "as", "that", 
    "which", "by", "its", "or", "be", "but", "not", "with", "for"
}
K = 5

class HelloWord(AnnotationModel):
    """Counts the top-k most common words in a text file."""

    def _get_text(self) -> str | None:
        try:
            with open(self.file_path, 'r', encoding='utf-8') as fp:
                return fp.read()
        except Exception as err:
            self.set_error(f"Failed to read file: {err}")
            return None

    def main(self) -> dict | None:
        text = self._get_text()
        if text is None:
            return None

        words = [w for w in text.lower().split() if w not in ENGLISH_STOPWORDS]
        if not words:
            self.set_error("No significant words found.")
            return None

        data = {str(i+1): v[0] for i, v in enumerate(Counter(words).most_common(K))}
        return build_generic_record(
            description=f"Top {K} most common words",
            data=data
        )

The main method

Let's look more closely at its main method. This is the entry-point for all Annotation Models:

class HelloWord(AnnotationModel):
    ...

    def main(self) -> dict | None:
        text = self._get_text()
        if text is None:
            return None

        words = [w for w in text.lower().split() if w not in ENGLISH_STOPWORDS]
        if not words:
            self.set_error("No significant words found.")
            return None

        data = {str(i+1): v[0] for i, v in enumerate(Counter(words).most_common(K))}
        return build_generic_record(
            description=f"Top {K} most common words",
            data=data
        )

The goal of any Annotation Model's main method is to return a fully-formed Annotation Record - a schema-validated dictionary which describes something about the file.

  • The HelloWord model returns a dictionary which conforms to the open/generic validation schema (this is a permissive schema, allowing flat dictionaries with arbitrary key names, so it is suitable for this toy example).

  • If the main method cannot return an Annotation Record for any reason (e.g. it failed to read the file) it should return None.

  • The HelloWord model has two possible fail conditions:

    1. Text extraction fails (_get_text returns None)
    2. There are no significant words in the text.
  • Whenever the main method returns None, we should set an error message using the set_error method. This helps the pipeline keep track of what went wrong.

Helper methods and functions

To keep main readable, it's often a good idea to move any self-contained logic to helper methods and functions.

  • The _get_text method accesses the file via the file_path attribute.

  • It attempts to retrieve the content of the file as a single long string. If it fails, it sets an error and returns None:

class HelloWord(AnnotationModel):
    ...

    def _get_text(self) -> str | None:
        try:
            with open(self.file_path, 'r', encoding='utf-8') as fp:
                return fp.read()
        except Exception as err:
            self.set_error(f"Failed to read file: {err}")
            return None

Available File Attributes

When the pipeline runs your model, it automatically populates several instance-level attributes (i.e. on self) before calling main(). You can access these in your code:

  • self.file_path (str): The absolute path to the file on disk.
  • self.media_type (str): The detected Media Type of the file (e.g. application/pdf, text/plain).
  • self.hash (str): The SHA-256 hash of the file.
  • self.size (int): The file size in bytes.
  • self.name (str): The filename (e.g. document.txt).
  • self.extension (str): The file extension (e.g. .txt).
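As a quick illustration, a model could combine these attributes to make early decisions, for example deriving the major media type the same way the file/base record's media_type_prefix does. The values below are hypothetical stand-ins for what the pipeline would populate on self:

```python
# Hypothetical values, standing in for self.media_type and self.size
media_type = "text/plain"
size = 541171

# The major type ("text", "application", etc.) is everything before the slash
media_type_prefix = media_type.split("/", 1)[0]

# A model might use these to bail out early on unsupported files
if media_type_prefix != "text":
    print("skip: not a text file")
elif size == 0:
    print("skip: empty file")
else:
    print("proceed")  # prints "proceed" for the values above
```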

Summary

  1. An Annotation Model is a Python class inheriting from AnnotationModel. Its job is to create a structured metadata dictionary - an Annotation Record - for a file.

    • HelloWord uses simple logic to create an annotation record.
    • The annotation record it outputs is a dictionary, ranking words by how frequently they appear in a text.
  2. The main method is the model's entry and exit point. Most of the time you'll read the file via the file_path attribute and use main to infer/generate the returned annotation record.

    • HelloWord uses a _get_text helper method to read the file content safely.
    • HelloWord sets a coherent error message and returns None if reading fails.
  3. The return value from the main method must be a dictionary. To be validated, it must conform to a known validation schema.

    • HelloWord creates an Annotation Record which validates against the open/generic annotation schema.
    • The build_generic_record helper makes it easy to build records which match the schema.
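Judging by the records shown in this guide, an open/generic record is a flat structure with a description and a data mapping. The sketch below is a hypothetical stand-in for build_generic_record, showing only the shape of its output, not the helper's actual implementation:

```python
def build_generic_record_sketch(description: str, data: dict) -> dict:
    """Hypothetical stand-in: shows the shape of an open/generic record."""
    return {"description": description, "data": data}

record = build_generic_record_sketch(
    description="Top 5 most common words",
    data={"1": "money", "2": "gold", "3": "value"},
)
print(record["description"])  # Top 5 most common words
```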

Testing the Model

  • Before we add the model to our pipeline, we should test it to make sure it works.

  • Import the run_model function from dorsal.testing.

  • run_model simulates running our model in a pipeline. We can point it at a text file on our system and it will give us a fully-formed schema-compliant annotation.

  • Let's use it to find out what the most common words are in a 1904 English translation of A Contribution to the Critique of Political Economy by Karl Marx:

Testing an Annotation Model
from dorsal.testing import run_model

result = run_model(
    annotation_model=HelloWord,
    file_path="./test/books/pg46423.txt",
    schema_id="open/generic"
)

print(result.model_dump_json(indent=2))
  • Notice how we provide our Annotation Model class as an argument to the function (we don't need to instantiate it ourselves).

  • We also provide the file_path, pointing to the file we want to test on, and a schema_id naming the schema we want to validate against.

{
  "name": "HelloWord",
  "source": {
    "type": "Model",
    "model": "HelloWord",
    "version": null,
    "variant": null
  },
  "record": {
    "data": {
      "1": "money",
      "2": "gold",
      "3": "value",
      "4": "exchange",
      "5": "commodities"
    },
    "description": "Top 5 most common words"
  },
  "schema_id": "open/generic",
  "schema_version": "1.0.0",
  "time_taken": 0.03382979903835803,
  "error": null
}

The result returned by run_model is a RunModelResult instance, which has the following attributes:

  • name: The name of the model
  • source: An object which provides more info about the model. These are class variables you can set in your Annotation Model.
  • record: The annotation record itself
  • schema_id: The name of the schema which validated the record
  • time_taken: The execution time for the model
  • error: If there was a problem, this field will contain the error message set by the model's set_error method.

Dependencies

run_model also allows us to simulate pipeline dependencies.

Dependencies are configuration values which tell the pipeline when to run the model.

Since our HelloWord model only works with text files, let's create a dependency which prevents it from running when any other kind of file is processed by the pipeline:

Testing an Annotation Model with dependencies
from dorsal.file.dependencies import make_media_type_dependency
from dorsal.testing import run_model

dependency = make_media_type_dependency(include=["text"])
dependencies = [dependency]  # We are only creating one, but the function expects a sequence

result = run_model(
    annotation_model=HelloWord,
    file_path="./test/documents/expenses.pdf",
    dependencies=dependencies,
    schema_id="open/generic"
)

print(result.model_dump_json(indent=2))
  • Using make_media_type_dependency we can provide full or partial media types e.g. application, application/pdf, text, text/plain, etc.

  • In this example, make_media_type_dependency(include=["text"]) means the model will only run on documents whose media type has the text prefix (e.g. text/plain, text/xml) and will skip everything else.
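The matching behaviour can be pictured as a simple prefix check. The function below is an illustration of the idea only, not dorsal's actual dependency logic:

```python
def media_type_matches(media_type: str, include: list[str]) -> bool:
    """Illustrative: a pattern matches if it equals the media type
    exactly, or is a major-type prefix like 'text' or 'application'."""
    return any(
        media_type == pattern or media_type.startswith(pattern + "/")
        for pattern in include
    )

print(media_type_matches("text/plain", ["text"]))                  # True
print(media_type_matches("application/pdf", ["text"]))             # False
print(media_type_matches("application/pdf", ["application/pdf"]))  # True
```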

  • In the example above, which runs the model on a PDF document, the model correctly skips the file:

{
  "name": "HelloWord",
  "source": {
    "type": "Model",
    "model": "HelloWord",
    "version": null,
    "variant": null
  },
  "record": null,
  "schema_id": "open/generic",
  "time_taken": null,
  "error": "Skipped: Dependency not met: media_type"
}

See Reference: Annotation Models for more information on dependencies.

Adding it to the Pipeline

Once you're satisfied your model works, you can add it to the Annotation Model Pipeline.

Before we modify it, let's look at the default pipeline. We can do this in python by calling show_model_pipeline, or in the dorsal CLI by running dorsal config pipeline show:

╭───────────────────────────────────────────── Annotation Model Pipeline ────────────────────────────────────────────────╮
│   Idx    Status        Model Name                    Module                           Schema ID         Dependencies   │
│     0    Default    FileCoreAnnotationModel          dorsal.file.annotation_models    file/base         None           │
│     1    Active     MediaInfoAnnotationModel         dorsal.file.annotation_models    file/mediainfo    media_type     │
│     2    Active     PDFAnnotationModel               dorsal.file.annotation_models    file/pdf          media_type     │
│     3    Active     EbookAnnotationModel             dorsal.file.annotation_models    file/ebook        media_type     │
│     4    Active     OfficeDocumentAnnotationModel    dorsal.file.annotation_models    file/office       media_type     │
╰────────────────────────────────────────────────── Total Models: 5 ─────────────────────────────────────────────────────╯

To add our model to the pipeline, we use the register_model function from dorsal.api:

Model must be importable

If you have been following this tutorial so far in Jupyter, make sure to move your HelloWord class to a .py file before registering it to the pipeline.

Registering a model copies its import path to your project config file, so it must be defined outside of __main__.

e.g. from helloword import HelloWord where helloword.py is in the same directory as your main script/notebook.

Registering a Model to the Annotation Model Pipeline
from dorsal.api import register_model
from dorsal.file.dependencies import make_media_type_dependency
from helloword import HelloWord  # The model must be importable to be registered

dependencies = [
    make_media_type_dependency(include=["text"])
]

register_model(
    annotation_model=HelloWord,
    schema_id="open/generic",
    dependencies=dependencies
)

The register_model function doesn't produce an output, but if you have logging enabled in your environment you will see something like:

INFO:dorsal.file.pipeline_config:Appended new model 'helloword.HelloWord' to project config.

Then if you call show_model_pipeline again (or run dorsal config pipeline show in the CLI), you'll see HelloWord has been added:

╭───────────────────────────────────────────── Annotation Model Pipeline ────────────────────────────────────────────────╮
│   Idx    Status        Model Name                    Module                           Schema ID         Dependencies   │
│     0    Default    FileCoreAnnotationModel          dorsal.file.annotation_models    file/base         None           │
│     1    Active     MediaInfoAnnotationModel         dorsal.file.annotation_models    file/mediainfo    media_type     │
│     2    Active     PDFAnnotationModel               dorsal.file.annotation_models    file/pdf          media_type     │
│     3    Active     EbookAnnotationModel             dorsal.file.annotation_models    file/ebook        media_type     │
│     4    Active     OfficeDocumentAnnotationModel    dorsal.file.annotation_models    file/office       media_type     │
│     5    Active     HelloWord                        helloword                        open/generic      media_type     │
╰────────────────────────────────────────────────── Total Models: 6 ─────────────────────────────────────────────────────╯

Testing the pipeline

Now that it's integrated, every relevant file you process with Dorsal (e.g. dorsal file scan in the CLI, or LocalFile to extract metadata in python) will include an annotation record generated and added by our model:

from dorsal import LocalFile

lf = LocalFile("./test/books/pg46423.txt", overwrite_cache=True)
print(lf.to_json())

Prints the full File Record including our new model's annotation:

{
  "hash": "b8325d65df9d5570a23f09c83efe6ef1a61031b178622284093293898ec96168",
  "validation_hash": "7a262c6ede2fde86442708ce7235517cd6e31c7ccbbbb9a2c119f51b67c4e059",
  "annotations": {
    "file/base": {
      "record": {
        "hash": "b8325d65df9d5570a23f09c83efe6ef1a61031b178622284093293898ec96168",
        "name": "pg46423.txt",
        "extension": ".txt",
        "size": 541171,
        "media_type": "text/plain",
        "media_type_prefix": "text"
      },
      "source": {
        "type": "Model",
        "model": "dorsal/base",
        "version": "1.0.0"
      }
    },
    "open/generic": {
      "record": {
        "description": "Top 5 most common words",
        "data": {
          "1": "money",
          "2": "gold",
          "3": "value",
          "4": "exchange",
          "5": "commodities"
        }
      },
      "private": true,
      "source": {
        "type": "Model",
        "model": "HelloWord"
      },
      "schema_version": "1.0.0"
    }
  },
  "tags": [],
  "source": "disk",
  "local_attributes": {
    "date_modified": "2025-11-18 13:53:58.005877+00:00",
    "date_accessed": "2025-11-25 08:39:42.361173+00:00",
    "date_created": "2025-11-18 13:53:58.332433+00:00",
    "file_path": "/dev/test/books/pg46423.txt",
    "file_size_bytes": 541171,
    "file_permissions_mode": 33279,
    "inode": 45035996274366182,
    "number_of_links": 1
  }
}
dorsal file scan ./test/books/pg46423.txt --overwrite-cache

Displays the full File Record including our new model's annotation:

📄 Scanning metadata for pg46423.txt
╭─────────────────────── File Record: pg46423.txt (from cache) ────────────────────────╮
│                                                                                      │
│  Hashes                                                                              │
│       SHA-256:  b8325d65df9d5570a23f09c83efe6ef1a61031b178622284093293898ec96168     │
│        BLAKE3:  7a262c6ede2fde86442708ce7235517cd6e31c7ccbbbb9a2c119f51b67c4e059     │
│                                                                                      │
│  File Info                                                                           │
│     Full Path:  /dev/test/books/pg46423.txt                                          │
│      Modified:  2025-11-18 13:53:58                                                  │
│          Name:  pg46423.txt                                                          │
│          Size:  528 KiB                                                              │
│    Media Type:  text/plain                                                           │
│                                                                                      │
│  Tags                                                                                │
│        No tags found.                                                                │
│                                                                                      │
│  Generic Info                                                                        │
│       file_hash:  b8325d65df9d5570a23f09c83efe6ef1a61031b178622284093293898ec96168   │
│     description:  Top 5 most common words                                            │
│            data:                                                                     │
│               1:  money                                                              │
│               2:  gold                                                               │
│               3:  value                                                              │
│               4:  exchange                                                           │
│               5:  commodities                                                        │
│                                                                                      │
│                                                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────╯

Overwrite Cache

In the example above, notice how we pass the argument --overwrite-cache (CLI) / overwrite_cache=True (Python).

This is because the local Dorsal File Record Cache already contains an entry from earlier.

To avoid simply returning that stale cached record, the --overwrite-cache/overwrite_cache argument does two things:

  • Forces a full run of the Annotation Model Pipeline on the file, using our new model
  • Writes-back the updated record to the cache

That way, when we retrieve it again without passing --overwrite-cache/overwrite_cache, we get the most up to date result.

Cleanup

Since HelloWord is just a toy model, you probably don't want it running on your files forever.

You can remove it from your pipeline using the CLI command dorsal config pipeline remove:

dorsal config pipeline remove HelloWord

Alternatively, call the remove_model_by_name python function:

from dorsal.api import remove_model_by_name

remove_model_by_name("HelloWord")

Note: these commands remove the entry for the new model from your project-level dorsal.toml config file, but do not modify the content of the python file containing your model class.

Summary

You have now successfully:

  1. Created a custom Annotation Model.
  2. Tested it in isolation using run_model.
  3. Registered it to the pipeline.
  4. Tested it in the pipeline.
  5. Cleaned up your environment.

In the next part of this series, we'll build a classification model.

➡️ Continue to 5. Custom Annotation Models Part 2: Classification