
Custom Annotation Models Part 2: Classification

This is Part 2 of a 3-part series on building your own Annotation Models.

In Part 1: Hello, Word!:

  • We were introduced to the AnnotationModel class
  • We built a simple model, subclassing AnnotationModel, to count words in a text file
  • We tested the model with dependencies using the dorsal.testing module
  • We registered the model to our Annotation Model Pipeline
  • Finally, we removed the model (because it's not very useful to keep)

In this chapter, we will build an Annotation Model which labels PDF documents based on their content.


SensitiveDocumentScanner

We are going to build a simple PDF document classification model called SensitiveDocumentScanner.

This model adds labels to documents if any "sensitive" key words and phrases (like "confidential" or "internal use only") are found in the text.

It outputs an open/classification annotation record - a suitable format for storing labels for documents.

Here is how a schema-compliant annotation record might look for a document processed by SensitiveDocumentScanner:

{
  "labels": [
    {
      "label": "Internal"
    }
  ],
  "vocabulary": [
    "Confidential",
    "Internal"
  ]
}
  • This is a "positive" example (meaning one or more labels were applied)
  • We include the optional vocabulary field as it makes the annotation self-contained and understandable.

The logic that SensitiveDocumentScanner captures is very simple:

  • Define a set of key words and phrases and their associated labels (e.g. "internal use only" -> Internal)
  • Scan the text of a PDF document, and look for any of these terms
  • If any are found, apply the relevant label to the document

This approach is quite general and can be adapted to many use-cases.

For simplicity, this guide uses a basic substring check, but it could easily be adapted to more sophisticated approaches.


Here is the complete code for the model:

Sensitive Document Scanner Annotation Model
from dorsal import AnnotationModel
from dorsal.file.helpers import build_classification_record
from dorsal.file.preprocessing import extract_pdf_text

SENSITIVE_LABELS = {
    "Confidential": ["confidential", "do not distribute", "private"],
    "Internal": ["internal use only", "proprietary"],
}

class SensitiveDocumentScanner(AnnotationModel):
    id: str = "github:dorsalhub/annotation-model-examples"
    version: str = "1.0.0"

    def main(self) -> dict | None:
        try:
            pages = extract_pdf_text(self.file_path)
        except Exception as err:
            self.set_error(f"Failed to parse PDF: {err}")
            return None

        matches = set()
        for text in pages:
            text = text.lower()
            for label, keywords in SENSITIVE_LABELS.items():
                if any(k in text for k in keywords):
                    matches.add(label)

        return build_classification_record(
            labels=list(matches),
            vocabulary=list(SENSITIVE_LABELS.keys())
        )
Schema Focus: open/classification

Unlike tagging, which simply attaches a label to a file, classification tasks usually involve well-defined constraints.

The open/classification schema represents both the result (the applied labels) and the context (e.g. possible labels, confidence in the result).

By treating classification as a structured record rather than a loose collection of keywords, we ensure downstream systems can reliably interpret the data - whether it comes from a simple script, a human-in-the-loop workflow, or a machine learning model.

We can see these advantages in action when we apply the schema to our SensitiveDocumentScanner:

  1. Vocabulary:

    open/classification supports the vocabulary field - a list of all possible labels for the classification task.

    This removes ambiguity, makes positive results self-contained, and makes even a negative result meaningful:

    {
      "labels": [],
      "vocabulary": ["Confidential", "Internal"] 
    }
    
    This record tells us: "We checked for 'Confidential' and 'Internal' markers, and they are NOT present."

  2. Confidence Score:

    open/classification labels support confidence scoring (how certain the model is) via the score and score_explanation fields.

    Our string-matching approach in SensitiveDocumentScanner either finds a keyword or it doesn't: in this context a "confidence score" is difficult to define.

    However, if instead we were to use a pre-trained machine learning model for our inference, including a probabilistic score adds value for downstream consumers of the annotation:

    {
      "score_explanation": "Model confidence probability [0, 1]",
      "labels": [
        { "label": "Internal", "score": 0.98 } 
      ]
    }
    
    This standardization allows downstream systems to support any classification result, regardless of task or model complexity, without requiring code changes.

  3. Attributes

    The schema supports model-specific metadata via the attributes object, allowing additional context within the existing validation structure.

    While SensitiveDocumentScanner applies a label to the whole file, we could easily extend it to indicate where the sensitive content was found:

    {
      "labels": [
        {
            "label": "Confidential",
            "attributes": {
                "page_number": 4,
                "snippet": "DO NOT DISTRIBUTE"
            }
        }
      ]
    }
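
The scored and attributed records above can be assembled as plain dicts before schema validation. As a sketch (the helper name make_scored_record is hypothetical, not part of dorsal; the field names mirror the examples above):

```python
def make_scored_record(predictions: dict[str, float], vocabulary: list[str],
                       explanation: str) -> dict:
    """Assemble an open/classification-style record with confidence scores.

    NOTE: hypothetical helper for illustration. `predictions` maps each
    applied label to the model's probability for it.
    """
    return {
        "score_explanation": explanation,
        "labels": [
            {"label": label, "score": score}
            for label, score in sorted(predictions.items())
        ],
        "vocabulary": vocabulary,
    }

record = make_scored_record(
    predictions={"Internal": 0.98},
    vocabulary=["Confidential", "Internal"],
    explanation="Model confidence probability [0, 1]",
)
# record["labels"] == [{"label": "Internal", "score": 0.98}]
```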
    

The main method

  • We use extract_pdf_text to retrieve the raw text content from the document as a list of strings, where each string is the text of a single page.

  • We then use a simple loop to go over each page, checking for any of our defined "sensitive" key words or phrases.

  • Finally, we call build_classification_record to create an open/classification schema-compliant record containing the result.

  • Note that, even in the case where no labels were found, we may wish to return a record including the vocabulary (the full list of labels) to show that the document was processed by the model, and no labels were found.
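
The keyword-matching loop in main has no dorsal dependencies, so it can also be factored into a standalone function and unit-tested in isolation. A minimal sketch of that step:

```python
SENSITIVE_LABELS = {
    "Confidential": ["confidential", "do not distribute", "private"],
    "Internal": ["internal use only", "proprietary"],
}

def match_labels(pages: list[str]) -> list[str]:
    """Return the sorted labels whose keywords appear in any page."""
    matches = set()
    for text in pages:
        text = text.lower()
        for label, keywords in SENSITIVE_LABELS.items():
            if any(k in text for k in keywords):
                matches.add(label)
    return sorted(matches)

# A positive and a negative case:
assert match_labels(["Marked INTERNAL USE ONLY."]) == ["Internal"]
assert match_labels(["Nothing sensitive here."]) == []
```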

File type flexibility

The SensitiveDocumentScanner example is designed to label the text content of PDF documents.

We could easily extend this model to other document types by adding calls to helpers that extract text from other file formats.

At run-time, attributes such as media_type, extension and name are available on your annotation model.

You could make use of these to perform conditional checks and use different text extraction logic depending on the document type:

def main(self) -> dict | None:
    # If it's a PDF document
    if self.media_type == "application/pdf":
        try:
            pages = extract_pdf_text(self.file_path)
        except Exception as err:
            self.set_error(f"Failed to parse PDF: {err}")
            return None
    # If it's a text file:
    elif self.media_type.startswith("text/"):
        try:
            with open(self.file_path, "r", encoding="utf-8") as fp:
                pages = [fp.read()]  # most text files don't have clear page markers, so treat it as one long page
        except Exception as err:
            self.set_error(f"Failed to parse text file: {err}")
            return None
    # If it's a Word (.docx) document:
    elif self.media_type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
        try:
            pages = extract_docx_text(self.file_path)  # define or import your own extraction helper
        except Exception as err:
            self.set_error(f"Failed to parse Word document: {err}")
            return None  
    ...

Class variables

The AnnotationModel class has some optional class variables:

  • id: A short identifier for the model, used downstream. If not set, the class name (e.g. "SensitiveDocumentScanner") is used.

  • version: The version of the model, e.g. "1.0.0" (default: None)

  • variant: Additional identifier for the model, for when more than one variant exists, e.g. "nano" or "gpu-large" (default: None)


Annotation Model id field

If you are publishing public annotations, the best way to document your annotation models is to use the id class variable.

For complex annotation models, where additional documentation may be valuable, a clear structure like github:my-organization/my-annotation-model is the most self-documenting.

Additionally, if you structure the id class variable to include either a github or gitlab prefix, then when the annotation is viewed on DorsalHub, a clickable link is created.


Testing the Model

  • Let's test the model using the run_model function from dorsal.testing.

  • This time, we will create a PDF document dependency.

Testing an Annotation Model with dependencies
from dorsal.file.dependencies import make_media_type_dependency
from dorsal.testing import run_model

dependencies = [
  make_media_type_dependency(include=["application/pdf"])
] 

result = run_model(
    annotation_model=SensitiveDocumentScanner,
    file_path="./test/documents/expenses-Q3.pdf",
    dependencies=dependencies,
    schema_id="open/classification"
)

print(result.model_dump_json(indent=2))

The result will look something like this:

{
  "name": "SensitiveDocumentScanner",
  "source": {
    "type": "Model",
    "model": "github:dorsalhub/annotation-model-examples",
    "version": "1.0.0",
    "variant": null
  },
  "record": {
    "labels": [
      {
        "label": "Internal"
      }
    ],
    "vocabulary": [
      "Confidential",
      "Internal"
    ]
  },
  "schema_id": "open/classification",
  "schema_version": "1.0.0",
  "time_taken": 0.019788757024798542,
  "error": null
}

Adding it to the Pipeline

  • Now that we've tested the model, let's use the register_model function from dorsal.api to add it to our pipeline.

  • Make sure that SensitiveDocumentScanner is defined outside of __main__ in an importable path (e.g. in sensitivedocumentscanner.py)

Model must be importable

If you have been following this tutorial so far in Jupyter, make sure to move your SensitiveDocumentScanner class to a .py file before registering it to the pipeline.

Registering a model copies its import path to your project config file, so it must be defined outside of __main__.

e.g. from sensitivedocumentscanner import SensitiveDocumentScanner where sensitivedocumentscanner.py is in the same directory as your main script/notebook.

Registering a Model to the Annotation Model Pipeline
from dorsal.api import register_model
from dorsal.file.dependencies import make_media_type_dependency
from sensitivedocumentscanner import SensitiveDocumentScanner  # The model must be importable to be registered

dependencies = [
  make_media_type_dependency(include=["application/pdf"])
] 

register_model(
    annotation_model=SensitiveDocumentScanner,
    schema_id="open/classification",
    dependencies=dependencies
)

You can confirm the model exists in the pipeline by calling show_model_pipeline, or by running dorsal config pipeline show in the CLI.

Testing the pipeline

Let's try it out!

from dorsal import LocalFile

lf = LocalFile("./test/documents/expenses-Q3.pdf", overwrite_cache=False)
print(lf.to_json())

Prints the full File Record including our new model's annotation:

{
  "hash": "51d72e1186ae0b1e82ba0a4e77e10051231d8c4367f973a002e7afce79a7c094",
  "validation_hash": "ed39870c204c1e4d82b1aab87190169611fe8f046c24f06062bd6be222035e68",
  "annotations": {
    "file/base": {
      "record": {
        "hash": "51d72e1186ae0b1e82ba0a4e77e10051231d8c4367f973a002e7afce79a7c094",
        "name": "expenses-Q3.pdf",
        "extension": ".pdf",
        "size": 109600,
        "media_type": "application/pdf",
        "media_type_prefix": "application"
      },
      "source": {
        "type": "Model",
        "model": "dorsal/base",
        "version": "1.0.0"
      }
    },
    "file/pdf": {
      "record": {
        "title": "Expenses Quarter 3 2024",
        "producer": "PDFlib+PDI 9.1.2p1 (PHP5/Linux-x86_64)",
        "version": "1.7",
        "page_count": 2,
        "creation_date": "2024-10-06T17:17:45+02:00",
        "modified_date": "2024-10-11T10:32:03+01:00"
      },
      "source": {
        "type": "Model",
        "model": "dorsal/pdf",
        "version": "1.0.0",
        "variant": "pypdfium2"
      }
    },
    "open/classification": {
      "record": {
        "labels": [
          {
            "label": "Internal"
          }
        ],
        "vocabulary": [
          "Confidential",
          "Internal"
        ]
      },
      "private": true,
      "source": {
        "type": "Model",
        "model": "github:dorsalhub/annotation-model-examples",
        "version": "1.0.0"
      },
      "schema_version": "1.0.0"
    }
  },
  "tags": [],
  "source": "disk",
  "local_attributes": {
    "date_modified": "2025-03-29 10:27:33.973895+00:00",
    "date_accessed": "2025-11-25 11:12:43.808318+00:00",
    "date_created": "2025-11-21 13:24:33.932904+00:00",
    "file_path": "/dev/test/documents/expenses-Q3.pdf",
    "file_size_bytes": 109600,
    "file_permissions_mode": 33279,
    "inode": 199847233464780399,
    "number_of_links": 1
  }
}
dorsal file scan ./test/documents/expenses-Q3.pdf --overwrite-cache

Displays the full File Record including our new model's annotation:

📄 Scanning metadata for expenses-Q3.pdf
╭────────────────────────── File Record: expenses-Q3.pdf ────────────────────────────╮
│                                                                                    │
│  Hashes                                                                            │
│       SHA-256:  51d72e1186ae0b1e82ba0a4e77e10051231d8c4367f973a002e7afce79a7c094   │
│        BLAKE3:  ed39870c204c1e4d82b1aab87190169611fe8f046c24f06062bd6be222035e68   │
│                                                                                    │
│  File Info                                                                         │
│     Full Path:  /dev/test/documents/expenses-Q3.pdf                                │
│      Modified:  2025-03-29 10:27:33                                                │
│          Name:  expenses-Q3.pdf                                                    │
│          Size:  107 KiB                                                            │
│    Media Type:  application/pdf                                                    │
│                                                                                    │
│  Tags                                                                              │
│        No tags found.                                                              │
│                                                                                    │
│  Pdf Info                                                                          │
│             title:  Expenses Quarter 3 2024                                        │
│          producer:  PDFlib+PDI 9.1.2p1 (PHP5/Linux-x86_64)                         │
│           version:  1.7                                                            │
│        page_count:  2                                                              │
│     creation_date:  2024-10-06T17:17:45+02:00                                      │
│     modified_date:  2024-10-11T10:32:03+01:00                                      │
│                                                                                    │
│  Classification Info                                                               │
│         labels:                                                                    │
│          label:  Internal                                                          │
│     vocabulary:  Confidential, Internal                                            │
│                                                                                    │
│                                                                                    │
╰────────────────────────────────────────────────────────────────────────────────────╯

Cleanup

Let's remove SensitiveDocumentScanner from our pipeline, either with the remove_model_by_name function from dorsal.api or with the CLI command dorsal config pipeline remove:

from dorsal.api import remove_model_by_name

remove_model_by_name("SensitiveDocumentScanner")
dorsal config pipeline remove SensitiveDocumentScanner 

In the final part of this series, we'll build a more complex Annotation Model for entity extraction.

➡️ Continue to 6. Custom Annotation Models Part 3: Entity Extraction