
Custom Annotation Models Part 3: Entity Extraction

This is Part 3 of a 3-part series on building your own Annotation Models.

  • In Part 1: Hello, Word!:

    • We introduced the AnnotationModel class
    • We built a simple model to count words in a text file
    • We learned about dependencies and the dorsal.testing module
  • In Part 2: Classification:

    • We built a classification model which adds labels to PDF documents
    • We introduced open/classification - a schema for structuring and validating the result of any classification task.
    • We learned about class variables id, version and variant

In this final chapter, we will build a model which extracts structured information from Invoices.

This model represents a significant step up in complexity. It uses a Document Understanding model (LayoutLM) which analyzes not just the text, but the spatial arrangement and structure of the document to answer questions like "What is the invoice number?".

Prerequisites

  • The model we are building uses some standard machine learning libraries: PyTorch, Transformers and Accelerate.

  • You will also need the Python image processing library Pillow.

  • Install these in your environment before proceeding:

    pip install transformers torch accelerate Pillow
    

InvoiceExtractor

In this guide, we will build a PDF document extraction model called InvoiceExtractor.

This model uses a pre-trained Question Answering model to find specific entities (Invoice Number, Date, and Total) within a document.

It outputs an open/entity-extraction annotation record. This is a structured schema for entity extraction tasks.

Here is how a schema-compliant annotation record might look for a file processed by the InvoiceExtractor:

{
  "entities": [
    {
      "concept": "InvoiceNumber",
      "text": "INV-9920-XQ",
      "score": 0.99,
      "location": [
        {
          "type": "block",
          "page_number": 1,
          "box": { "x": 848, "y": 70, "width": 98, "height": 9 }
        }
      ]
    }
  ]
}
  • concept: The meaning of the entity (e.g., "InvoiceNumber").
  • text: The extracted text.
  • score: A confidence score provided by the model. A higher value means a more confident result.
  • location: One or more bounding boxes indicating where the text was found.
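
For example, a downstream consumer that only wants high-confidence results can filter on score. A minimal sketch, assuming record is a parsed open/entity-extraction dict like the one above:

    # `record` is assumed to be a dict matching the example record above
    CONFIDENCE_THRESHOLD = 0.9

    confident = [e for e in record["entities"] if e["score"] >= CONFIDENCE_THRESHOLD]
    for entity in confident:
        print(entity["concept"], entity["text"], entity["score"])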

InvoiceExtractor is primarily a wrapper around a pre-trained LayoutLM model, layoutlm-document-qa, which has been fine-tuned for question answering.

The inference is handled by a Hugging Face document-question-answering pipeline, which we load lazily (to avoid slowing Dorsal down for unrelated tasks).

The logic in InvoiceExtractor is more involved than in the models we built in Parts 1 and 2 of this guide:

  • We define a set of questions to "ask" the model in the SCHEMA_MAP class variable, e.g. "What is the invoice number?"
  • We provide the underlying layoutlm-document-qa model with the content extracted from our PDF document
  • Finally, we manually construct and return a schema-compliant Annotation Record, mapping the raw result from the underlying model to the open/entity-extraction schema.

Here is the complete code for the model:

Invoice Extractor Annotation Model
import logging
import uuid
from typing import Any, ClassVar, Dict

from dorsal import AnnotationModel
from dorsal.file.preprocessing.pdf import (
    extract_pdf_layout_per_mille, 
    extract_pdf_pages,
    PDFPage
)

logger = logging.getLogger(__name__)

class InvoiceExtractor(AnnotationModel):
    """
    An invoice extractor that wraps a pre-trained LayoutLM model.

    Runs inference locally using a Hugging Face `document-question-answering` pipeline.

    Model: https://huggingface.co/impira/layoutlm-document-qa

    Returns an `open/entity-extraction` schema-validated annotation.
    """
    id = "github:dorsalhub/annotation-model-examples"
    version = "1.0.0"
    variant = "layoutlm-document-qa"

    _inference_engine: ClassVar[Any] = None
    _model_load_error: ClassVar[str | None] = None

    SCHEMA_MAP = [
        {
            "concept": "InvoiceNumber", 
            "label": "string",
            "question": "What is the invoice number?"
        },
        {
            "concept": "InvoiceDate", 
            "label": "date",
            "question": "What is the invoice date?"
        },
        {
            "concept": "GrandTotal", 
            "label": "money",
            "question": "What is the grand total amount?",
            "static_attributes": {
                "currency": "USD"
            }
        }
    ]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if self._inference_engine is None and self._model_load_error is None:
            self._load_inference_engine()

    @classmethod
    def _load_inference_engine(cls):
        """Loads the Hugging Face pipeline into the class attribute."""
        logger.info("Initializing LayoutLM inference engine...")
        try:
            from transformers import pipeline
            cls._inference_engine = pipeline(
                "document-question-answering", 
                model="impira/layoutlm-document-qa",
                device_map="auto"
            )
        except ImportError as err:
            cls._model_load_error = f"Missing dependencies: {err}"
        except Exception as err:
            cls._model_load_error = f"Failed to load model weights: {err}"

    def _build_entity(self, prediction: dict, target_config: dict, page: PDFPage) -> dict | None:
        """Map raw model prediction to the `open/entity-extraction` validation schema."""

        # Calculate the merged bounding box of the answer
        start, end = prediction['start'], prediction['end']
        tokens = page.tokens[start : end + 1]

        if not tokens:
            # Fallback if token mapping fails
            return None

        # Calculate the enclosing rectangle for all tokens in the answer
        x1 = min(t.box[0] for t in tokens)
        y1 = min(t.box[1] for t in tokens)
        x2 = max(t.box[2] for t in tokens)
        y2 = max(t.box[3] for t in tokens)

        raw_value = prediction['answer']

        entity = {
            "id": str(uuid.uuid4()),
            "concept": target_config["concept"],
            "label": target_config["label"],
            "text": raw_value,
            "normalized_value": raw_value, # In production, parse dates/floats here
            "score": prediction['score'],
            "location": [
                {
                    "type": "block",
                    "block_type": "box",
                    "page_number": page.page_number,
                    "unit": "per_mille",
                    "box": {
                        "x": x1,
                        "y": y1,
                        "width": x2 - x1,
                        "height": y2 - y1
                    }
                }
            ]
        }

        if "static_attributes" in target_config:
            entity["attributes"] = target_config["static_attributes"]

        return entity

    def main(self) -> Dict[str, Any] | None:
        # 1. Check Readiness
        if self._model_load_error or not self._inference_engine:
            self.set_error(self._model_load_error or "Engine not initialized")
            return None

        # 2. Extract Data (Visual + Spatial)
        try:
            # Get the Bounding Boxes (Spatial)
            pdf_layout = extract_pdf_layout_per_mille(self.file_path, strict=False)
            # Get the Images (Visual) - Consume generator immediately
            pdf_pages = list(extract_pdf_pages(self.file_path, scale=2.0))
        except Exception as err:
            self.set_error(f"PDF Processing failed: {err}")
            return None

        if not pdf_layout:
            self.set_error(f"PDF Processing failed: Failed to extract PDF")
            return None

        if len(pdf_layout) != len(pdf_pages):
            self.set_error("Mismatch between layout pages and visual pages.")
            return None

        extracted_entities = []

        # 3. Inference Loop
        for page_layout, page_image in zip(pdf_layout, pdf_pages):

            # Format tokens for LayoutLM
            word_boxes = [
                (token.text, list(token.box)) 
                for token in page_layout.tokens
            ]

            # Ask each question in our Blueprint
            for target in self.SCHEMA_MAP:
                try:
                    prediction = self._inference_engine(
                        image=page_image,               
                        word_boxes=word_boxes,          
                        question=target["question"]
                    )

                    if isinstance(prediction, list):
                        prediction = prediction[0]

                    if prediction:
                        # 4. Map to Schema
                        entity = self._build_entity(prediction, target, page_layout)
                        if entity is None:
                            continue
                        extracted_entities.append(entity)

                except Exception as err:
                    logger.warning(f"Inference failed: {err}")

        # 5. Return result
        if not extracted_entities:
            return {
                "vocabulary": list(set(i["concept"] for i in self.SCHEMA_MAP)),
                "entities": []
            }

        return {
            "vocabulary": list(set(i["concept"] for i in self.SCHEMA_MAP)),
            "entities": extracted_entities
        }
Deep Dive: Document Understanding & Layout Analysis

graph TD
    Input[("📄 Input PDF")] --> Check{Engine Loaded?}
    Check -- No --> Load["(A) Load Weights"]
    Check -- Yes --> Split
    Load --> Split

    subgraph "(B) Preprocessing"
        Split((Split)) --> Visual["Extract Images<br>(Required by API)"]
        Split --> Spatial["Extract Layout<br>(Text + BBox)"]
    end

    Visual --> Merge
    Spatial --> Merge

    subgraph "(C) Inference"
        Merge{{"Zip Streams"}} --> Tokenize["Map Tokens to Boxes"]
        Tokenize --> Query["Ask Questions<br>(LayoutLMv1)"]
    end

    Query --> Raw["Raw Predictions"]
    Raw --> Map["(D) _build_entity<br>(Schema Mapping)"]
    Map --> Output[("JSON Output<br>(open/entity-extraction)")]

The InvoiceExtractor uses a Document Understanding approach to entity extraction.

This design pattern is suitable for complex documents like Invoices or Forms with unstructured layouts: by using spatial data (where text is located on the page), the model can capture relationships that plain text alone would miss.


Figure: InvoiceExtractor flow chart

A. Lazy Loading: The inference engine powering the InvoiceExtractor is fairly heavy (~500MB).

  • At run time, the model checks whether self._inference_engine is None and only loads it when the first file is processed.

    if self._inference_engine is None:
        self._load_inference_engine()
    

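To see the pattern in isolation, here is a minimal, Dorsal-free sketch of the same idea (HeavyModel is a stand-in for InvoiceExtractor; it is not part of the dorsal library):

    class HeavyModel:
        _engine = None  # shared by all instances (class attribute)

        def __init__(self):
            # only the first construction pays the loading cost
            if type(self)._engine is None:
                type(self)._engine = self._load()

        @staticmethod
        def _load():
            print("loading weights...")  # the expensive step happens here, once
            return object()

    a = HeavyModel()  # prints "loading weights..."
    b = HeavyModel()  # silent: reuses the engine already loaded on the class
    assert a._engine is b._engine
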
B. Preprocessing & Inputs: We split the PDF into two parallel streams.

  1. Spatial (The Brains): We extract the bounding box of every word. LayoutLMv1 primarily relies on this 2D positional information (Text + Coordinates) to understand the document structure.
  2. Visual (The Container): We render the page images. While newer models (like LayoutLMv3) analyze pixels directly, the model we are using (v1) relies on the spatial grid. However, the pipeline API still requires the image, which it uses to validate dimensions or to perform fallback OCR if we fail to provide text.

    # Visual (Required by Pipeline API)
    pdf_pages = extract_pdf_pages(self.file_path, scale=2.0)
    # Spatial (The actual model inputs)
    pdf_layout = extract_pdf_layout_per_mille(self.file_path)
    

C. Inference: We process the document page-by-page. For each page, we structure the layout data into word_boxes. We then loop through our defined schema questions (e.g., "What is the total?").

  • Note: By providing word_boxes explicitly, we bypass the pipeline's internal Tesseract OCR. This is typically much faster, but assumes a text layer exists on the PDF.

    for page_layout, page_image in zip(pdf_layout, pdf_pages):
        # 1. Structure Layout Data
        word_boxes = [(t.text, list(t.box)) for t in page_layout.tokens]
    
        # 2. Ask Every Question
        for target in self.SCHEMA_MAP:
            prediction = self._inference_engine(
                image=page_image,
                word_boxes=word_boxes,
                question=target["question"]
            )
    

D. Schema Mapping: The raw model output identifies the answer as a span of individual tokens (words). We must combine them into a single logical entity that validates against open/entity-extraction.

  • Union Box: We take the minimum and maximum x/y coordinates of all tokens in the answer to create one bounding box that encompasses the entire phrase.
  • Concept vs Label: Following the open/entity-extraction schema, we map the business meaning (e.g., 'InvoiceDate') to concept and the data type (e.g., 'date') to label.

    entity = {
        "id": str(uuid.uuid4()),
        "concept": target_config["concept"], # e.g. "InvoiceNumber"
        "label": target_config["label"],     # e.g. "string"
        "text": raw_value,
        "normalized_value": raw_value,       # In production, cast this to float/date
        "score": prediction['score'],
        "location": [
            {
                "type": "block",
                "block_type": "box",
                "page_number": page.page_number,
                "unit": "per_mille",         # Schema requires knowing the unit
                "box": {
                    "x": x1,                 # Union of all token X coordinates
                    "y": y1,
                    "width": x2 - x1,
                    "height": y2 - y1
                }
            }
        ]
    }
    

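In this guide we leave normalized_value equal to the raw text, but in production you would typically cast it based on the label. A minimal sketch of what that could look like (normalize_value is a hypothetical helper, and the date format is assumed from the sample invoice):

    from datetime import datetime

    def normalize_value(label: str, raw: str):
        """Hypothetical helper: cast the raw answer text based on its label."""
        if label == "money":
            # strip the currency symbol and thousands separators before casting
            return float(raw.replace("$", "").replace(",", "").strip())
        if label == "date":
            # assumes the "Month DD, YYYY" format seen in the sample invoice
            return datetime.strptime(raw.strip(), "%B %d, %Y").date().isoformat()
        return raw

    normalize_value("money", "$3,572.25")         # 3572.25
    normalize_value("date", "November 21, 2025")  # '2025-11-21'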

main method

1. Check Readiness:

  • For models with a startup cost (in this case, loading the underlying Hugging Face + PyTorch _inference_engine can take well over a minute, depending on your hardware), it is advisable to implement a "lazy loading" pattern.

  • By using a class variable and checking it in __init__, we ensure the heavy underlying "engine" (pipeline and inference model) is only loaded:

    1. Once (even if we process thousands of documents, it is shared across all instances of the class).
    2. On Demand (only when the pipeline actually runs this model).
  • In main we check if self._model_load_error or not self._inference_engine to see if the engine is ready. If it is not, we set an error message and return None.

    def main(self) -> Dict[str, Any] | None:
        # 1. Check Readiness
        if self._model_load_error or not self._inference_engine:
            self.set_error(self._model_load_error or "Engine not initialized")
            return None

2. Extract Data (Visual + Spatial):

This model needs two parallel inputs to process a page:

  1. Visual: We extract the page image because the underlying Hugging Face pipeline requires it to establish the document dimensions and coordinate system. We use extract_pdf_pages to get this.

  2. Spatial: We extract the bounding box of every word. The model uses these coordinates to understand the document structure (e.g., distinguishing a header from a table row). We use extract_pdf_layout_per_mille to get this.

  3. If extraction fails for any reason, we set an error message and return None.

    def main(self) -> Dict[str, Any] | None:
        # 1. Check Readiness
        if self._model_load_error or not self._inference_engine:
            self.set_error(self._model_load_error or "Engine not initialized")
            return None

        # 2. Extract Data (Visual + Spatial)
        try:
            # Get the Bounding Boxes (Spatial)
            pdf_layout = extract_pdf_layout_per_mille(self.file_path, strict=False)
            # Get the Images (Visual) - Consume generator immediately
            pdf_pages = list(extract_pdf_pages(self.file_path, scale=2.0))
        except Exception as err:
            self.set_error(f"PDF Processing failed: {err}")
            return None

        if not pdf_layout:
            self.set_error(f"PDF Processing failed: Failed to extract PDF")
            return None

        if len(pdf_layout) != len(pdf_pages):
            self.set_error("Mismatch between layout pages and visual pages.")
            return None

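If you want to sanity-check these two streams before wiring up inference, a quick exploratory sketch (run outside the model class, using the test invoice path from later in this guide) might look like:

    from dorsal.file.preprocessing.pdf import (
        extract_pdf_layout_per_mille,
        extract_pdf_pages,
    )

    pdf_layout = extract_pdf_layout_per_mille("./test/documents/invoice_001.pdf", strict=False)
    pdf_pages = list(extract_pdf_pages("./test/documents/invoice_001.pdf", scale=2.0))

    first_page = pdf_layout[0]
    print(first_page.page_number, len(first_page.tokens))
    for token in first_page.tokens[:3]:
        print(token.text, token.box)  # box coordinates are per-mille (0-1000)
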
3. Inference:

  • We zip these two streams together in the main loop to process the document page by page.

  • We pass the underlying _inference_engine three things:

    1. The PDF page as a PIL.Image object
    2. The bounding box and text of every word extracted from the PDF by the extract_pdf_layout_per_mille function.
    3. Each "question" from the class-level SCHEMA_MAP dictionary
    prediction = self._inference_engine(
        image=page_image,               
        word_boxes=word_boxes,          
        question=target["question"]
    )
    

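The prediction that comes back is a dict (or a list of dicts, of which we take the first). Its exact contents depend on the pipeline version, but the keys _build_entity relies on look roughly like this (the values below are illustrative only):

    prediction = {
        "answer": "INV-9920-XQ",  # the extracted text span
        "score": 0.9988,          # model confidence
        "start": 12,              # index of the first answer token in word_boxes (illustrative)
        "end": 13,                # index of the last answer token (illustrative)
    }
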
4. Map to Schema:

The logic for converting outputs from models as complex as this is usually quite specialized, so rather than trying to use a helper from dorsal.file.helpers, we are better off doing the mapping and calculations ourselves.

We handle this by creating a helper method _build_entity on the InvoiceExtractor, which maps entities extracted from the underlying Hugging Face pipeline to the open/entity-extraction schema.

  • unit: The underlying model uses a 0-1000 coordinate system, defined in open/entity-extraction as per_mille. We could convert it to another unit, but open/entity-extraction supports per_mille, so we leave it as it is (if you do need absolute pixels, see the conversion sketch after the code below).
  • box: We calculate the {x, y, width, height} rectangle containing the answer by merging the bounding boxes of the tokens identified by the model. This creates a box around the tokens that informed the extraction.
    def _build_entity(self, prediction: dict, target_config: dict, page: PDFPage) -> dict | None:
        """Map raw model prediction to the `open/entity-extraction` validation schema."""

        # Calculate the merged bounding box of the answer
        start, end = prediction['start'], prediction['end']
        tokens = page.tokens[start : end + 1]

        if not tokens:
            # Fallback if token mapping fails
            return None

        # Calculate the enclosing rectangle for all tokens in the answer
        x1 = min(t.box[0] for t in tokens)
        y1 = min(t.box[1] for t in tokens)
        x2 = max(t.box[2] for t in tokens)
        y2 = max(t.box[3] for t in tokens)

        raw_value = prediction['answer']

        entity = {
            "id": str(uuid.uuid4()),
            "concept": target_config["concept"],
            "label": target_config["label"],
            "text": raw_value,
            "value": raw_value, # In production, you might want to parse this
            "score": prediction['score'],
            "location": [
                {
                    "type": "block",
                    "block_type": "box",
                    "page_number": page.page_number,
                    "unit": "per_mille",
                    "box": {
                        "x": x1,
                        "y": y1,
                        "width": x2 - x1,
                        "height": y2 - y1
                    }
                }
            ]
        }

        if "static_attributes" in target_config:
            entity["attributes"] = target_config["static_attributes"]

        return entity

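If you ever need absolute pixel coordinates instead of per-mille values, the conversion only requires the rendered page size. A minimal sketch (per_mille_box_to_pixels is a hypothetical helper, assuming x/width are relative to the page width and y/height to the page height):

    def per_mille_box_to_pixels(box: dict, page_width_px: int, page_height_px: int) -> dict:
        """Hypothetical helper: convert a per-mille (0-1000) box to absolute pixels."""
        return {
            "x": round(box["x"] / 1000 * page_width_px),
            "y": round(box["y"] / 1000 * page_height_px),
            "width": round(box["width"] / 1000 * page_width_px),
            "height": round(box["height"] / 1000 * page_height_px),
        }

    # e.g. the InvoiceNumber box from the sample output, on a 1000 x 1400 px render
    per_mille_box_to_pixels({"x": 848, "y": 70, "width": 98, "height": 9}, 1000, 1400)
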
5. Return Result:

  • Finally, we return a schema-compliant dictionary from the main method.

  • In the case where the model was unable to extract any entities, we can safely return an empty list alongside the vocabulary, to tell downstream consumers "I tried to find these particular entities, but none were there".

    def main(self) -> Dict[str, Any] | None:
        # 1. Check Readiness
        if self._model_load_error or not self._inference_engine:
            self.set_error(self._model_load_error or "Engine not initialized")
            return None

        # 2. Extract Data (Visual + Spatial)
        try:
            # Get the Bounding Boxes (Spatial)
            pdf_layout = extract_pdf_layout_per_mille(self.file_path, strict=False)
            # Get the Images (Visual) - Consume generator immediately
            pdf_pages = list(extract_pdf_pages(self.file_path, scale=2.0))
        except Exception as err:
            self.set_error(f"PDF Processing failed: {err}")
            return None

        if not pdf_layout:
            self.set_error(f"PDF Processing failed: Failed to extract PDF")
            return None

        if len(pdf_layout) != len(pdf_pages):
            self.set_error("Mismatch between layout pages and visual pages.")
            return None

        extracted_entities = []

        # 3. Inference Loop
        for page_layout, page_image in zip(pdf_layout, pdf_pages):

            # Format tokens for LayoutLM
            word_boxes = [
                (token.text, list(token.box)) 
                for token in page_layout.tokens
            ]

            # Ask each question in our Blueprint
            for target in self.SCHEMA_MAP:
                try:
                    prediction = self._inference_engine(
                        image=page_image,               
                        word_boxes=word_boxes,          
                        question=target["question"]
                    )

                    if isinstance(prediction, list):
                        prediction = prediction[0]

                    if prediction:
                        # 4. Map to Schema
                        entity = self._build_entity(prediction, target, page_layout)
                        if entity is None:
                            continue
                        extracted_entities.append(entity)

                except Exception as err:
                    logger.warning(f"Inference failed: {err}")

        # 5. Return result
        if not extracted_entities:
            return {
                "vocabulary": list(set(i["concept"] for i in self.SCHEMA_MAP)),
                "entities": []
            }

        return {
            "vocabulary": list(set(i["concept"] for i in self.SCHEMA_MAP)),
            "entities": extracted_entities
        }

Testing the Model

First Run Speed

Because it downloads weights and instantiates a fairly heavy Hugging Face pipeline in the background, processing the first document will typically be much slower than subsequent documents. This is called a "cold start".

Because the model weights are cached locally (typically in ~/.cache/huggingface), subsequent runs ("warm start") will be faster, as they skip the download and initialization.
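
If the cold start is a problem (for example in a long-running service), you can pay the cost up front by calling the class's own _load_inference_engine() before any files are processed. A minimal sketch, assuming the class lives in invoice_extractor.py as in the registration step below:

    from invoice_extractor import InvoiceExtractor

    # Trigger the download / load once, e.g. at service startup
    InvoiceExtractor._load_inference_engine()
    if InvoiceExtractor._model_load_error:
        raise RuntimeError(InvoiceExtractor._model_load_error)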

Testing the Invoice Extractor
from dorsal.file.dependencies import make_media_type_dependency
from dorsal.testing import run_model

dependencies = [
  make_media_type_dependency(include=["application/pdf"])
]

result = run_model(
    annotation_model=InvoiceExtractor,
    file_path="./test/documents/invoice_001.pdf",
    dependencies=dependencies,
    schema_id="open/entity-extraction"
)

print(result.model_dump_json(indent=2))

The RunModelResult output shows all extracted entities:

{
  "name": "InvoiceExtractor",
  "source": {
    "type": "Model",
    "model": "github:dorsalhub/annotation-model-examples",
    "version": "1.0.0",
    "variant": "layoutlm-document-qa"
  },
  "record": {
    "vocabulary": [
      "date",
      "string",
      "money"
    ],
    "entities": [
      {
        "id": "c4bb0712-b2c2-46b2-9964-001d10a9e5bf",
        "concept": "InvoiceNumber",
        "label": "string",
        "text": "INV-9920-XQ",
        "normalized_value": "INV-9920-XQ",
        "score": 0.9988334774971008,
        "location": [
          {
            "type": "block",
            "block_type": "box",
            "page_number": 1,
            "unit": "per_mille",
            "box": {
              "x": 848,
              "y": 70,
              "width": 98,
              "height": 9
            }
          }
        ]
      },
      {
        "id": "ce3bbe77-f2a4-486a-b851-5cb84b46e17c",
        "concept": "InvoiceDate",
        "label": "date",
        "text": "November 21, 2025",
        "normalized_value": "November 21, 2025",
        "score": 0.9999126195907593,
        "location": [
          {
            "type": "block",
            "block_type": "box",
            "page_number": 1,
            "unit": "per_mille",
            "box": {
              "x": 801,
              "y": 87,
              "width": 145,
              "height": 10
            }
          }
        ]
      },
      {
        "id": "ec868cd6-ad18-4c43-a4ff-fe230c68a0e2",
        "concept": "GrandTotal",
        "label": "money",
        "text": "$3,572.25",
        "normalized_value": "$3,572.25",
        "score": 0.9905691146850586,
        "location": [
          {
            "type": "block",
            "block_type": "box",
            "page_number": 1,
            "unit": "per_mille",
            "box": {
              "x": 873,
              "y": 103,
              "width": 73,
              "height": 11
            }
          }
        ],
        "attributes": {
          "currency": "USD"
        }
      }
    ]
  },
  "schema_id": "open/entity-extraction",
  "time_taken": 12.481572052987758,
  "error": null
}

Adding it to the Pipeline

  • Now that we've tested the model, let's use the register_model function from dorsal.api to add it to our pipeline.

  • Make sure that InvoiceExtractor is defined outside of __main__ in an importable path (e.g. in invoice_extractor.py).

Model must be importable

If you have been following this tutorial so far in Jupyter, make sure to move your InvoiceExtractor class to a .py file before registering it to the pipeline.

Registering a model copies its import path to your project config file, so it must be defined outside of __main__.

e.g. from invoice_extractor import InvoiceExtractor where invoice_extractor.py is in the same directory as your main script/notebook.
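
In other words, your module might look something like this (a minimal sketch; the file name is just an example):

    # invoice_extractor.py  (same directory as your main script / notebook)
    from dorsal import AnnotationModel

    class InvoiceExtractor(AnnotationModel):
        id = "github:dorsalhub/annotation-model-examples"
        version = "1.0.0"
        variant = "layoutlm-document-qa"
        ...  # rest of the class exactly as shown earlier in this guide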

Registering the Invoice Extractor
from dorsal.api import register_model
from dorsal.file.dependencies import make_media_type_dependency
from invoice_extractor import InvoiceExtractor

dependencies = [
  make_media_type_dependency(include=["application/pdf"])
] 

register_model(
    annotation_model=InvoiceExtractor,
    schema_id="open/entity-extraction",
    dependencies=dependencies
)

You can confirm the model exists in the pipeline by calling show_model_pipeline, or by running dorsal config pipeline show in the CLI.

Testing the Pipeline

Now, when we scan a PDF, we get entity extraction automatically:

from dorsal import LocalFile

# Process the invoice
lf = LocalFile("./test/documents/invoice_001.pdf", use_cache=False)

# Access the specific schema
extraction = lf.get_annotation("open/entity-extraction")

if extraction:
    for entity in extraction.record.entities:
        print(f"{entity.concept}: {entity.text} ({entity.score:.2f})")
    print()

Output:

InvoiceNumber: INV-9920-XQ (1.00)
InvoiceDate: November 21, 2025 (1.00)
GrandTotal: $3,572.25 (0.99)

dorsal file scan ./test/documents/invoice_001.pdf --skip-cache

Output:

📄 Scanning metadata for simple_invoice_demo.pdf
╭─────────────────────── File Record: simple_invoice_demo.pdf ───────────────────────╮
│                                                                                    │
│  Hashes                                                                            │
│       SHA-256:  768b1efb2a702ed4086eeb70de219bd02659e7929e1a6c17afb323e1048028b6   │
│        BLAKE3:  73c02134b51ec66a468756ab5eef168efab68afe9f90ec456545cbb4425b8efb   │
│                                                                                    │
│  File Info                                                                         │
│     Full Path:  /dev/test/documents/invoice_001.pdf                                │
│      Modified:  2025-11-21 16:58:09                                                │
│          Name:  simple_invoice_demo.pdf                                            │
│          Size:  2 KiB                                                              │
│    Media Type:  application/pdf                                                    │
│                                                                                    │
│  Tags                                                                              │
│        No tags found.                                                              │
│                                                                                    │
│  Pdf Info                                                                          │
│          producer:  PyFPDF 1.7.2 http://pyfpdf.googlecode.com/                     │
│           version:  1.3                                                            │
│        page_count:  1                                                              │
│     creation_date:  2025-11-21T16:58:09                                            │
│                                                                                    │
│  Entity-Extraction Info                                                            │
│           vocabulary:  date, string, money                                         │
│             entities:                                                              │
│                   id:  0dcff482-4a2f-41db-ac79-cad9c407093c                        │
│              concept:  InvoiceNumber                                               │
│                label:  string                                                      │
│                 text:  INV-9920-XQ                                                 │
│     normalized_value:  INV-9920-XQ                                                 │
│                score:  0.9988334774971008                                          │
│             location:                                                              │
│                 type:  block                                                       │
│           block_type:  box                                                         │
│          page_number:  1                                                           │
│                 unit:  per_mille                                                   │
│                  box:                                                              │
│                    x:  848                                                         │
│                    y:  70                                                          │
│                width:  98                                                          │
│               height:  9                                                           │
│                   id:  db8b1c9f-9ee0-4d89-9e6f-14c0eab220b8                        │
│              concept:  InvoiceDate                                                 │
│                label:  date                                                        │
│                 text:  November 21, 2025                                           │
│     normalized_value:  November 21, 2025                                           │
│                score:  0.9999126195907593                                          │
│             location:                                                              │
│                 type:  block                                                       │
│           block_type:  box                                                         │
│          page_number:  1                                                           │
│                 unit:  per_mille                                                   │
│                  box:                                                              │
│                    x:  801                                                         │
│                    y:  87                                                          │
│                width:  145                                                         │
│               height:  10                                                          │
│                   id:  175361b4-1a99-417d-b42e-8a5e87067e7f                        │
│              concept:  GrandTotal                                                  │
│                label:  money                                                       │
│                 text:  $3,572.25                                                   │
│     normalized_value:  $3,572.25                                                   │
│                score:  0.9905691146850586                                          │
│             location:                                                              │
│                 type:  block                                                       │
│           block_type:  box                                                         │
│          page_number:  1                                                           │
│                 unit:  per_mille                                                   │
│                  box:                                                              │
│                    x:  873                                                         │
│                    y:  103                                                         │
│                width:  73                                                          │
│               height:  11                                                          │
│           attributes:                                                              │
│             currency:  USD                                                         │
│                                                                                    │
│                                                                                    │
╰────────────────────────────────────────────────────────────────────────────────────╯

Cleanup

Let's remove InvoiceExtractor from our pipeline, either with the remove_model_by_name function from dorsal.api:

from dorsal.api import remove_model_by_name

remove_model_by_name("InvoiceExtractor")

or with the equivalent CLI command:

dorsal config pipeline remove InvoiceExtractor

Conclusion

Over this three-part series, you have gone from counting words in a text file to building a production-grade, multimodal AI extraction pipeline.

You have learned:

  1. Structure: How to wrap logic in AnnotationModel classes.
  2. Integration: How to register models and manage dependencies.
  3. Validation: How to use Schemas to ensure your data is clean, typed, and usable.

You can now build any extraction logic you can imagine, plug it into Dorsal, and automatically process your files into structured records.