Custom Annotation Models Part 3: Entity Extraction
This is Part 3 of a 3-part series on building your own Annotation Models.
In Part 1:
- We introduced the `AnnotationModel` class
- We built a simple model to count words in a text file
- We learned about dependencies and the `dorsal.testing` module

In Part 2:
- We built a classification model which adds labels to PDF documents
- We introduced `open/classification`, a schema for structuring and validating the result of any classification task
- We learned about the class variables `id`, `version` and `variant`
In this final chapter, we will build a model which extracts structured information from Invoices.
This model represents a significant step up in complexity. It uses a Document Understanding model (LayoutLM) which analyzes not just the text, but the spatial arrangement and structure of the document to answer questions like "What is the invoice number?".
Prerequisites
- The model we are building uses some standard machine learning libraries: PyTorch, Transformers and Accelerate.
- You will also need the Python image processing library Pillow.
- Install these in your environment before proceeding:
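For example, with pip (using the standard package names for the libraries above):

```bash
pip install torch transformers accelerate pillow
```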
InvoiceExtractor
In this guide, we will build a PDF document extraction model called InvoiceExtractor.
This model uses a pre-trained Question Answering model to find specific entities (Invoice Number, Date, and Total) within a document.
It outputs an open/entity-extraction annotation record. This is a structured schema for entity extraction tasks:
Here is how a schema-compliant annotation record might look, for a file processed by the InvoiceExtractor:
{
  "entities": [
    {
      "concept": "InvoiceNumber",
      "text": "INV-9920-XQ",
      "score": 0.99,
      "location": [
        {
          "type": "block",
          "page_number": 1,
          "box": { "x": 848, "y": 70, "width": 98, "height": 9 }
        }
      ]
    }
  ]
}
- `concept`: The meaning of the entity (e.g., "InvoiceNumber").
- `text`: The extracted text.
- `score`: A confidence score provided by the model. A higher value means a more confident result.
- `location`: A bounding box indicating where the text was found.
InvoiceExtractor is primarily a wrapper around a pre-trained LayoutLM model, `layoutlm-document-qa`, fine-tuned for question answering tasks.
The inference is handled by a Hugging Face **document-question-answering** pipeline, which we lazily load to avoid slowing Dorsal down for unrelated tasks.
The logic that InvoiceExtractor captures is more involved than the models we built in parts 1 and 2 of this guide:
- We define a set of questions to "ask" the model in the `SCHEMA_MAP` class variable, e.g. "What is the invoice number?"
- We provide the underlying `layoutlm-document-qa` model with the extracted content from our PDF document.
- Finally, we manually construct and return a schema-compliant Annotation Record, mapping the raw result from the underlying model to the `open/entity-extraction` schema.
Here is the complete code for the model:
*Listing: Invoice Extractor Annotation Model (full source, 187 lines, omitted here). A condensed sketch follows below.*
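Before diving into the details, here is a condensed sketch of how the pieces described in the rest of this chapter fit together. The Hugging Face pipeline call is real, but the `dorsal` import paths, the `__init__`/`main` signatures, the class variable values and the exact `SCHEMA_MAP` questions are assumptions reconstructed from the prose in this guide, so treat this as a map of the ideas rather than the canonical listing:

```python
# Condensed sketch, not the full 187-line listing referenced above.
# Everything outside the Hugging Face calls (dorsal import paths, method
# signatures, SCHEMA_MAP questions) is an assumption based on this guide.
import uuid

from dorsal import AnnotationModel  # import path assumed
from dorsal.file.helpers import (   # helper locations assumed
    extract_pdf_layout_per_mille,
    extract_pdf_pages,
)


class InvoiceExtractor(AnnotationModel):
    id = "invoice-extractor"             # illustrative value
    version = "1.0.0"
    variant = "layoutlm-document-qa"

    # Questions we "ask" the document, mapped to schema concepts and labels.
    SCHEMA_MAP = [
        {"concept": "InvoiceNumber", "label": "string", "question": "What is the invoice number?"},
        {"concept": "InvoiceDate", "label": "date", "question": "What is the invoice date?"},
        {"concept": "GrandTotal", "label": "money", "question": "What is the total?"},
    ]

    _inference_engine = None   # class-level: shared across all instances
    _model_load_error = None

    def __init__(self):
        # Lazy loading: build the heavy Hugging Face pipeline only once.
        if InvoiceExtractor._inference_engine is None and not InvoiceExtractor._model_load_error:
            try:
                from transformers import pipeline

                InvoiceExtractor._inference_engine = pipeline(
                    "document-question-answering",
                    model="impira/layoutlm-document-qa",  # commonly used checkpoint; assumed
                )
            except Exception as exc:
                InvoiceExtractor._model_load_error = str(exc)

    def main(self, file_path):
        # 1. Check readiness
        if self._model_load_error or not self._inference_engine:
            return None

        # 2. Extract the two parallel input streams
        pdf_pages = extract_pdf_pages(file_path)              # visual: one image per page
        pdf_layout = extract_pdf_layout_per_mille(file_path)  # spatial: words + 0-1000 boxes

        # 3. Inference: ask every question on every page
        entities = []
        for page_layout, page_image in zip(pdf_layout, pdf_pages):
            word_boxes = [(t.text, list(t.box)) for t in page_layout.tokens]
            for target in self.SCHEMA_MAP:
                prediction = self._inference_engine(
                    image=page_image, word_boxes=word_boxes, question=target["question"]
                )
                # 4. Map the raw prediction onto the open/entity-extraction schema
                entities.append(self._build_entity(prediction, target, page_layout))

        # 5. Return a schema-compliant record
        return {"vocabulary": ["date", "string", "money"], "entities": entities}

    def _build_entity(self, prediction, target_config, page):
        # Merges token boxes and builds the entity dict; the "Deep Dive" and
        # "Map to Schema" sections below walk through the details.
        ...
```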
Deep Dive: Document Understanding & Layout Analysis
```mermaid
graph TD
Input[("📄 Input PDF")] --> Check{Engine Loaded?}
Check -- No --> Load["(A) Load Weights"]
Check -- Yes --> Split
Load --> Split
subgraph "(B) Preprocessing"
Split((Split)) --> Visual["Extract Images<br>(Required by API)"]
Split --> Spatial["Extract Layout<br>(Text + BBox)"]
end
Visual --> Merge
Spatial --> Merge
subgraph "(C) Inference"
Merge{{"Zip Streams"}} --> Tokenize["Map Tokens to Boxes"]
Tokenize --> Query["Ask Questions<br>(LayoutLMv1)"]
end
Query --> Raw["Raw Predictions"]
Raw --> Map["(D) _build_entity<br>(Schema Mapping)"]
Map --> Output[("JSON Output<br>(open/entity-extraction)")]
```

Figure: InvoiceExtractor flow chart

The InvoiceExtractor uses a Document Understanding approach to entity extraction.
This design pattern is suitable for complex documents like Invoices or Forms with unstructured layouts: by utilizing spatial data (where text is located), the model is able to learn more complex relationships.
A. Lazy Loading: The inference engine powering the InvoiceExtractor is fairly heavy (~500MB).
- At run-time the model checks `if self._inference_engine is None` and only loads it when the first file is processed.
B. Preprocessing & Inputs: We split the PDF into two parallel streams.
- Spatial (The Brains): We extract the bounding box of every word. LayoutLMv1 primarily relies on this 2D positional information (Text + Coordinates) to understand the document structure.
-
Visual (The Container): We render the page images. While newer models (like LayoutLMv3) analyze pixels directly, the model we are using (v1) relies on the spatial grid. However, the pipeline API requires the image to validate dimensions or perform fallback OCR if we fail to provide text.
C. Inference: We process the document page-by-page. For each page, we structure the layout data into `word_boxes`. We then loop through our defined schema questions (e.g., "What is the total?").
- Note: By providing `word_boxes` explicitly, we bypass the pipeline's internal Tesseract OCR. This is typically much faster, but assumes a text layer exists in the PDF.

```python
for page_layout, page_image in zip(pdf_layout, pdf_pages):
    # 1. Structure Layout Data
    word_boxes = [(t.text, list(t.box)) for t in page_layout.tokens]

    # 2. Ask Every Question
    for target in self.SCHEMA_MAP:
        prediction = self._inference_engine(
            image=page_image,
            word_boxes=word_boxes,
            question=target["question"]
        )
```
D. Schema Mapping: The raw model output returns a list of individual tokens (words). We must combine them into a single logical entity that validates against `open/entity-extraction`.
- Union Box: We calculate the `min(x)` and `max(x)` of all tokens in the answer to create one bounding box that encompasses the entire phrase.
- Concept vs Label: Following the `open/entity-extraction` schema, we map the business meaning (e.g., 'InvoiceDate') to `concept` and the data type (e.g., 'date') to `label`.

```python
entity = {
    "id": str(uuid.uuid4()),
    "concept": target_config["concept"],   # e.g. "InvoiceNumber"
    "label": target_config["label"],       # e.g. "string"
    "text": raw_value,
    "normalized_value": raw_value,         # In production, cast this to float/date
    "score": prediction['score'],
    "location": [
        {
            "type": "block",
            "block_type": "box",
            "page_number": page.page_number,
            "unit": "per_mille",           # Schema requires knowing the unit
            "box": {
                "x": x1,                   # Union of all token X coordinates
                "y": y1,
                "width": x2 - x1,
                "height": y2 - y1
            }
        }
    ]
}
```
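The `x1`, `y1`, `x2`, `y2` values above can be derived by merging the token boxes, roughly like this (a sketch; `answer_boxes` is an assumed variable holding the `(x1, y1, x2, y2)` box of each token the model matched):

```python
# Union box: the smallest rectangle covering every token in the answer.
x1 = min(box[0] for box in answer_boxes)
y1 = min(box[1] for box in answer_boxes)
x2 = max(box[2] for box in answer_boxes)
y2 = max(box[3] for box in answer_boxes)
```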
The `main` method
1. Check Readiness:
- For models with a startup cost (in this case, loading the underlying Hugging Face + PyTorch `_inference_engine` can take well over a minute, depending on your hardware), it is advisable to implement a "lazy loading" pattern.
- By using a class variable and checking it in `__init__`, we ensure the heavy underlying "engine" (pipeline and inference model) is only loaded:
  - Once (even if we process thousands of documents, it is shared across all instances of the class).
  - On Demand (only when the pipeline actually runs this model).
- In `main` we check `if self._model_load_error or not self._inference_engine` to see if the engine is ready. If it is not, we set an error message and return `None` (see the sketch below).
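A minimal sketch of that check at the top of `main` (how the error message is surfaced depends on the `AnnotationModel` base class, so the `self.error` attribute here is a placeholder assumption):

```python
def main(self, file_path):
    if self._model_load_error or not self._inference_engine:
        # Surface a useful message to the pipeline, then bail out.
        self.error = self._model_load_error or "Inference engine is not available."  # attribute name assumed
        return None
    ...
```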
2. Extract Data (Visual + Spatial):
This model needs two parallel inputs to process a page:
- Visual: We extract the page image because the underlying Hugging Face pipeline requires it to establish the document dimensions and coordinate system. We use `extract_pdf_pages` to get this.
- Spatial: We extract the bounding box of every word. The model uses these coordinates to understand the document structure (e.g., distinguishing a header from a table row). We use `extract_pdf_layout_per_mille` to get this.
- If extraction fails for any reason, we set an error message and return `None` (sketched below).
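A sketch of that step, assuming the two helpers live in `dorsal.file.helpers` and that errors are surfaced the same way as in the readiness check above:

```python
# Inside main(): gather the visual and spatial streams, failing gracefully.
try:
    pdf_pages = extract_pdf_pages(file_path)              # Visual: one PIL image per page
    pdf_layout = extract_pdf_layout_per_mille(file_path)  # Spatial: words + per-mille boxes
except Exception as exc:
    self.error = f"PDF extraction failed: {exc}"          # attribute name assumed
    return None
```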
3. Inference:
- We zip these two streams together in the main loop to process the document page by page.
- We pass the underlying `_inference_engine` three things:
  - The PDF page as a `PIL.Image` object
  - The bounding box and text of every word extracted from the PDF by the `extract_pdf_layout_per_mille` function
  - Each "question" from the class-level `SCHEMA_MAP` dictionary
4. Map to Schema:
The logic of converting outputs from models as complex as this is usually quite specialized, so rather than trying to use a helper from `dorsal.file.helpers`, we are better off doing the mapping and calculations ourselves.
We handle this by creating a helper method `_build_entity` on the InvoiceExtractor, which maps entities extracted from the underlying Hugging Face pipeline to the `open/entity-extraction` schema.
- `unit`: The underlying model uses a 0-1000 coordinate system, defined in `open/entity-extraction` as `per_mille`. We could convert it to another unit, but the schema supports `per_mille`, so we can leave it as it is (see the conversion sketch below).
- `box`: We calculate the `{x, y, width, height}` rectangle containing the answer by merging the bounding boxes of the tokens identified by the model. This creates a box around the tokens which informed the extraction.
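For illustration, if a downstream consumer did need pixel coordinates, converting from `per_mille` only requires the rendered page size (the `page_width_px` and `page_height_px` names here are hypothetical):

```python
# per_mille coordinates run 0-1000 across each axis of the page,
# so converting to pixels is a simple scale by the page dimensions.
x_px = round(box["x"] / 1000 * page_width_px)
y_px = round(box["y"] / 1000 * page_height_px)
width_px = round(box["width"] / 1000 * page_width_px)
height_px = round(box["height"] / 1000 * page_height_px)
```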
5. Return Result:
- Finally, we return a schema-compliant dictionary from the `main` method.
- In the case where the model was unable to extract any entities, we can safely return an empty list alongside the vocabulary, to tell downstream consumers "I tried to find these particular entities, but none were there" (see the sketch below).
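For example, an empty but still schema-compliant record might look like this (the vocabulary values match the labels used in this guide):

```python
# Returned from main() when the model finds nothing: downstream consumers
# still learn which labels were searched for.
empty_record = {
    "vocabulary": ["date", "string", "money"],
    "entities": [],
}
```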
Testing the Model
Let's test the model using the `run_model` function from `dorsal.testing`.
First Run Speed
Because it downloads weights and instantiates a fairly heavy Hugging Face pipeline in the background, processing the first document will typically be much slower than subsequent documents. This is called a "cold start".
As the model weights are cached locally (typically in `~/.cache/huggingface`), subsequent runs ("warm start") will be faster as they skip downloading and initializing.
from dorsal.file.dependencies import make_media_type_dependency
from dorsal.testing import run_model
dependencies = [
    make_media_type_dependency(include=["application/pdf"])
]
result = run_model(
    annotation_model=InvoiceExtractor,
    file_path="./test/documents/invoice_001.pdf",
    dependencies=dependencies,
    schema_id="open/entity-extraction"
)
print(result.model_dump_json(indent=2))
The returned `RunModelResult` shows all extracted entities:
{
"name": "InvoiceExtractor",
"source": {
"type": "Model",
"model": "github:dorsalhub/annotation-model-examples",
"version": "1.0.0",
"variant": "layoutlm-document-qa"
},
"record": {
"vocabulary": [
"date",
"string",
"money"
],
"entities": [
{
"id": "c4bb0712-b2c2-46b2-9964-001d10a9e5bf",
"concept": "InvoiceNumber",
"label": "string",
"text": "INV-9920-XQ",
"normalized_value": "INV-9920-XQ",
"score": 0.9988334774971008,
"location": [
{
"type": "block",
"block_type": "box",
"page_number": 1,
"unit": "per_mille",
"box": {
"x": 848,
"y": 70,
"width": 98,
"height": 9
}
}
]
},
{
"id": "ce3bbe77-f2a4-486a-b851-5cb84b46e17c",
"concept": "InvoiceDate",
"label": "date",
"text": "November 21, 2025",
"normalized_value": "November 21, 2025",
"score": 0.9999126195907593,
"location": [
{
"type": "block",
"block_type": "box",
"page_number": 1,
"unit": "per_mille",
"box": {
"x": 801,
"y": 87,
"width": 145,
"height": 10
}
}
]
},
{
"id": "ec868cd6-ad18-4c43-a4ff-fe230c68a0e2",
"concept": "GrandTotal",
"label": "money",
"text": "$3,572.25",
"normalized_value": "$3,572.25",
"score": 0.9905691146850586,
"location": [
{
"type": "block",
"block_type": "box",
"page_number": 1,
"unit": "per_mille",
"box": {
"x": 873,
"y": 103,
"width": 73,
"height": 11
}
}
],
"attributes": {
"currency": "USD"
}
}
]
},
"schema_id": "open/entity-extraction",
"time_taken": 12.481572052987758,
"error": null
}
Adding it to the Pipeline
- Now that we've tested the model, let's use the `register_model` function from `dorsal.api` to add it to our pipeline (example below).
- Make sure that InvoiceExtractor is defined outside of `__main__` in an importable path (e.g. in `invoice_extractor.py`).
Model must be importable
If you have been following this tutorial so far in Jupyter, make sure to move your InvoiceExtractor class to a `.py` file before registering it to the pipeline.
Registering a model copies its import path to your project config file, so it must be defined outside of `__main__`.
e.g. `from invoice_extractor import InvoiceExtractor` where `invoice_extractor.py` is in the same directory as your main script/notebook.
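A sketch of the registration call; `register_model`'s exact keyword arguments are not shown in this excerpt, so the ones below are assumptions modelled on `run_model` above:

```python
from dorsal.api import register_model
from dorsal.file.dependencies import make_media_type_dependency
from invoice_extractor import InvoiceExtractor  # your importable module

# Keyword names are assumptions modelled on run_model() in the testing section.
register_model(
    annotation_model=InvoiceExtractor,
    dependencies=[make_media_type_dependency(include=["application/pdf"])],
    schema_id="open/entity-extraction",
)
```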
You can confirm the model exists in the pipeline by calling `show_model_pipeline`, or by running `dorsal config pipeline show` in the CLI.
Testing the Pipeline
Now, when we scan a PDF, we get entity extraction automatically:
from dorsal import LocalFile
# Process the invoice
lf = LocalFile("./test/documents/invoice_001.pdf", use_cache=False)
# Access the specific schema
extraction = lf.get_annotation("open/entity-extraction")
if extraction:
    for entity in extraction.record.entities:
        print(f"{entity.concept}: {entity.text} ({entity.score:.2f})")
        print()
Output:
📄 Scanning metadata for simple_invoice_demo.pdf
╭─────────────────────── File Record: simple_invoice_demo.pdf ───────────────────────╮
│ │
│ Hashes │
│ SHA-256: 768b1efb2a702ed4086eeb70de219bd02659e7929e1a6c17afb323e1048028b6 │
│ BLAKE3: 73c02134b51ec66a468756ab5eef168efab68afe9f90ec456545cbb4425b8efb │
│ │
│ File Info │
│ Full Path: /dev/test/documents/invoice_001.pdf │
│ Modified: 2025-11-21 16:58:09 │
│ Name: simple_invoice_demo.pdf │
│ Size: 2 KiB │
│ Media Type: application/pdf │
│ │
│ Tags │
│ No tags found. │
│ │
│ Pdf Info │
│ producer: PyFPDF 1.7.2 http://pyfpdf.googlecode.com/ │
│ version: 1.3 │
│ page_count: 1 │
│ creation_date: 2025-11-21T16:58:09 │
│ │
│ Entity-Extraction Info │
│ vocabulary: date, string, money │
│ entities: │
│ id: 0dcff482-4a2f-41db-ac79-cad9c407093c │
│ concept: InvoiceNumber │
│ label: string │
│ text: INV-9920-XQ │
│ normalized_value: INV-9920-XQ │
│ score: 0.9988334774971008 │
│ location: │
│ type: block │
│ block_type: box │
│ page_number: 1 │
│ unit: per_mille │
│ box: │
│ x: 848 │
│ y: 70 │
│ width: 98 │
│ height: 9 │
│ id: db8b1c9f-9ee0-4d89-9e6f-14c0eab220b8 │
│ concept: InvoiceDate │
│ label: date │
│ text: November 21, 2025 │
│ normalized_value: November 21, 2025 │
│ score: 0.9999126195907593 │
│ location: │
│ type: block │
│ block_type: box │
│ page_number: 1 │
│ unit: per_mille │
│ box: │
│ x: 801 │
│ y: 87 │
│ width: 145 │
│ height: 10 │
│ id: 175361b4-1a99-417d-b42e-8a5e87067e7f │
│ concept: GrandTotal │
│ label: money │
│ text: $3,572.25 │
│ normalized_value: $3,572.25 │
│ score: 0.9905691146850586 │
│ location: │
│ type: block │
│ block_type: box │
│ page_number: 1 │
│ unit: per_mille │
│ box: │
│ x: 873 │
│ y: 103 │
│ width: 73 │
│ height: 11 │
│ attributes: │
│ currency: USD │
│ │
│ │
╰────────────────────────────────────────────────────────────────────────────────────╯
Cleanup
Let's remove InvoiceExtractor from our pipeline using the CLI command `dorsal config pipeline remove`:
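For example (the identifier expected by the CLI is an assumption; it may be the model's import path or its `id`):

```bash
dorsal config pipeline remove InvoiceExtractor
```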
Conclusion
Over this three-part series, you have gone from counting words in a text file to building a production-grade, multimodal AI extraction pipeline.
You have learned:
- Structure: How to wrap logic in `AnnotationModel` classes.
- Integration: How to register models and manage dependencies.
- Validation: How to use Schemas to ensure your data is clean, typed, and usable.
You can now build any extraction logic you can imagine, plug it into Dorsal, and automatically process your files into structured records.