Annotation Model Demo: Document Summarization
This is a Bonus Chapter in our series on Custom Annotation Models.
This model demonstrates linking Dorsal with an LLM (ChatGPT) for metadata extraction: it pulls text from documents (PDFs, Word docs, or plain-text files), sends it to the OpenAI API, and standardizes the response.
ChatGPTSummarizer
The `ChatGPTSummarizer` is a wrapper around the OpenAI API that leverages Dorsal's internal preprocessing tools.
It handles:
1. Extraction: Converts PDFs and Office docs into raw text.
2. Orchestration: Sends the extracted text to GPT-4o.
3. Storage: Saves the summary and cost data to the `open/llm-output` schema.
Prerequisites
You will need the `openai` Python library and an OpenAI API key.
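Install the client library and expose your key through the `OPENAI_API_KEY` environment variable, which the `openai` client reads by default:

```bash
pip install openai
export OPENAI_API_KEY="sk-..."  # your OpenAI API key
```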
The Code
Save this as chatgpt_summarizer.py.
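A minimal sketch of the class is shown below. The Dorsal-specific pieces are hedged: the `AnnotationModel` base class, the `main(self, file)` hook, the `file.media_type`/`file.path` attributes, the extractor import path, and the `variant` attribute are assumptions about Dorsal's interface, while `extract_pdf_text`, `extract_docx_text`, the `open/llm-output` schema, and the `response_data` field come straight from this chapter. The OpenAI calls use the official `openai` v1 client.

```python
from openai import OpenAI

# NOTE: the base-class name and the extractor import path below are
# assumptions -- check the Dorsal docs for the exact locations.
from dorsal.models import AnnotationModel  # hypothetical base class
from dorsal.file.extract import extract_pdf_text, extract_docx_text  # hypothetical path

DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

PROMPT = "Summarize the following document as a short bulleted list:\n\n{text}"


class ChatGPTSummarizer(AnnotationModel):
    schema_id = "open/llm-output"
    variant = "gpt-4o"  # surfaced as result.source.variant (attribute name is an assumption)

    def main(self, file):  # `file.media_type` / `file.path` are assumed attributes
        # 1. Extraction: convert the file to raw text based on its media type.
        if file.media_type == "application/pdf":
            text = extract_pdf_text(file.path)
        elif file.media_type == DOCX:
            text = extract_docx_text(file.path)
        else:  # text/plain, text/markdown
            with open(file.path, encoding="utf-8") as fh:
                text = fh.read()

        # 2. Orchestration: send the text to GPT-4o.
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": PROMPT.format(text=text)}],
        )

        # 3. Storage: return a record matching the open/llm-output schema,
        #    including token counts so downstream cost tracking can use them.
        return {
            "response_data": response.choices[0].message.content,
            "input_tokens": response.usage.prompt_tokens,
            "output_tokens": response.usage.completion_tokens,
        }
```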
Dependencies
Since this model supports multiple file formats, we should define a dependency list that includes all of them. This ensures the pipeline only runs the model on files it can actually process.
```python
from dorsal.file.dependencies import make_media_type_dependency

# Define supported formats
supported_formats = [
    "text/plain",
    "text/markdown",
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",  # .docx
]

doc_dependency = make_media_type_dependency(include=supported_formats)
```
Testing the Summarizer
Now we can test the model on a real document to verify the extraction and summarization logic:
```python
from dorsal.testing import run_model
from dorsal.file.dependencies import make_media_type_dependency

from chatgpt_summarizer import ChatGPTSummarizer

# 1. Define dependencies
supported_formats = [
    "text/plain",
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
]
doc_dependency = make_media_type_dependency(include=supported_formats)

# 2. Run the model on a PDF
result = run_model(
    annotation_model=ChatGPTSummarizer,
    file_path="./documents/Q3_Report.pdf",
    schema_id="open/llm-output",
    dependencies=[doc_dependency],
)

if result.error:
    print(f"Error: {result.error}")
else:
    print(f"Model: {result.source.variant}")
    print(f"Summary: \n{result.record['response_data']}")
```
Output:

```
Model: gpt-4o
Summary:
- Revenue increased by 15% quarter-over-quarter due to strong enterprise sales.
- The engineering team successfully deployed the new authentication microservice.
- Operational costs were higher than projected due to cloud infrastructure scaling.
```
Adding it to the Pipeline
- Now that we've tested the model, let's use the `register_model` function from `dorsal.api` to add it to our pipeline (see the sketch after the note below).
- Make sure that `ChatGPTSummarizer` is defined outside of `__main__` in an importable path (e.g. in `chatgpt_summarizer.py`).
Model must be importable
If you have been following this tutorial in Jupyter, move your `ChatGPTSummarizer` class to a `.py` file before registering it with the pipeline.
Registering a model copies its import path to your project config file, so the model must be defined outside of `__main__`, e.g. `from chatgpt_summarizer import ChatGPTSummarizer` where `chatgpt_summarizer.py` sits in the same directory as your main script/notebook.
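As a sketch, registration might look like the following. The exact signature of `register_model` is an assumption here (it mirrors `run_model`'s keyword arguments); consult the Dorsal docs for the real parameters.

```python
from dorsal.api import register_model

# Assumes doc_dependency is defined alongside the class in chatgpt_summarizer.py.
from chatgpt_summarizer import ChatGPTSummarizer, doc_dependency

# Hypothetical signature, mirroring run_model's keyword arguments.
register_model(
    annotation_model=ChatGPTSummarizer,
    schema_id="open/llm-output",
    dependencies=[doc_dependency],
)
```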
Conclusion
By combining Dorsal's internal extraction tools (`extract_pdf_text`, `extract_docx_text`) with an external reasoning engine (ChatGPT), you have built a powerful pipeline step.
Your system can now ingest raw business documents and automatically generate concise, searchable summaries without manual intervention.