Annotation Model Demo: Document Summarization

This is a Bonus Chapter in our series on Custom Annotation Models.

This model demonstrates how to link Dorsal with an LLM (ChatGPT) for metadata extraction. It extracts text from documents (PDFs, Word docs, or text files), sends it to the OpenAI API, and standardizes the response.

ChatGPTSummarizer

The ChatGPTSummarizer is a wrapper around the OpenAI API that leverages Dorsal's internal preprocessing tools.

It handles:

1. Extraction: Converts PDFs and Office docs into raw text.
2. Orchestration: Sends the text to GPT-4o.
3. Storage: Saves the summary and cost data to the open/llm-output schema.

Prerequisites

You will need the openai Python library and an API key.

pip install openai
export OPENAI_API_KEY="sk-..."
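
Before writing any model code, it's worth confirming the key is actually visible to Python; the OpenAI client reads OPENAI_API_KEY from the environment when it is constructed. A minimal sanity check:

import os

if not os.environ.get("OPENAI_API_KEY"):
    raise SystemExit("OPENAI_API_KEY is not set. Export it before running the summarizer.")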

The Code

Save this as chatgpt_summarizer.py.

chatgpt_summarizer.py
from typing import Any, Dict
from openai import OpenAI
from dorsal import AnnotationModel

from dorsal.file.preprocessing.pdf import extract_pdf_text
from dorsal.file.preprocessing.office import extract_docx_text

class ChatGPTSummarizer(AnnotationModel):
    """
    Summarizes documents (PDF, Docx, Text) using OpenAI's ChatGPT.
    """
    id = "github:dorsalhub/llm-examples"
    version = "1.0.0"

    variant = None   # set to the OpenAI model name at run time
    _client = None   # cached OpenAI client (created lazily)

    def _get_client(self):
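        """Lazily create and cache the OpenAI client (reads OPENAI_API_KEY from the environment)."""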
        if self._client is None:
            self._client = OpenAI()
        return self._client

    def _extract_text(self) -> str:
        """Helper to standardize text extraction across formats."""
        try:
            # 1. Handle PDF
            if self.media_type == "application/pdf":
                # Returns list[str] (pages), join them into one block
                pages = extract_pdf_text(self.file_path)
                return "\n".join(pages)

            # 2. Handle Word (.docx)
            elif self.media_type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
                pages = extract_docx_text(self.file_path)
                return "\n".join(pages)

            # 3. Handle Text Files (Default)
            elif self.media_type.startswith("text/"):
                with open(self.file_path, "r", encoding="utf-8") as f:
                    return f.read()

            else:
                raise ValueError(f"Unsupported media type: {self.media_type}")

        except Exception as e:
            # Re-raise (chained) so main() can catch and log it properly
            raise RuntimeError(f"Extraction failed: {e}") from e

    def main(
        self, 
        model: str = "gpt-4o", 
        max_length: int = 15000
    ) -> Dict[str, Any] | None:

        self.variant = model

        # 1. Extract content
        try:
            content = self._extract_text()

            if not content.strip():
                self.set_error("File contains no extractable text.")
                return None

            # Simple character-based truncation to bound token costs
            # (a rough proxy; use a tokenizer such as tiktoken for exact limits)
            if len(content) > max_length:
                content = content[:max_length] + "...[truncated]"

        except Exception as e:
            self.set_error(str(e))
            return None

        # 2. Call OpenAI
        client = self._get_client()
        system_prompt = "Summarize the following document in 3 concise bullet points."

        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": content}
                ],
                temperature=0.5
            )
        except Exception as e:
            self.set_error(f"OpenAI API Error: {e}")
            return None

        # 3. Map to 'open/llm-output' Schema
        result_message = response.choices[0].message

        return {
            "model": model,
            "prompt": content, 
            "response_data": result_message.content,
            "generation_params": {
                "system_prompt": system_prompt,
                "temperature": 0.5
            },
            "generation_metadata": {
                "response_id": response.id,
                "finish_reason": response.choices[0].finish_reason,
                "usage": {
                    "prompt_tokens": response.usage.prompt_tokens,
                    "completion_tokens": response.usage.completion_tokens,
                    "total_tokens": response.usage.total_tokens
                }
            }
        }
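
Because main() stores raw token counts under generation_metadata, you can estimate spend per document after the fact. A minimal sketch, using placeholder per-1K-token prices (the rates below are illustrative only; check OpenAI's current pricing):

def estimate_cost(usage: dict, input_per_1k: float = 0.005, output_per_1k: float = 0.015) -> float:
    """Rough dollar estimate from a stored usage block. Prices are placeholders, not live rates."""
    return (
        usage["prompt_tokens"] / 1000 * input_per_1k
        + usage["completion_tokens"] / 1000 * output_per_1k
    )

# e.g. estimate_cost(record["generation_metadata"]["usage"])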

Dependencies

Since this model supports multiple file formats, we should define a dependency list that includes all of them. This ensures the pipeline only runs the model on files it can actually process.

from dorsal.file.dependencies import make_media_type_dependency

# Define supported formats
supported_formats = [
    "text/plain", 
    "text/markdown", 
    "application/pdf", 
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document" # .docx
]

doc_dependency = make_media_type_dependency(include=supported_formats)

Testing the Summarizer

Now we can test the model on a real document to verify the extraction and summarization logic:

Testing on a PDF
from dorsal.testing import run_model
from dorsal.file.dependencies import make_media_type_dependency
from chatgpt_summarizer import ChatGPTSummarizer

# 1. Define dependencies
supported_formats = [
    "text/plain", 
    "text/markdown", 
    "application/pdf", 
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
]
doc_dependency = make_media_type_dependency(include=supported_formats)

# 2. Run the model on a PDF
result = run_model(
    annotation_model=ChatGPTSummarizer,
    file_path="./documents/Q3_Report.pdf",
    schema_id="open/llm-output",
    dependencies=[doc_dependency]
)

if result.error:
    print(f"Error: {result.error}")
else:
    print(f"Model: {result.source.variant}")
    print(f"Summary: \n{result.record['response_data']}")

Output:

Model: gpt-4o
Summary: 
- Revenue increased by 15% quarter-over-quarter due to strong enterprise sales.
- The engineering team successfully deployed the new authentication microservice.
- Operational costs were higher than projected due to cloud infrastructure scaling.
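
The full usage block is stored alongside the summary, so token counts can be inspected from the same result (field names follow the schema mapping in main() above):

usage = result.record["generation_metadata"]["usage"]
print(f"Tokens used: {usage['total_tokens']}")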

Adding it to the Pipeline

  • Now that we've tested the model, let's use the register_model function from dorsal.api to add it to our pipeline.

  • Make sure that ChatGPTSummarizer is defined outside of __main__ in an importable path (e.g. in chatgpt_summarizer.py).

Model must be importable

If you have been following this tutorial so far in Jupyter, make sure to move your ChatGPTSummarizer class to a .py file before registering it to the pipeline.

Registering a model copies its import path to your project config file, so it must be defined outside of __main__.

e.g. from chatgpt_summarizer import ChatGPTSummarizer where chatgpt_summarizer.py is in the same directory as your main script/notebook.
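
For example, a minimal layout (file names here are illustrative):

my_project/
├── chatgpt_summarizer.py   # defines ChatGPTSummarizer
└── notebook.ipynb          # imports and registers the model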

Registering the Summarizer
from dorsal.api import register_model
from dorsal.file.dependencies import make_media_type_dependency
from chatgpt_summarizer import ChatGPTSummarizer

supported_formats = [
    "text/plain", 
    "text/markdown", 
    "application/pdf", 
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
]
doc_dependency = make_media_type_dependency(include=supported_formats)

register_model(
    annotation_model=ChatGPTSummarizer,
    schema_id="open/llm-output",
    dependencies=[doc_dependency]
)

Conclusion

By combining Dorsal's internal extraction tools (extract_pdf_text, extract_docx_text) with an external reasoning engine (ChatGPT), you have built a powerful pipeline step.

Your system can now ingest raw business documents and automatically generate concise, searchable summaries without manual intervention.