Preprocessing
dorsal.file.preprocessing
extract_pdf_layout_normalized
Extracts text and layout (bounding boxes) with coordinates normalized to floats in the range 0.0 to 1.0.
Useful for resolution-independent geometry processing or for computer vision models (such as YOLO or R-CNN) that expect relative coordinates.
Coordinates
- Origin: Top-Left (0, 0)
- Unit: Float (0.0 to 1.0)
- Format: [x0, y0, x1, y1]
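Because the coordinates are relative, a box can be mapped onto a rendering of the page at any resolution by multiplying by the image size. The following is a minimal sketch of that mapping; `normalized_to_pixels` is a hypothetical helper, not part of `dorsal`, and the box values are made up for illustration.

```python
def normalized_to_pixels(box, image_width, image_height):
    """Scale a normalized [x0, y0, x1, y1] box (top-left origin) to pixels.

    Hypothetical helper for illustration; not part of the dorsal API.
    """
    x0, y0, x1, y1 = box
    return [
        round(x0 * image_width),   # left edge
        round(y0 * image_height),  # top edge
        round(x1 * image_width),   # right edge
        round(y1 * image_height),  # bottom edge
    ]


# Illustrative box on a 1224 x 1584 pixel rendering of a page.
box = [0.25, 0.10, 0.75, 0.20]
pixel_box = normalized_to_pixels(box, image_width=1224, image_height=1584)
print(pixel_box)  # [306, 158, 918, 317]
```

The same normalized box stays valid if the page is re-rendered at a different scale; only `image_width` and `image_height` change.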
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str | Path to the PDF file. | required |
| password | str \| None | Password for encrypted PDFs. | None |
Returns:
| Type | Description |
|---|---|
| list[PDFPage[float]] | List of PDFPage objects containing float coordinates. |
Source code in venv/lib/python3.13/site-packages/dorsal/file/preprocessing/pdf.py
extract_pdf_layout_per_mille
Extracts text and layout (bounding boxes) with coordinates scaled to integers from 0 to 1000 (the 'per_mille' convention).
This is the standard input format for models like LayoutLM (v1/v2/v3), LiLT, and Donut, which require integer coordinates discretized to a 1000x1000 grid.
Coordinates
- Origin: Top-Left (0, 0)
- Unit: Integer (0 to 1000)
- Format: [x0, y0, x1, y1]
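The relationship between normalized and per-mille coordinates is a multiplication by 1000 followed by discretization. The sketch below shows one way to do that conversion by hand; `to_per_mille` is a hypothetical helper, and the rounding and clamping choices are assumptions that may differ from what the library does internally.

```python
def to_per_mille(box):
    """Convert a normalized [x0, y0, x1, y1] box to integers in 0..1000.

    Hypothetical helper for illustration. Rounding to the nearest integer
    and clamping to the grid are assumptions, not documented behavior.
    """
    return [min(1000, max(0, round(value * 1000))) for value in box]


print(to_per_mille([0.25, 0.10, 0.75, 0.20]))  # [250, 100, 750, 200]

# Values slightly outside the page are clamped onto the grid.
print(to_per_mille([-0.01, 0.0, 1.2, 1.0]))  # [0, 0, 1000, 1000]
```

Clamping keeps every coordinate a valid index for models that look boxes up in a fixed-size position embedding table.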
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str | Path to the PDF file. | required |
| password | str \| None | Password for encrypted PDFs. | None |
Returns:
| Type | Description |
|---|---|
| list[PDFPage[int]] | List of PDFPage objects containing integer coordinates. |
extract_pdf_layout_pts
Extracts text and layout using raw PDF Points (pt).
Note: While PDF uses a Bottom-Left origin internally, this function transforms coordinates to Top-Left origin to match standard computer vision and web coordinate systems.
Coordinates
- Origin: Top-Left (0, 0)
- Unit: PDF Points (1/72 inch) - Float
- Format: [x0, y0, x1, y1]
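The origin change described in the note amounts to subtracting each vertical coordinate from the page height and swapping the top and bottom edges. A minimal sketch of that transform follows; `flip_to_top_left` is a hypothetical helper and the page size is an assumed US Letter page (612 x 792 pt), not something taken from the library.

```python
def flip_to_top_left(box, page_height):
    """Convert a bottom-left-origin [x0, y0, x1, y1] box to top-left origin.

    Hypothetical helper for illustration; not part of the dorsal API.
    In the input, y0 is the bottom edge and y1 is the top edge, both
    measured upward from the bottom of the page.
    """
    x0, y0, x1, y1 = box
    # The old top edge (y1) becomes the new y0, measured down from the top.
    return [x0, page_height - y1, x1, page_height - y0]


# A line of text 1 inch from the left and 1 inch below the top of the page.
native_box = [72.0, 700.0, 300.0, 720.0]
print(flip_to_top_left(native_box, page_height=792.0))  # [72.0, 72.0, 300.0, 92.0]
```

After the flip, `y0 < y1` still holds, so the box keeps the same [x0, y0, x1, y1] ordering used by the other extraction functions.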
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str | Path to the PDF file. | required |
| password | str \| None | Password for encrypted PDFs. | None |
Returns:
| Type | Description |
|---|---|
| list[PDFPage[float]] | List of PDFPage objects containing float coordinates. |
extract_pdf_pages
Yields PDF pages as PIL.Image.Image.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str | Path to the PDF file. | required |
| scale | float | Rasterization scale. 1.0 ~= 72 DPI. 2.0 ~= 144 DPI (Recommended for OCR/VDU). | 2.0 |
| password | str \| None | Password for encrypted PDFs. | None |
| pages | list[int] \| None | Optional list of 0-indexed page numbers to render. If None, renders all. | None |
Yields:
| Type | Description |
|---|---|
| Any | PIL.Image.Image: A Pillow image object for the page. |
Raises:
| Type | Description |
|---|---|
| PDFProcessingError | If the file cannot be read. |
| DependencyError | If 'pypdfium2' or 'Pillow' are missing. |
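Since a PDF point is 1/72 inch, the `scale` parameter maps to resolution as roughly 72 x scale DPI. The sketch below works through that arithmetic; the helper names are hypothetical, the page size is an assumed US Letter page (612 x 792 pt), and the rendered size is an estimate because the renderer's own rounding may differ by a pixel.

```python
POINTS_PER_INCH = 72.0


def scale_to_dpi(scale):
    """Approximate DPI produced by a given rasterization scale."""
    return POINTS_PER_INCH * scale


def dpi_to_scale(dpi):
    """Rasterization scale needed to reach a target DPI."""
    return dpi / POINTS_PER_INCH


def estimated_pixel_size(width_pt, height_pt, scale):
    """Estimate the rendered image size in pixels for a page given in points.

    Hypothetical helper; the actual renderer may round differently.
    """
    return (round(width_pt * scale), round(height_pt * scale))


print(scale_to_dpi(2.0))                       # 144.0
print(estimated_pixel_size(612.0, 792.0, 2.0)) # (1224, 1584)
print(round(dpi_to_scale(300), 3))             # 4.167
```

Doubling the scale quadruples the pixel count, so memory use grows quickly when rendering many pages at high resolution.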
extract_pdf_text
Extracts raw text from a PDF file, page by page.
This is a high-level helper designed for text-analysis models that do *not* require
spatial layout information (bounding boxes).
Args:
file_path: Path to the PDF file.
password: Password for encrypted PDFs.
Returns:
list[str]: A list of strings, where each string is the raw text content of a single page.
Example:
>>> pages = extract_pdf_text("contract.pdf")
>>> print(pages[0])
"2023-01-03
LEGAL CONTRACT Regarding..."
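Because the result is a plain list with one string per page, page-level operations reduce to ordinary list handling. The sketch below searches such a list for a keyword; `find_pages` is a hypothetical helper, and the `pages` list is made-up stand-in data in the documented return shape rather than real output.

```python
def find_pages(pages, keyword):
    """Return the 0-based indices of pages whose text contains the keyword.

    Hypothetical helper for illustration; matching is case-insensitive.
    """
    needle = keyword.casefold()
    return [index for index, text in enumerate(pages) if needle in text.casefold()]


# Stand-in for the list[str] returned by extract_pdf_text (made-up content).
pages = [
    "First page: LEGAL CONTRACT header",
    "Second page: terms and conditions",
    "Third page: contract signatures",
]

print(find_pages(pages, "contract"))  # [0, 2]
```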