Annotation Models
dorsal.file.annotation_models
EbookAnnotationModel
Bases: AnnotationModel
Extracts metadata from common ebook formats (currently only EPUB is supported).
Source code in venv/lib/python3.12/site-packages/dorsal/common/model.py
main
Extracts metadata by dispatching to the correct format-specific parser.
Returns:
| Type | Description |
|---|---|
| `dict[str, Any] \| None` | |
Source code in venv/lib/python3.12/site-packages/dorsal/file/annotation_models/ebook/model.py
FileCoreAnnotationModel
Bases: AnnotationModel
Annotation model for extracting core file metadata.
- Calculate file hashes (SHA256, TLSH, BLAKE3 and QUICK).
- Determine basic file attributes: name, extension, size and media type.
- This model is designed for use in the ModelRunner.
- Its main method outputs a dictionary conforming to FileCoreValidationModel.
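The kind of metadata listed above can be approximated with the standard library alone. The following is a minimal sketch, not the model's actual code: the function name `core_file_metadata` and the dictionary keys are hypothetical, only SHA256 is computed (TLSH, BLAKE3 and QUICK are left out), and the media type is guessed from the file name rather than from file content.

```python
import hashlib
import mimetypes
from pathlib import Path
from typing import Any


def core_file_metadata(path: str, chunk_size: int = 1024 * 1024) -> dict[str, Any]:
    """Hypothetical helper: gather name, extension, size, media type and SHA256."""
    file_path = Path(path)
    sha256 = hashlib.sha256()
    # Hash in chunks so large files are never loaded into memory at once.
    # A missing or unreadable file raises FileNotFoundError/OSError here.
    with file_path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha256.update(chunk)
    # Assumption: a name-based guess stands in for real media type detection.
    media_type, _ = mimetypes.guess_type(file_path.name)
    return {
        "name": file_path.name,
        "extension": file_path.suffix.lower(),
        "size": file_path.stat().st_size,
        "media_type": media_type or "application/octet-stream",
        "sha256": sha256.hexdigest(),
    }
```

File access errors are deliberately not caught in this sketch, so a caller such as a runner can decide how to handle them.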
Source code in venv/lib/python3.12/site-packages/dorsal/common/model.py
main
Main execution method for the FileCoreAnnotationModel.
Orchestrates the extraction of all fundamental file metadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `calculate_similarity_hash` | `bool` | If True, the TLSH similarity hash will be calculated and included in the results. Defaults to False. | `False` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any] \| None` | A dictionary containing the extracted file metadata, conforming to the structure expected by […]. Returns None if a recoverable error specific to this model's logic occurs and […] (critical OS/IO errors propagate). |
Raises:
| Type | Description |
|---|---|
| `(FileNotFoundError, IOError, OSError)` | If critical issues occur during file access (e.g., for hashing, size, media type determination). These are expected to be caught by the ModelRunner. |
Source code in venv/lib/python3.12/site-packages/dorsal/file/annotation_models/base/model.py
MediaInfoAnnotationModel
Bases: AnnotationModel
Extract metadata from media files using the pymediainfo library.
This model parses the output of MediaInfo (obtained as JSON) and organizes it into a structured dictionary with a main "General" track and lists for other track types (Video, Audio, Text, etc.).
Source code in venv/lib/python3.12/site-packages/dorsal/common/model.py
main
Extract, normalize, and structure metadata from the media file.
Returns:
| Type | Description |
|---|---|
| `dict[str, Any] \| None` | |
Source code in venv/lib/python3.12/site-packages/dorsal/file/annotation_models/mediainfo/model.py
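The structure described for MediaInfoAnnotationModel (one "General" track plus lists for the other track types) can be sketched as a simple grouping step. This is an illustration under assumptions, not the model's code: the function name `group_tracks` is hypothetical, and the input is assumed to be a list of per-track dictionaries that each carry a `track_type` key.

```python
from typing import Any


def group_tracks(tracks: list[dict[str, Any]]) -> dict[str, Any]:
    """Hypothetical helper: one "General" dict, lists for every other track type."""
    structured: dict[str, Any] = {}
    for track in tracks:
        # Drop the type marker from the stored fields; it becomes the key instead.
        fields = {key: value for key, value in track.items() if key != "track_type"}
        track_type = track.get("track_type", "Other")
        if track_type == "General":
            # If several General tracks appear, the last one wins in this sketch.
            structured["General"] = fields
        else:
            structured.setdefault(track_type, []).append(fields)
    return structured


# Example input with made-up values.
sample_tracks = [
    {"track_type": "General", "format": "MPEG-4"},
    {"track_type": "Video", "width": 1920, "height": 1080},
    {"track_type": "Audio", "channel_s": 2},
    {"track_type": "Audio", "channel_s": 6},
]
structured = group_tracks(sample_tracks)
```

With this input, `structured["General"]` is a single dictionary while `structured["Audio"]` is a list of two dictionaries, which keeps files with several audio or subtitle streams easy to iterate over.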
OfficeDocumentAnnotationModel
Bases: AnnotationModel
Extracts metadata from Microsoft Office formats (OOXML: .docx, .xlsx, .pptx). This model acts as a dispatcher, calling the correct stdlib-based parser.
Source code in venv/lib/python3.12/site-packages/dorsal/common/model.py
main
Dispatches to the correct format-specific parser based on media_type.
Source code in venv/lib/python3.12/site-packages/dorsal/file/annotation_models/office_document/model.py
PDFAnnotationModel
Bases: AnnotationModel
Extract metadata from PDF files using pypdfium2.
Source code in venv/lib/python3.12/site-packages/dorsal/common/model.py
main
Extract, normalize, and return metadata from the PDF file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `password` | `str \| None` | Optional password from pipeline config. | `None` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any] \| None` | Dictionary of normalized PDF metadata if successful. None if the PDF cannot be read or essential metadata extraction fails, with […] |
Raises:
| Type | Description |
|---|---|
| `ImportError` | If […] |
| `Exception` | For other critical, unrecoverable errors from […] |