Annotation Models

dorsal.file.annotation_models

EbookAnnotationModel

EbookAnnotationModel(file_path)

Bases: AnnotationModel

Extracts metadata from common ebook formats (currently only EPUB is supported).

Source code in venv/lib/python3.12/site-packages/dorsal/common/model.py
def __init__(self, file_path: str):
    """Initializes the model, setting the file_path."""
    self.file_path = file_path
    self.error: str | None = None
    self.name: str | None = None
    self.extension: str | None = None
    self.size: int | None = None
    self.media_type: str | None = None
    self.hash: str | None = None
    self.similarity_hash: str | None = None
    self.quick_hash: str | None = None

main

main()

Extracts metadata by dispatching to the correct format-specific parser.

Returns:

Type Description
dict[str, Any] | None
  • Dictionary of ebook metadata if successful.
  • None if the format is unsupported or parsing fails.

Source code in venv/lib/python3.12/site-packages/dorsal/file/annotation_models/ebook/model.py
def main(self) -> dict[str, Any] | None:
    """
    Extracts metadata by dispatching to the correct format-specific parser.

    Returns:
      * Dictionary of ebook metadata if successful.
      * None if the format is unsupported or parsing fails.
    """
    logger.debug(
        "EbookAnnotationModel: Starting metadata extraction for '%s'",
        self.file_path,
    )

    try:
        _, ext = os.path.splitext(self.file_path)
        parser_type = EBOOK_FORMAT_MAPPING.get(ext.lower())

        metadata: dict[str, Any] | None = None

        if parser_type == "epub":
            self.variant = "epub_stdlib"
            metadata = ebook_utils.extract_epub_metadata(self.file_path)

        else:
            self.error = f"Unsupported ebook format: '{ext}' for file: {self.file_path}"
            logger.info(self.error)
            return None

        if metadata is None:
            self.error = f"Failed to parse ebook metadata for file: {self.file_path} (parser: {self.variant})"
            logger.warning(self.error)
            return None

        logger.debug(
            "EbookAnnotationModel: Successfully processed '%s' with parser '%s'",
            self.file_path,
            self.variant,
        )
        return metadata

    except ImportError as e:
        self.error = f"Missing dependency for parser '{self.variant}': {e}. Cannot process file."
        logger.error(self.error, exc_info=True)
        return None

    except (FileNotFoundError, IOError, OSError) as e:
        self.error = f"File system error during ebook processing: {type(e).__name__}: {e}"
        logger.error(
            "EbookAnnotationModel: CRITICAL OS/IO Error for '%s'. Error: %s",
            self.file_path,
            self.error,
            exc_info=True,
        )
        raise

    except Exception as e:
        self.error = f"Unexpected error during EbookAnnotationModel processing: {type(e).__name__}: {e}"
        logger.error(
            "EbookAnnotationModel: UNEXPECTED Error for '%s'. Error: %s",
            self.file_path,
            self.error,
            exc_info=True,
        )
        raise

FileCoreAnnotationModel

FileCoreAnnotationModel(file_path)

Bases: AnnotationModel

Annotation model for extracting core file metadata.

  • Calculates file hashes (SHA-256, TLSH, BLAKE3 and QUICK).
  • Determines basic file attributes: name, extension, size and media type.
  • Designed for use in the ModelRunner.
  • Its main method outputs a dictionary conforming to FileCoreValidationModel.

main

main(calculate_similarity_hash=False)

Main execution method for the FileCoreAnnotationModel.

Orchestrates the extraction of all fundamental file metadata.

Parameters:

Name Type Description Default
calculate_similarity_hash bool

If True, the TLSH similarity hash will be calculated and included in the results. Defaults to False.

False

Returns:

Type Description
dict[str, Any] | None
  • A dictionary containing the extracted file metadata, conforming to the structure expected by FileCoreValidationModelStrict.
  • None if a recoverable error specific to this model's logic occurs and self.error is set (though the current implementation tends to let critical OS/IO errors propagate).

Raises:

Type Description
(FileNotFoundError, IOError, OSError)

If critical issues occur during file access (e.g., for hashing, size, media type determination). These are expected to be caught by the ModelRunner.

Source code in venv/lib/python3.12/site-packages/dorsal/file/annotation_models/base/model.py
def main(self, calculate_similarity_hash: bool = False) -> dict[str, Any] | None:
    """
    Main execution method for the FileCoreAnnotationModel.

    Orchestrates the extraction of all fundamental file metadata.

    Args:
        calculate_similarity_hash: If True, the TLSH similarity hash will be
                                   calculated and included in the results.
                                   Defaults to False.

    Returns:
        A dictionary containing the extracted file metadata, conforming to
        the structure expected by `FileCoreValidationModelStrict`.
        Returns None if a recoverable error specific to this model's logic occurs
        and `self.error` is set (though current implementation tends to let
        critical OS/IO errors propagate).

    Raises:
        FileNotFoundError, IOError, OSError: If critical issues occur during
            file access (e.g., for hashing, size, media type determination).
            These are expected to be caught by the ModelRunner.
    """
    try:
        logger.debug(
            "FileCoreAnnotationModel main: Starting processing for '%s'",
            self.file_path,
        )

        hashes = self._get_file_hashes(calculate_similarity_hash=calculate_similarity_hash)
        primary_hash = hashes.get("SHA-256")
        if not primary_hash:
            self.error = "Core SHA-256 hash calculation failed."
            logger.error("%s File: '%s'", self.error, self.file_path)
            return None

        tlsh_hash = hashes.get("TLSH")
        quick_hash = hashes.get("QUICK")

        all_hashes_list = [{"id": hash_name, "value": hash_value} for hash_name, hash_value in hashes.items()]

        file_name = self._get_filename()
        file_extension = self._get_file_extension(file_name=file_name)
        file_size = self._get_filesize()

        media_type = self._get_media_type(file_extension=file_extension)

        logger.debug(
            "FileCoreAnnotationModel main: Successfully processed '%s'",
            self.file_path,
        )
        return {
            "hash": primary_hash,
            "similarity_hash": tlsh_hash,
            "quick_hash": quick_hash,
            "all_hashes": all_hashes_list,
            "name": file_name,
            "extension": file_extension,
            "size": file_size,
            "media_type": media_type,
        }
    except (FileNotFoundError, IOError, OSError) as e:
        self.error = f"File system error during processing: {type(e).__name__}: {e}"
        logger.error(
            "FileCoreAnnotationModel main: CRITICAL OS/IO Error for '%s'. Error: %s",
            self.file_path,
            self.error,
            exc_info=True,
        )
        raise
    except Exception as e:
        self.error = f"Unexpected error during FileCoreAnnotationModel processing: {type(e).__name__}: {e}"
        logger.error(
            "FileCoreAnnotationModel main: UNEXPECTED Error for '%s'. Error: %s",
            self.file_path,
            self.error,
            exc_info=True,
        )
        raise
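
The shape of the returned record can be reproduced with a small standard-library sketch. It is only an approximation: `sha256_of_file` and `core_record` are hypothetical helpers, only SHA-256 is computed (TLSH, BLAKE3 and QUICK are omitted), and guessing the media type from the file name with `mimetypes` is an assumption, since the source does not show how `_get_media_type` works. The extension format (lower case, leading dot) is likewise assumed.

```python
import hashlib
import mimetypes
import os
from typing import Any


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash the file in fixed-size chunks so large files are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def core_record(path: str) -> dict[str, Any]:
    """Build a record with the same keys as the dictionary returned by main()."""
    hashes = {"SHA-256": sha256_of_file(path)}  # TLSH, BLAKE3 and QUICK omitted here
    name = os.path.basename(path)
    extension = os.path.splitext(name)[1].lower()  # assumed format: ".txt"
    # Assumption: guess from the file name; the library's detection method is not shown.
    media_type, _ = mimetypes.guess_type(name)
    return {
        "hash": hashes["SHA-256"],
        "similarity_hash": hashes.get("TLSH"),  # None unless a TLSH hash was computed
        "quick_hash": hashes.get("QUICK"),
        "all_hashes": [{"id": k, "value": v} for k, v in hashes.items()],
        "name": name,
        "extension": extension,
        "size": os.path.getsize(path),
        "media_type": media_type or "application/octet-stream",
    }
```

Like `main`, the sketch lets `FileNotFoundError` and other `OSError` subclasses propagate, so a caller such as the ModelRunner is expected to catch them.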

MediaInfoAnnotationModel

MediaInfoAnnotationModel(file_path)

Bases: AnnotationModel

Extract metadata from media files using the pymediainfo library.

This model parses the output of MediaInfo (obtained as JSON) and organizes it into a structured dictionary with a main "General" track and lists for other track types (Video, Audio, Text, etc.).

main

main()

Extract, normalize, and structure metadata from the media file.

Returns:

Type Description
dict[str, Any] | None
  • Dictionary of structured MediaInfo data if successful.
  • None if the pymediainfo library is unavailable, the file cannot be parsed, or essential data is missing. self.error will be set in case of failure.
Source code in venv/lib/python3.12/site-packages/dorsal/file/annotation_models/mediainfo/model.py
def main(self) -> dict[str, Any] | None:
    """
    Extract, normalize, and structure metadata from the media file.

    Returns:
      * Dictionary of structured MediaInfo data if successful.
      * None if pymediainfo library is unavailable, file cannot be parsed, or essential data is missing.
        `self.error` will be set in case of failure.
    """
    if not PYMEDIAINFO_AVAILABLE:
        self.error = f"pymediainfo library is not installed; cannot process media file: '{self.file_path}'"
        logger.error(self.error)
        return None

    logger.debug("MediaInfoAnnotationModel: Starting parsing for '%s'", self.file_path)

    try:
        raw_mediainfo_json_str: str = MediaInfo.parse(filename=self.file_path, output="JSON")
        mediainfo_data = json.loads(raw_mediainfo_json_str)
        logger.debug("MediaInfo.parse and json.loads successful for '%s'", self.file_path)

    except FileNotFoundError:
        self.error = f"Media file not found at path: {self.file_path}"
        logger.error(self.error)
        return None
    except (RuntimeError, OSError) as err:
        self.error = f"pymediainfo failed to parse file '{self.file_path}': {err}"
        logger.exception("MediaInfoAnnotationModel: %s", self.error)
        return None
    except json.JSONDecodeError as err:
        self.error = f"Failed to decode JSON output from MediaInfo for file '{self.file_path}': {err}"
        logger.exception(
            "MediaInfoAnnotationModel: %s. Raw output snippet: %.200s",
            self.error,
            raw_mediainfo_json_str or "",
        )
        return None
    except Exception as err:
        self.error = f"An unexpected error occurred during MediaInfo parsing of '{self.file_path}': {err}"
        logger.exception("MediaInfoAnnotationModel: %s", self.error)
        return None

    try:
        track_list: list[dict[str, Any]] = mediainfo_data["media"]["track"]
        creating_library_data = mediainfo_data.get("creatingLibrary")
    except (KeyError, TypeError) as err:
        self.error = (
            f"MediaInfo JSON output for '{self.file_path}' missing expected structure ('media.track'): {err}"
        )
        logger.exception(
            "MediaInfoAnnotationModel: %s. Data snippet: %s",
            self.error,
            str(mediainfo_data)[:500],
        )
        return None

    normalized_track_list = self._normalize_track_list(track_list=track_list)
    grouped_tracks = self._extract_and_group_tracks(track_list=normalized_track_list)

    if grouped_tracks is None:
        return None

    general_track = grouped_tracks.pop("General")
    final_record = {**general_track, **grouped_tracks}

    final_record["creatingLibrary"] = creating_library_data
    if not creating_library_data:
        logger.debug(
            "MediaInfo output for '%s' did not contain 'creatingLibrary' information.",
            self.file_path,
        )

    logger.debug("MediaInfoAnnotationModel: Successfully processed file '%s'", self.file_path)
    return final_record

OfficeDocumentAnnotationModel

OfficeDocumentAnnotationModel(file_path)

Bases: AnnotationModel

Extracts metadata from Microsoft Office formats (OOXML: .docx, .xlsx, .pptx). This model acts as a dispatcher, calling the correct stdlib-based parser.

main

main()

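Dispatches to the correct format-specific parser based on media_type. Returns the metadata dictionary on success. Returns None without setting self.error when the media type is not an OOXML office type, and returns None with self.error set when the selected parser fails.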

Source code in venv/lib/python3.12/site-packages/dorsal/file/annotation_models/office_document/model.py
def main(self) -> dict[str, Any] | None:
    """
    Dispatches to the correct format-specific parser based on media_type.
    """
    logger.debug(
        "OfficeAnnotationModel: Starting metadata extraction for '%s'",
        self.file_path,
    )

    media_type = self.media_type
    parser_info = OFFICE_MEDIA_TYPE_MAPPING.get(media_type)
    metadata: dict[str, Any] | None = None

    if parser_info:
        parser_func, variant_name = parser_info
        self.variant = variant_name
        metadata = parser_func(self.file_path)
    else:
        logger.debug("OfficeAnnotationModel: Skipping. Media type '%s' is not an OOXML office file.", media_type)
        return None

    if metadata is None:
        if self.error is None:
            self.error = f"Failed to parse metadata for file: {self.file_path} (parser: {self.variant})"
        logger.warning(self.error)
        return None

    logger.debug(
        "OfficeAnnotationModel: Successfully processed '%s' with parser '%s'",
        self.file_path,
        self.variant,
    )
    return metadata

PDFAnnotationModel

PDFAnnotationModel(file_path)

Bases: AnnotationModel

Extract metadata from PDF files using pypdfium2.

main

main(password=None)

Extract, normalize, and return metadata from the PDF file.

Parameters:

Name Type Description Default
password str | None

Optional password from pipeline config.

None

Returns:

Type Description
dict[str, Any] | None
  • Dictionary of normalized PDF metadata if successful.
  • None if the PDF cannot be read or essential metadata extraction fails, with self.error set to an appropriate message.

Raises:

Type Description
ImportError

If pypdfium2 is not installed (propagated from utils).

Exception

For other critical, unrecoverable errors from pypdfium2 not handled by pdfium_extract_pdf_metadata.

Source code in venv/lib/python3.12/site-packages/dorsal/file/annotation_models/pdf/model.py
def main(self, password: str | None = None) -> dict[str, Any] | None:
    """Extract, normalize, and return metadata from the PDF file.

    Args:
        password: Optional password from pipeline config.

    Returns:
        Dictionary of normalized PDF metadata if successful.
        None if the PDF cannot be read or essential metadata extraction fails, with `self.error` set to an appropriate message.

    Raises:
        ImportError: If `pypdfium2` is not installed (propagated from utils).
        Exception: For other critical, unrecoverable errors from `pypdfium2` not handled by `pdfium_extract_pdf_metadata`.
    """
    logger.debug("PDFAnnotationModel: Starting metadata extraction for '%s'", self.file_path)

    try:
        raw_metadata = pdfium_extract_pdf_metadata(file_path=self.file_path, password=password)
    except ImportError:
        self.error = "pypdfium2 library not found. Cannot process PDF."
        logger.error(self.error)
        raise
    except Exception as err:
        self.error = f"Unexpected error during raw PDF metadata extraction: {err}"
        logger.exception(
            "PDFAnnotationModel: Unexpected error from pdfium_extract_pdf_metadata for '%s'.",
            self.file_path,
        )
        return None

    if raw_metadata is None:
        self.error = "PDF metadata could not be extracted by pypdfium2 (e.g., encrypted, corrupted, or unreadable)."
        logger.debug(
            "PDFAnnotationModel: Raw metadata extraction failed for '%s'. Error message to be set: %s",
            self.file_path,
            self.error,
        )
        return None

    logger.debug(
        "PDFAnnotationModel: Raw metadata extracted for '%s', proceeding with normalization.",
        self.file_path,
    )

    try:
        normalized_metadata = self._normalize_pdf_metadata(raw_metadata=raw_metadata)
    except Exception as err:
        self.error = f"Failed to normalize extracted PDF metadata: {err}"
        logger.exception(
            "PDFAnnotationModel: Error during _normalize_pdf_metadata for '%s'.",
            self.file_path,
        )
        return None

    logger.debug(
        "PDFAnnotationModel: Metadata normalization complete for '%s'.",
        self.file_path,
    )
    return normalized_metadata