Annotation Models

dorsal.file.annotation_models

EbookAnnotationModel

EbookAnnotationModel(file_path)

Bases: AnnotationModel

Extracts metadata from common ebook formats (currently only EPUB is supported).

Source code in venv/lib/python3.12/site-packages/dorsal/common/model.py

def __init__(self, file_path: str):
    """Initializes the model, setting the file_path."""
    self.file_path = file_path
    self.error: str | None = None
    self.name: str | None = None
    self.extension: str | None = None
    self.size: int | None = None
    self.media_type: str | None = None
    self.hash: str | None = None
    self.similarity_hash: str | None = None
    self.quick_hash: str | None = None

main

main()

Extracts metadata by dispatching to the correct format-specific parser.

Returns:

Type	Description
`dict[str, Any] \| None`	Dictionary of ebook metadata if successful.
`dict[str, Any] \| None`	None if the format is unsupported or parsing fails.

Source code in venv/lib/python3.12/site-packages/dorsal/file/annotation_models/ebook/model.py

def main(self) -> dict[str, Any] | None:
    """
    Extracts metadata by dispatching to the correct format-specific parser.

    Returns:
      * Dictionary of ebook metadata if successful.
      * None if the format is unsupported or parsing fails.
    """
    logger.debug(
        "EbookAnnotationModel: Starting metadata extraction for '%s'",
        self.file_path,
    )

    try:
        _, ext = os.path.splitext(self.file_path)
        parser_type = EBOOK_FORMAT_MAPPING.get(ext.lower())

        metadata: dict[str, Any] | None = None

        if parser_type == "epub":
            self.variant = "epub_stdlib"
            metadata = ebook_utils.extract_epub_metadata(self.file_path)

        else:
            self.error = f"Unsupported ebook format: '{ext}' for file: {self.file_path}"
            logger.info(self.error)
            return None

        if metadata is None:
            self.error = f"Failed to parse ebook metadata for file: {self.file_path} (parser: {self.variant})"
            logger.warning(self.error)
            return None

        logger.debug(
            "EbookAnnotationModel: Successfully processed '%s' with parser '%s'",
            self.file_path,
            self.variant,
        )
        return metadata

    except ImportError as e:
        self.error = f"Missing dependency for parser '{self.variant}': {e}. Cannot process file."
        logger.error(self.error, exc_info=True)
        return None

    except (FileNotFoundError, IOError, OSError) as e:
        self.error = f"File system error during ebook processing: {type(e).__name__}: {e}"
        logger.error(
            "EbookAnnotationModel: CRITICAL OS/IO Error for '%s'. Error: %s",
            self.file_path,
            self.error,
            exc_info=True,
        )
        raise

    except Exception as e:
        self.error = f"Unexpected error during EbookAnnotationModel processing: {type(e).__name__}: {e}"
        logger.error(
            "EbookAnnotationModel: UNEXPECTED Error for '%s'. Error: %s",
            self.file_path,
            self.error,
            exc_info=True,
        )
        raise
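
The extension lookup at the top of `main` can be illustrated in isolation. This is a minimal sketch, not the library's code: `FORMAT_MAPPING` and `resolve_parser` are hypothetical names, and the mapping contents are an assumption, since the real `EBOOK_FORMAT_MAPPING` is not shown here.

```python
import os
from typing import Optional

# Hypothetical stand-in for EBOOK_FORMAT_MAPPING (real contents are not shown above).
FORMAT_MAPPING = {".epub": "epub"}


def resolve_parser(file_path: str) -> Optional[str]:
    """Return the parser key for a path, or None if the extension is unmapped."""
    # splitext keeps only the final suffix, so "a.tar.gz" yields ".gz".
    _, ext = os.path.splitext(file_path)
    # Lowercasing makes the lookup case-insensitive, as in main() above.
    return FORMAT_MAPPING.get(ext.lower())
```

Under these assumptions, `resolve_parser("Book.EPUB")` resolves to the `"epub"` parser, while a path with an unmapped or missing extension yields None, which corresponds to the "Unsupported ebook format" branch.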

FileCoreAnnotationModel

FileCoreAnnotationModel(file_path)

Bases: AnnotationModel

Annotation model for extracting core file metadata.

Calculates file hashes (SHA-256, TLSH, BLAKE3 and QUICK).
Determines basic file attributes: name, extension, size and media type.
This model is designed for use in the ModelRunner.
Its main method outputs a dictionary conforming to FileCoreValidationModel.

Source code in venv/lib/python3.12/site-packages/dorsal/common/model.py

def __init__(self, file_path: str):
    """Initializes the model, setting the file_path."""
    self.file_path = file_path
    self.error: str | None = None
    self.name: str | None = None
    self.extension: str | None = None
    self.size: int | None = None
    self.media_type: str | None = None
    self.hash: str | None = None
    self.similarity_hash: str | None = None
    self.quick_hash: str | None = None

main

main(calculate_similarity_hash=False)

Main execution method for the FileCoreAnnotationModel.

Orchestrates the extraction of all fundamental file metadata.

Parameters:

Name	Type	Description	Default
`calculate_similarity_hash`	`bool`	If True, the TLSH similarity hash will be calculated and included in the results. Defaults to False.	`False`

Returns:

Type	Description
`dict[str, Any] \| None`	A dictionary containing the extracted file metadata, conforming to the structure expected by `FileCoreValidationModelStrict`.
`dict[str, Any] \| None`	None if a recoverable error specific to this model's logic occurs and `self.error` is set (though the current implementation tends to let critical OS/IO errors propagate).

Raises:

Type	Description
`(FileNotFoundError, IOError, OSError)`	If critical issues occur during file access (e.g., for hashing, size, media type determination). These are expected to be caught by the ModelRunner.

Source code in venv/lib/python3.12/site-packages/dorsal/file/annotation_models/base/model.py

def main(self, calculate_similarity_hash: bool = False) -> dict[str, Any] | None:
    """
    Main execution method for the FileCoreAnnotationModel.

    Orchestrates the extraction of all fundamental file metadata.

    Args:
        calculate_similarity_hash: If True, the TLSH similarity hash will be
                                   calculated and included in the results.
                                   Defaults to False.

    Returns:
        A dictionary containing the extracted file metadata, conforming to
        the structure expected by `FileCoreValidationModelStrict`.
        Returns None if a recoverable error specific to this model's logic occurs
        and `self.error` is set (though current implementation tends to let
        critical OS/IO errors propagate).

    Raises:
        FileNotFoundError, IOError, OSError: If critical issues occur during
            file access (e.g., for hashing, size, media type determination).
            These are expected to be caught by the ModelRunner.
    """
    try:
        logger.debug(
            "FileCoreAnnotationModel main: Starting processing for '%s'",
            self.file_path,
        )

        hashes = self._get_file_hashes(calculate_similarity_hash=calculate_similarity_hash)
        primary_hash = hashes.get("SHA-256")
        if not primary_hash:
            self.error = "Core SHA-256 hash calculation failed."
            logger.error(self.error + " File: '%s'", self.file_path)
            return None

        tlsh_hash = hashes.get("TLSH")
        quick_hash = hashes.get("QUICK")

        all_hashes_list = [{"id": hash_name, "value": hash_value} for hash_name, hash_value in hashes.items()]

        file_name = self._get_filename()
        file_extension = self._get_file_extension(file_name=file_name)
        file_size = self._get_filesize()

        media_type = self._get_media_type(file_extension=file_extension)

        logger.debug(
            "FileCoreAnnotationModel main: Successfully processed '%s'",
            self.file_path,
        )
        return {
            "hash": primary_hash,
            "similarity_hash": tlsh_hash,
            "quick_hash": quick_hash,
            "all_hashes": all_hashes_list,
            "name": file_name,
            "extension": file_extension,
            "size": file_size,
            "media_type": media_type,
        }
    except (FileNotFoundError, IOError, OSError) as e:
        self.error = f"File system error during processing: {type(e).__name__}: {e}"
        logger.error(
            "FileCoreAnnotationModel main: CRITICAL OS/IO Error for '%s'. Error: %s",
            self.file_path,
            self.error,
            exc_info=True,
        )
        raise
    except Exception as e:
        self.error = f"Unexpected error during FileCoreAnnotationModel processing: {type(e).__name__}: {e}"
        logger.error(
            "FileCoreAnnotationModel main: UNEXPECTED Error for '%s'. Error: %s",
            self.file_path,
            self.error,
            exc_info=True,
        )
        raise
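
To make the shape of the returned record concrete, the following sketch builds a reduced version of it with the standard library only. It is an illustration under stated assumptions, not the model's implementation: `core_record` is a hypothetical helper, only SHA-256 is computed (TLSH, BLAKE3 and QUICK are omitted), and the lowercased, dot-prefixed extension is an assumption about what `_get_file_extension` returns.

```python
import hashlib
import os
from typing import Any


def core_record(file_path: str, chunk_size: int = 1024 * 1024) -> dict[str, Any]:
    """Build a reduced core-metadata record (hypothetical helper, SHA-256 only)."""
    sha256 = hashlib.sha256()
    with open(file_path, "rb") as handle:
        # Read fixed-size chunks so large files are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha256.update(chunk)

    hashes = {"SHA-256": sha256.hexdigest()}
    name = os.path.basename(file_path)
    return {
        "hash": hashes["SHA-256"],
        # Same id/value layout as the `all_hashes` list built in main().
        "all_hashes": [{"id": k, "value": v} for k, v in hashes.items()],
        "name": name,
        # Assumption: extension is reported lowercased with its leading dot.
        "extension": os.path.splitext(name)[1].lower(),
        "size": os.path.getsize(file_path),
    }
```

As in `main`, file system errors such as `FileNotFoundError` are not caught here and propagate to the caller.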

MediaInfoAnnotationModel

MediaInfoAnnotationModel(file_path)

Bases: AnnotationModel

Extract metadata from media files using the pymediainfo library.

This model parses the output of MediaInfo (obtained as JSON) and organizes it into a structured dictionary with a main "General" track and lists for other track types (Video, Audio, Text, etc.).

Source code in venv/lib/python3.12/site-packages/dorsal/common/model.py

def __init__(self, file_path: str):
    """Initializes the model, setting the file_path."""
    self.file_path = file_path
    self.error: str | None = None
    self.name: str | None = None
    self.extension: str | None = None
    self.size: int | None = None
    self.media_type: str | None = None
    self.hash: str | None = None
    self.similarity_hash: str | None = None
    self.quick_hash: str | None = None

main

main()

Extract, normalize, and structure metadata from the media file.

Returns:

Type	Description
`dict[str, Any] \| None`	Dictionary of structured MediaInfo data if successful.
`dict[str, Any] \| None`	None if pymediainfo library is unavailable, file cannot be parsed, or essential data is missing. `self.error` will be set in case of failure.

Source code in venv/lib/python3.12/site-packages/dorsal/file/annotation_models/mediainfo/model.py

def main(self) -> dict[str, Any] | None:
    """
    Extract, normalize, and structure metadata from the media file.

    Returns:
      * Dictionary of structured MediaInfo data if successful.
      * None if pymediainfo library is unavailable, file cannot be parsed, or essential data is missing.
        `self.error` will be set in case of failure.
    """
    if not PYMEDIAINFO_AVAILABLE:
        self.error = f"pymediainfo library is not installed; cannot process media file: '{self.file_path}'"
        logger.error(self.error)
        return None

    logger.debug("MediaInfoAnnotationModel: Starting parsing for '%s'", self.file_path)

    try:
        raw_mediainfo_json_str: str = MediaInfo.parse(filename=self.file_path, output="JSON")
        mediainfo_data = json.loads(raw_mediainfo_json_str)
        logger.debug("MediaInfo.parse and json.loads successful for '%s'", self.file_path)

    except FileNotFoundError:
        self.error = f"Media file not found at path: {self.file_path}"
        logger.error(self.error)
        return None
    except (RuntimeError, OSError) as err:
        self.error = f"pymediainfo failed to parse file '{self.file_path}': {err}"
        logger.exception("MediaInfoAnnotationModel: %s", self.error)
        return None
    except json.JSONDecodeError as err:
        self.error = f"Failed to decode JSON output from MediaInfo for file '{self.file_path}': {err}"
        logger.exception(
            "MediaInfoAnnotationModel: %s. Raw output snippet: %.200s",
            self.error,
            raw_mediainfo_json_str or "",
        )
        return None
    except Exception as err:
        self.error = f"An unexpected error occurred during MediaInfo parsing of '{self.file_path}': {err}"
        logger.exception("MediaInfoAnnotationModel: %s", self.error)
        return None

    try:
        track_list: list[dict[str, Any]] = mediainfo_data["media"]["track"]
        creating_library_data = mediainfo_data.get("creatingLibrary")
    except (KeyError, TypeError) as err:
        self.error = (
            f"MediaInfo JSON output for '{self.file_path}' missing expected structure ('media.track'): {err}"
        )
        logger.exception(
            "MediaInfoAnnotationModel: %s. Data snippet: %s",
            self.error,
            str(mediainfo_data)[:500],
        )
        return None

    normalized_track_list = self._normalize_track_list(track_list=track_list)
    grouped_tracks = self._extract_and_group_tracks(track_list=normalized_track_list)

    if grouped_tracks is None:
        return None

    general_track = grouped_tracks.pop("General")
    final_record = {**general_track, **grouped_tracks}

    final_record["creatingLibrary"] = creating_library_data
    if not creating_library_data:
        logger.debug(
            "MediaInfo output for '%s' did not contain 'creatingLibrary' information.",
            self.file_path,
        )

    logger.debug("MediaInfoAnnotationModel: Successfully processed file '%s'", self.file_path)
    return final_record
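
The final merge step, where the "General" track is flattened into the top level and the other track types remain as lists, might look roughly like the sketch below. `flatten_tracks` is a hypothetical function, and reading the track type from an `"@type"` key is an assumption about the track dictionaries; the private helpers `_normalize_track_list` and `_extract_and_group_tracks` are not shown and may behave differently.

```python
from typing import Any, Optional


def flatten_tracks(track_list: list[dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Merge the General track with per-type lists of the remaining tracks."""
    general: Optional[dict[str, Any]] = None
    grouped: dict[str, list[dict[str, Any]]] = {}

    for track in track_list:
        track_type = track.get("@type")  # assumed key name
        if track_type == "General":
            general = track
        elif track_type:
            grouped.setdefault(track_type, []).append(track)

    if general is None:
        # Without a General track there is nothing to anchor the record on.
        return None

    # General fields become top-level keys; other track types stay as lists.
    return {**general, **grouped}
```

Note that with this merge order, a grouped key such as "Video" would overwrite a General field of the same name, which matches the `{**general_track, **grouped_tracks}` expression above.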

OfficeDocumentAnnotationModel

OfficeDocumentAnnotationModel(file_path)

Bases: AnnotationModel

Extracts metadata from Microsoft Office formats (OOXML: .docx, .xlsx, .pptx). This model acts as a dispatcher, calling the correct stdlib-based parser.

Source code in venv/lib/python3.12/site-packages/dorsal/common/model.py

def __init__(self, file_path: str):
    """Initializes the model, setting the file_path."""
    self.file_path = file_path
    self.error: str | None = None
    self.name: str | None = None
    self.extension: str | None = None
    self.size: int | None = None
    self.media_type: str | None = None
    self.hash: str | None = None
    self.similarity_hash: str | None = None
    self.quick_hash: str | None = None

main

main()

Dispatches to the correct format-specific parser based on media_type.

Source code in venv/lib/python3.12/site-packages/dorsal/file/annotation_models/office_document/model.py

def main(self) -> dict[str, Any] | None:
    """
    Dispatches to the correct format-specific parser based on media_type.
    """
    logger.debug(
        "OfficeAnnotationModel: Starting metadata extraction for '%s'",
        self.file_path,
    )

    media_type = self.media_type
    parser_info = OFFICE_MEDIA_TYPE_MAPPING.get(media_type)
    metadata: dict[str, Any] | None = None

    if parser_info:
        parser_func, variant_name = parser_info
        self.variant = variant_name
        metadata = parser_func(self.file_path)
    else:
        logger.debug("OfficeAnnotationModel: Skipping. Media type '%s' is not an OOXML office file.", media_type)
        return None

    if metadata is None:
        if self.error is None:
            self.error = f"Failed to parse metadata for file: {self.file_path} (parser: {self.variant})"
        logger.warning(self.error)
        return None

    logger.debug(
        "OfficeAnnotationModel: Successfully processed '%s' with parser '%s'",
        self.file_path,
        self.variant,
    )
    return metadata
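
Because `main` reads `self.media_type`, which `__init__` leaves as None, the attribute presumably has to be populated before `main` is called; otherwise the lookup misses and the method returns None. The sketch below shows the same tuple-based dispatch in isolation. `MEDIA_TYPE_MAPPING`, `parse_docx_stub` and `dispatch` are hypothetical names, and the mapping is only a stand-in for `OFFICE_MEDIA_TYPE_MAPPING`, whose contents are not shown here.

```python
from typing import Any, Callable, Optional

DOCX_MEDIA_TYPE = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def parse_docx_stub(file_path: str) -> Optional[dict[str, Any]]:
    """Placeholder parser; a real one would read the OOXML package."""
    return {"source": file_path}


# Each media type maps to a (parser function, variant name) pair.
MEDIA_TYPE_MAPPING: dict[str, tuple[Callable[[str], Optional[dict[str, Any]]], str]] = {
    DOCX_MEDIA_TYPE: (parse_docx_stub, "docx_stub"),
}


def dispatch(file_path: str, media_type: Optional[str]) -> tuple[Optional[dict[str, Any]], Optional[str]]:
    """Return (metadata, variant), or (None, None) for unmapped media types."""
    parser_info = MEDIA_TYPE_MAPPING.get(media_type) if media_type else None
    if parser_info is None:
        return None, None
    parser_func, variant_name = parser_info
    return parser_func(file_path), variant_name
```

Keeping the variant name next to the parser function lets the caller record which parser produced a result without a second lookup.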

PDFAnnotationModel

PDFAnnotationModel(file_path)

Bases: AnnotationModel

Extract metadata from PDF files using pypdfium2.

Source code in venv/lib/python3.12/site-packages/dorsal/common/model.py

def __init__(self, file_path: str):
    """Initializes the model, setting the file_path."""
    self.file_path = file_path
    self.error: str | None = None
    self.name: str | None = None
    self.extension: str | None = None
    self.size: int | None = None
    self.media_type: str | None = None
    self.hash: str | None = None
    self.similarity_hash: str | None = None
    self.quick_hash: str | None = None

main

main(password=None)

Extract, normalize, and return metadata from the PDF file.

Parameters:

Name	Type	Description	Default
`password`	`str \| None`	Optional password from pipeline config.	`None`

Returns:

Type	Description
`dict[str, Any] \| None`	Dictionary of normalized PDF metadata if successful.
`dict[str, Any] \| None`	None if the PDF cannot be read or essential metadata extraction fails, with `self.error` set to an appropriate message.

Raises:

Type	Description
`ImportError`	If `pypdfium2` is not installed (propagated from utils).
`Exception`	For other critical, unrecoverable errors from `pypdfium2` not handled by `pdfium_extract_pdf_metadata`.

Source code in venv/lib/python3.12/site-packages/dorsal/file/annotation_models/pdf/model.py

def main(self, password: str | None = None) -> dict[str, Any] | None:
    """Extract, normalize, and return metadata from the PDF file.

    Args:
        password: Optional password from pipeline config.

    Returns:
        Dictionary of normalized PDF metadata if successful.
        None if the PDF cannot be read or essential metadata extraction fails, with `self.error` set to an appropriate message.

    Raises:
        ImportError: If `pypdfium2` is not installed (propagated from utils).
        Exception: For other critical, unrecoverable errors from `pypdfium2` not handled by `pdfium_extract_pdf_metadata`.
    """
    logger.debug("PDFAnnotationModel: Starting metadata extraction for '%s'", self.file_path)

    try:
        raw_metadata = pdfium_extract_pdf_metadata(file_path=self.file_path, password=password)
    except ImportError:
        self.error = "pypdfium2 library not found. Cannot process PDF."
        logger.error(self.error)
        raise
    except Exception as err:
        self.error = f"Unexpected error during raw PDF metadata extraction: {err}"
        logger.exception(
            "PDFAnnotationModel: Unexpected error from pdfium_extract_pdf_metadata for '%s'.",
            self.file_path,
        )
        return None

    if raw_metadata is None:
        self.error = "PDF metadata could not be extracted by pypdfium2 (e.g., encrypted, corrupted, or unreadable)."
        logger.debug(
            "PDFAnnotationModel: Raw metadata extraction failed for '%s'. Error message to be set: %s",
            self.file_path,
            self.error,
        )
        return None

    logger.debug(
        "PDFAnnotationModel: Raw metadata extracted for '%s', proceeding with normalization.",
        self.file_path,
    )

    try:
        normalized_metadata = self._normalize_pdf_metadata(raw_metadata=raw_metadata)
    except Exception as err:
        self.error = f"Failed to normalize extracted PDF metadata: {err}"
        logger.exception(
            "PDFAnnotationModel: Error during _normalize_pdf_metadata for '%s'.",
            self.file_path,
        )
        return None

    logger.debug(
        "PDFAnnotationModel: Metadata normalization complete for '%s'.",
        self.file_path,
    )
    return normalized_metadata
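
From the caller's side, `main` has three outcomes: a metadata dictionary, None with `self.error` set, or a raised `ImportError`. The sketch below shows one way a caller might tell them apart. It is not the ModelRunner; `run_model` and `StubModel` are hypothetical, and the stub only mimics the `main()` and `error` surface so that the example runs without pypdfium2.

```python
from typing import Any, Optional


class StubModel:
    """Hypothetical stand-in exposing `main()` and `error` like the models above."""

    def __init__(
        self,
        result: Optional[dict[str, Any]] = None,
        error: Optional[str] = None,
        raises: Optional[Exception] = None,
    ):
        self.error: Optional[str] = None
        self._result = result
        self._error = error
        self._raises = raises

    def main(self, **options: Any) -> Optional[dict[str, Any]]:
        if self._raises is not None:
            raise self._raises
        if self._result is None:
            self.error = self._error
        return self._result


def run_model(model: Any, **options: Any) -> dict[str, Any]:
    """Call model.main(**options) and classify the outcome."""
    try:
        record = model.main(**options)
    except ImportError as exc:
        # A missing optional dependency is reported rather than re-raised.
        return {"status": "missing_dependency", "record": None, "error": str(exc)}
    if record is None:
        # Recoverable failure: the model explains itself through `error`.
        return {"status": "failed", "record": None, "error": model.error}
    return {"status": "ok", "record": record, "error": None}
```

With a real model, options such as `password` would be forwarded unchanged, for example `run_model(model, password="...")`.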