
Core Annotation Models

Core Annotation Models extract core metadata fields from files.

Core metadata fields describe something meaningful about a certain kind of file. Examples include page_count, which is relevant for certain document types such as PDF, and size, which is relevant for files of any type.

Most Core Annotation Models are specific to one or more file types, such as PDF documents or Zip Archives.

This page details the Core Annotation Models currently available in Dorsal and the file types they support.

A code example is provided for each model. In practice, however, you would almost always run the models as part of a pipeline orchestrated by the ModelRunner class.

| Model | Schema ID | Extracts | Runs On |
|---|---|---|---|
| FileCoreAnnotationModel | file/base | General metadata (file hashes, name, size, media type). | All files |
| EbookAnnotationModel | file/ebook | Ebook metadata (e.g. title, authors, isbn). | Ebooks (only EPUB currently supported) |
| OfficeDocumentAnnotationModel | file/office | Office document metadata (e.g. author, page_count, sheets). | Office documents (.docx, .xlsx, .pptx) |
| MediaInfoAnnotationModel | file/mediainfo | Media file metadata (e.g. codecs, duration, dimensions). | Audio/Video/Image file formats |
| PDFAnnotationModel | file/pdf | PDF-specific metadata (e.g. pages, creator, dates). | PDF documents |

dorsal.file.annotation_models.base.model.FileCoreAnnotationModel

FileCoreAnnotationModel

Bases: AnnotationModel

Annotation model for extracting core file metadata.

  • Calculate file hashes (SHA256, TLSH, BLAKE3 and QUICK).
  • Determine basic file attributes: name, extension, size and media type.
  • This model is designed for use in the ModelRunner.
  • Its main method outputs a dictionary conforming to FileCoreValidationModel.
Source code in venv/lib/python3.13/site-packages/dorsal/common/model.py
def __init__(self, file_path: str):
    """Initializes the model, setting the file_path."""
    self.file_path = file_path
    self.error: str | None = None
    self.name: str | None = None
    self.extension: str | None = None
    self.size: int | None = None
    self.media_type: str | None = None
    self.hash: str | None = None
    self.similarity_hash: str | None = None
    self.quick_hash: str | None = None

main


Main execution method for the FileCoreAnnotationModel.

Orchestrates the extraction of all fundamental file metadata.

Parameters:

calculate_similarity_hash (bool, default: False): If True, the TLSH similarity hash will be calculated and included in the results.

Returns:

dict[str, Any] | None: A dictionary containing the extracted file metadata, conforming to the structure expected by FileCoreValidationModelStrict. Returns None if a recoverable error specific to this model's logic occurs and self.error is set (though the current implementation tends to let critical OS/IO errors propagate).

Raises:

FileNotFoundError, IOError, OSError: If critical issues occur during file access (e.g. for hashing, size, or media type determination). These are expected to be caught by the ModelRunner.

Source code in venv/lib/python3.13/site-packages/dorsal/file/annotation_models/base/model.py
def main(self, calculate_similarity_hash: bool = False) -> dict[str, Any] | None:
    """
    Main execution method for the FileCoreAnnotationModel.

    Orchestrates the extraction of all fundamental file metadata.

    Args:
        calculate_similarity_hash: If True, the TLSH similarity hash will be
                                   calculated and included in the results.
                                   Defaults to False.

    Returns:
        A dictionary containing the extracted file metadata, conforming to
        the structure expected by `FileCoreValidationModelStrict`.
        Returns None if a recoverable error specific to this model's logic occurs
        and `self.error` is set (though current implementation tends to let
        critical OS/IO errors propagate).

    Raises:
        FileNotFoundError, IOError, OSError: If critical issues occur during
            file access (e.g., for hashing, size, media type determination).
            These are expected to be caught by the ModelRunner.
    """
    try:
        logger.debug(
            "FileCoreAnnotationModel main: Starting processing for '%s'",
            self.file_path,
        )

        hashes = self._get_file_hashes(calculate_similarity_hash=calculate_similarity_hash)
        primary_hash = hashes.get("SHA-256")
        if not primary_hash:
            self.error = "Core SHA-256 hash calculation failed."
            logger.error(self.error + " File: '%s'", self.file_path)
            return None

        tlsh_hash = hashes.get("TLSH")
        quick_hash = hashes.get("QUICK")

        all_hashes_list = [{"id": hash_name, "value": hash_value} for hash_name, hash_value in hashes.items()]

        file_name = self._get_filename()
        file_extension = self._get_file_extension(file_name=file_name)
        file_size = self._get_filesize()

        media_type = self._get_media_type(file_extension=file_extension)

        logger.debug(
            "FileCoreAnnotationModel main: Successfully processed '%s'",
            self.file_path,
        )
        return {
            "hash": primary_hash,
            "similarity_hash": tlsh_hash,
            "quick_hash": quick_hash,
            "all_hashes": all_hashes_list,
            "name": file_name,
            "extension": file_extension,
            "size": file_size,
            "media_type": media_type,
        }
    except (FileNotFoundError, IOError, OSError) as e:
        self.error = f"File system error during processing: {type(e).__name__}: {e}"
        logger.error(
            "FileCoreAnnotationModel main: CRITICAL OS/IO Error for '%s'. Error: %s",
            self.file_path,
            self.error,
            exc_info=True,
        )
        raise
    except Exception as e:
        self.error = f"Unexpected error during FileCoreAnnotationModel processing: {type(e).__name__}: {e}"
        logger.error(
            "FileCoreAnnotationModel main: UNEXPECTED Error for '%s'. Error: %s",
            self.file_path,
            self.error,
            exc_info=True,
        )
        raise
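
When `main` is called directly rather than through the ModelRunner, the caller has to deal with both failure modes described above: a `None` return with `self.error` set, and propagated OS/IO exceptions. The following is a minimal sketch of one way to do that. `run_model` and `_StubModel` are hypothetical names used for illustration and are not part of Dorsal; the stub only mimics the `main`/`error` surface shown in the source so that the example runs without Dorsal installed.

```python
from __future__ import annotations

from typing import Any


def run_model(model: Any, **options: Any) -> tuple[dict[str, Any] | None, str | None]:
    """Call model.main(**options) and return (annotation, error_message)."""
    try:
        result = model.main(**options)
    except OSError as exc:
        # FileNotFoundError is a subclass of OSError, and IOError is an alias of it.
        return None, f"{type(exc).__name__}: {exc}"
    if result is None:
        # The model signalled a recoverable failure; prefer its own message.
        return None, getattr(model, "error", None) or "model returned no annotation"
    return result, None


class _StubModel:
    """Hypothetical stand-in with the same `file_path`/`error`/`main` surface."""

    def __init__(self, file_path: str):
        self.file_path = file_path
        self.error: str | None = None

    def main(self) -> dict[str, Any] | None:
        # Opening the file raises FileNotFoundError if it does not exist.
        with open(self.file_path, "rb") as handle:
            handle.read(1)
        return {"name": self.file_path}


annotation, error = run_model(_StubModel("./does-not-exist.bin"))
```

In this sketch a missing file yields `annotation is None` and an error string beginning with `FileNotFoundError`, instead of an unhandled exception.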

Code Example

FileCoreAnnotationModel is the most general of the Core Annotation Models, running on every file as the first step in the default Annotation Model pipeline.

It extracts general file metadata, including file hashes, name, size and media type.

In the example below, the FileCoreAnnotationModel is run for a single file:

from dorsal.file.annotation_models.base.model import FileCoreAnnotationModel

model = FileCoreAnnotationModel("./big_buck_bunny_1080p_h264.mov")  # Path to file on local system
model.main(calculate_similarity_hash=True)  # `main` returns the annotation as a dictionary
Output
{
  "hash": "dc2146a2b1172def56730143ad80cd1825b7fad15f1fc9c23a4e7d01a741ac11",
  "similarity_hash": "T1AE7933F5A7329607815E36F8EB025F12DC44FC931E3E976A339B12B91E853256C63B18",
  "quick_hash": "9810dd75bba04f061f0ea52021930edd2d24fc895850b8eeeb659d17e13b64a6",
  "all_hashes": [
    {
      "id": "SHA-256",
      "value": "dc2146a2b1172def56730143ad80cd1825b7fad15f1fc9c23a4e7d01a741ac11"
    },
    {
      "id": "BLAKE3",
      "value": "6e2705bec1ae55bbf3ddd0d44305bf83fa847339bb32f05494a307b7ff223ac4"
    },
    {
      "id": "TLSH",
      "value": "T1AE7933F5A7329607815E36F8EB025F12DC44FC931E3E976A339B12B91E853256C63B18"
    },
    {
      "id": "QUICK",
      "value": "9810dd75bba04f061f0ea52021930edd2d24fc895850b8eeeb659d17e13b64a6"
    }
  ],
  "name": "big_buck_bunny_1080p_h264.mov",
  "extension": ".mov",
  "size": 725106140,
  "media_type": "video/quicktime"
}
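
The `hash` field in the output above is the same value as the `SHA-256` entry in `all_hashes`. As a rough sketch, assuming this value is the hex-encoded SHA-256 digest of the full file contents (the hashing helper itself is not shown on this page), the annotation could be cross-checked with the standard library. `sha256_of_file` and `matches_annotation` are hypothetical helper names.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_annotation(path, annotation):
    """Compare a freshly computed digest with the annotation's `hash` field."""
    return sha256_of_file(path) == annotation["hash"]
```

Reading in chunks keeps memory use flat, which matters for large files such as the 725 MB video in the example.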

dorsal.file.annotation_models.ebook.model.EbookAnnotationModel

EbookAnnotationModel

Bases: AnnotationModel

Extracts metadata from common ebook formats (e.g., EPUB, MOBI).

Source code in venv/lib/python3.13/site-packages/dorsal/common/model.py
def __init__(self, file_path: str):
    """Initializes the model, setting the file_path."""
    self.file_path = file_path
    self.error: str | None = None
    self.name: str | None = None
    self.extension: str | None = None
    self.size: int | None = None
    self.media_type: str | None = None
    self.hash: str | None = None
    self.similarity_hash: str | None = None
    self.quick_hash: str | None = None

main


Extracts metadata by dispatching to the correct format-specific parser.

Returns:

dict[str, Any] | None: Dictionary of ebook metadata if successful; None if the format is unsupported or parsing fails.
Source code in venv/lib/python3.13/site-packages/dorsal/file/annotation_models/ebook/model.py
def main(self) -> dict[str, Any] | None:
    """
    Extracts metadata by dispatching to the correct format-specific parser.

    Returns:
      * Dictionary of ebook metadata if successful.
      * None if the format is unsupported or parsing fails.
    """
    logger.debug(
        "EbookAnnotationModel: Starting metadata extraction for '%s'",
        self.file_path,
    )

    try:
        _, ext = os.path.splitext(self.file_path)
        parser_type = EBOOK_FORMAT_MAPPING.get(ext.lower())

        metadata: dict[str, Any] | None = None

        if parser_type == "epub":
            self.variant = "epub_stdlib"
            metadata = ebook_utils.extract_epub_metadata(self.file_path)

        else:
            self.error = f"Unsupported ebook format: '{ext}' for file: {self.file_path}"
            logger.info(self.error)
            return None

        if metadata is None:
            self.error = f"Failed to parse ebook metadata for file: {self.file_path} (parser: {self.variant})"
            logger.warning(self.error)
            return None

        logger.debug(
            "EbookAnnotationModel: Successfully processed '%s' with parser '%s'",
            self.file_path,
            self.variant,
        )
        return metadata

    except ImportError as e:
        self.error = f"Missing dependency for parser '{self.variant}': {e}. Cannot process file."
        logger.error(self.error, exc_info=True)
        return None

    except (FileNotFoundError, IOError, OSError) as e:
        self.error = f"File system error during ebook processing: {type(e).__name__}: {e}"
        logger.error(
            "EbookAnnotationModel: CRITICAL OS/IO Error for '%s'. Error: %s",
            self.file_path,
            self.error,
            exc_info=True,
        )
        raise

    except Exception as e:
        self.error = f"Unexpected error during EbookAnnotationModel processing: {type(e).__name__}: {e}"
        logger.error(
            "EbookAnnotationModel: UNEXPECTED Error for '%s'. Error: %s",
            self.file_path,
            self.error,
            exc_info=True,
        )
        raise

Code Example

EbookAnnotationModel extracts metadata from Ebook files. Currently only EPUB format is supported. Fields extracted include title, authors, publisher and isbn.

In the example below, the EbookAnnotationModel is run for a single file:

from dorsal.file.annotation_models.ebook.model import EbookAnnotationModel

model = EbookAnnotationModel("./books/Stephenson - The Diamond Age.epub")  # Path to file on local system
model.main()  # `main` returns the annotation as a dictionary
Output
{
  'title': 'The Diamond Age',
  'authors': ['Neal Stephenson'],
  'contributors': [],
  'publisher': 'Random House Publishing Group',
  'language': 'English',
  'subjects': [],
  'description': None,
  'rights': 'Copyright 2003',
  'isbn': '9780553898200',
  'other_identifiers': [],
  'cover_path': None,
  'tools': [],
  'publication_date': datetime.datetime(2003, 6, 18, 0, 0),
  'creation_date': None,
  'modification_date': None
}
| File Extension | Media Type |
|---|---|
| .epub | application/epub+zip |
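
The `epub_stdlib` variant name in the source suggests that the EPUB parser relies on the Python standard library. The actual `ebook_utils.extract_epub_metadata` function is not shown on this page, so the following is only a rough sketch, under that assumption, of how the title and authors could be read from an EPUB container using `zipfile` and `xml.etree.ElementTree`. `read_epub_title_and_authors` is a hypothetical name.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the EPUB container file and the OPF package document.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_epub_title_and_authors(source):
    """Return {'title': ..., 'authors': [...]} for an EPUB path or file-like object.

    Returns None if the container does not point at a package document.
    """
    with zipfile.ZipFile(source) as archive:
        # META-INF/container.xml names the location of the OPF package document.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        if rootfile is None or not rootfile.get("full-path"):
            return None
        package = ET.fromstring(archive.read(rootfile.get("full-path")))

    title = package.findtext("opf:metadata/dc:title", default=None, namespaces=OPF_NS)
    authors = [
        element.text
        for element in package.findall("opf:metadata/dc:creator", OPF_NS)
        if element.text
    ]
    return {"title": title, "authors": authors}
```

A real parser would also need to handle malformed archives and the remaining fields shown in the output above; this sketch covers only the two simplest ones.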

dorsal.file.annotation_models.office_document.model.OfficeDocumentAnnotationModel

OfficeDocumentAnnotationModel

Bases: AnnotationModel

Extracts metadata from Microsoft Office formats (OOXML: .docx, .xlsx, .pptx). This model acts as a dispatcher, calling the correct stdlib-based parser.

Source code in venv/lib/python3.13/site-packages/dorsal/common/model.py
def __init__(self, file_path: str):
    """Initializes the model, setting the file_path."""
    self.file_path = file_path
    self.error: str | None = None
    self.name: str | None = None
    self.extension: str | None = None
    self.size: int | None = None
    self.media_type: str | None = None
    self.hash: str | None = None
    self.similarity_hash: str | None = None
    self.quick_hash: str | None = None

main


Dispatches to the correct format-specific parser based on media_type.

Source code in venv/lib/python3.13/site-packages/dorsal/file/annotation_models/office_document/model.py
def main(self) -> dict[str, Any] | None:
    """
    Dispatches to the correct format-specific parser based on media_type.
    """
    logger.debug(
        "OfficeAnnotationModel: Starting metadata extraction for '%s'",
        self.file_path,
    )

    media_type = self.media_type
    parser_info = OFFICE_MEDIA_TYPE_MAPPING.get(media_type)
    metadata: dict[str, Any] | None = None

    if parser_info:
        parser_func, variant_name = parser_info
        self.variant = variant_name
        metadata = parser_func(self.file_path)
    else:
        logger.debug("OfficeAnnotationModel: Skipping. Media type '%s' is not an OOXML office file.", media_type)
        return None

    if metadata is None:
        if self.error is None:
            self.error = f"Failed to parse metadata for file: {self.file_path} (parser: {self.variant})"
        logger.warning(self.error)
        return None

    logger.debug(
        "OfficeAnnotationModel: Successfully processed '%s' with parser '%s'",
        self.file_path,
        self.variant,
    )
    return metadata

Code Example

OfficeDocumentAnnotationModel extracts metadata from Microsoft Office formats (OOXML: .docx, .xlsx, .pptx). It acts as a dispatcher, calling the correct format-specific parser based on the file's media type.

It extracts common properties like author, title, and creation date, as well as format-specific details, such as page/word counts for Word documents, sheet names for Excel, and slide counts for PowerPoint.

In the example below, the OfficeDocumentAnnotationModel is run for a single .docx file:

from dorsal.file.annotation_models.office_document.model import OfficeDocumentAnnotationModel

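model = OfficeDocumentAnnotationModel("./reports/Q3_Report.docx")  # Path to file on local system
# `main` dispatches on `model.media_type`, which the constructor shown above leaves as None,
# so set it explicitly when running the model outside a pipeline.
model.media_type = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
model.main()  # `main` returns the annotation as a dictionary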
Output (for a .docx file)
{
  'author': 'Jane Doe',
  'last_modified_by': 'John Smith',
  'title': 'Q3 Financial Report',
  'subject': 'Quarterly Earnings',
  'keywords': ['finance', 'report', 'Q3'],
  'revision': 3,
  'creation_date': datetime.datetime(2025, 10, 1, 9, 0, 0),
  'modified_date': datetime.datetime(2025, 10, 5, 14, 30, 15),
  'application_name': 'Microsoft Office Word',
  'application_version': '16.0300',
  'template': 'CompanyReport.dotx',
  'structural_parts': [
    'application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml',
    'application/vnd.openxmlformats-officedocument.wordprocessingml.comments+xml',
    'application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml',
  ],
  'has_comments': True,
  'custom_properties': {
    'Department': 'Finance',
    'Status': 'Draft'
  },
  'language': 'English',
  'language_code': 'eng',
  'locale_code': 'en-US',
  'default_font': 'Calibri',
  'all_fonts': ['Calibri', 'Times New Roman'],
  'is_password_protected': False,
  'word': {
    'page_count': 15,
    'word_count': 3450,
    'char_count': 18970,
    'paragraph_count': 120,
    'has_track_changes': True,
    'hyperlinks': ['https://dorsalhub.com'],
    'embedded_images': ['word/media/image1.png']
  },
  'excel': None,
  'powerpoint': None
}
| File Extension | Media Type |
|---|---|
| .docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document |
| .xlsx | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet |
| .pptx | application/vnd.openxmlformats-officedocument.presentationml.presentation |
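
The dispatch step in `main` looks up the media type in a mapping and unpacks a `(parser_func, variant_name)` pair. The contents of `OFFICE_MEDIA_TYPE_MAPPING` are not shown on this page, so the sketch below only illustrates the pattern. The parser functions, the variant names and `MEDIA_TYPE_TO_PARSER` are hypothetical placeholders; only the three media types come from the table above.

```python
from typing import Callable, Optional

Parser = Callable[[str], Optional[dict]]


def _parse_docx(file_path: str) -> Optional[dict]:
    # Hypothetical placeholder: a real parser would open the OOXML package.
    return {"word": {}, "excel": None, "powerpoint": None}


def _parse_xlsx(file_path: str) -> Optional[dict]:
    return {"word": None, "excel": {}, "powerpoint": None}


def _parse_pptx(file_path: str) -> Optional[dict]:
    return {"word": None, "excel": None, "powerpoint": {}}


# Media type -> (parser function, variant name), mirroring the tuple unpacked in `main`.
MEDIA_TYPE_TO_PARSER = {
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": (_parse_docx, "docx_stub"),
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": (_parse_xlsx, "xlsx_stub"),
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": (_parse_pptx, "pptx_stub"),
}


def dispatch(file_path: str, media_type: Optional[str]):
    """Return (metadata, variant_name), or None for an unsupported media type."""
    parser_info = MEDIA_TYPE_TO_PARSER.get(media_type)
    if parser_info is None:
        return None
    parser_func, variant_name = parser_info
    metadata = parser_func(file_path)
    if metadata is None:
        return None
    return metadata, variant_name
```

A table-driven lookup like this keeps the model itself free of format-specific branches: supporting another format means adding one entry to the mapping.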

dorsal.file.annotation_models.pdf.model.PDFAnnotationModel

PDFAnnotationModel

Bases: AnnotationModel

Extract metadata from PDF files using pypdfium2.

Source code in venv/lib/python3.13/site-packages/dorsal/common/model.py
def __init__(self, file_path: str):
    """Initializes the model, setting the file_path."""
    self.file_path = file_path
    self.error: str | None = None
    self.name: str | None = None
    self.extension: str | None = None
    self.size: int | None = None
    self.media_type: str | None = None
    self.hash: str | None = None
    self.similarity_hash: str | None = None
    self.quick_hash: str | None = None

main


Extract, normalize, and return metadata from the PDF file.

Parameters:

password (str | None, default: None): Optional password from pipeline config.

Returns:

dict[str, Any] | None: Dictionary of normalized PDF metadata if successful. None if the PDF cannot be read or essential metadata extraction fails, with self.error set to an appropriate message.

Raises:

ImportError: If pypdfium2 is not installed (propagated from utils).

Source code in venv/lib/python3.13/site-packages/dorsal/file/annotation_models/pdf/model.py
def main(self, password: str | None = None) -> dict[str, Any] | None:
    """Extract, normalize, and return metadata from the PDF file.

    Args:
        password: Optional password from pipeline config.

    Returns:
        Dictionary of normalized PDF metadata if successful.
        None if the PDF cannot be read or essential metadata extraction fails, with `self.error` set to an appropriate message.

    Raises:
        ImportError: If `pypdfium2` is not installed (propagated from utils).
        Other exceptions from `pypdfium2` for critical, unrecoverable errors not handled by `pdfium_extract_pdf_metadata`.

    """
    logger.debug("PDFAnnotationModel: Starting metadata extraction for '%s'", self.file_path)

    try:
        raw_metadata = pdfium_extract_pdf_metadata(file_path=self.file_path, password=password)
    except ImportError:
        self.error = "pypdfium2 library not found. Cannot process PDF."
        logger.error(self.error)
        raise
    except Exception as err:
        self.error = f"Unexpected error during raw PDF metadata extraction: {err}"
        logger.exception(
            "PDFAnnotationModel: Unexpected error from pdfium_extract_pdf_metadata for '%s'.",
            self.file_path,
        )
        return None

    if raw_metadata is None:
        self.error = "PDF metadata could not be extracted by pypdfium2 (e.g., encrypted, corrupted, or unreadable)."
        logger.debug(
            "PDFAnnotationModel: Raw metadata extraction failed for '%s'. Error message to be set: %s",
            self.file_path,
            self.error,
        )
        return None

    logger.debug(
        "PDFAnnotationModel: Raw metadata extracted for '%s', proceeding with normalization.",
        self.file_path,
    )

    try:
        normalized_metadata = self._normalize_pdf_metadata(raw_metadata=raw_metadata)
    except Exception as err:
        self.error = f"Failed to normalize extracted PDF metadata: {err}"
        logger.exception(
            "PDFAnnotationModel: Error during _normalize_pdf_metadata for '%s'.",
            self.file_path,
        )
        return None

    logger.debug(
        "PDFAnnotationModel: Metadata normalization complete for '%s'.",
        self.file_path,
    )
    return normalized_metadata

Code Example

PDFAnnotationModel extracts metadata from PDF documents, such as page count, creator, version, and creation dates, using the pypdfium2 library.

In the example below, the PDFAnnotationModel is run for a single file:

from dorsal.file.annotation_models.pdf.model import PDFAnnotationModel

model = PDFAnnotationModel("./PDFSPEC.pdf")  # Path to file on local system
model.main()  # `main` returns the annotation as a dictionary
Output
{
  'creator': 'FrameMaker 5.1.1',
  'producer': 'Acrobat Distiller 3.0 for Power Macintosh',
  'creation_date': datetime.datetime(1996, 11, 12, 3, 8, 43),
  'author': 'Tim Bienz, Richard Cohn, James R. Meehan',
  'modified_date': datetime.datetime(1996, 11, 12, 7, 58, 15),
  'title': 'Portable Document Format Reference Manual (v 1.2)',
  'subject': 'Description of the PDF file format',
  'keywords': 'Acrobat PDF',
  'version': '1.2',
  'page_count': 394
}
| File Extension | Media Type |
|---|---|
| .pdf | application/pdf |
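
The `creation_date` and `modified_date` fields in the output above are `datetime` objects, which implies a normalization step on the raw values. The model's `_normalize_pdf_metadata` method is not shown on this page. The sketch below assumes the raw values follow the PDF date string format (`D:YYYYMMDDHHmmSS` with an optional UTC offset) and shows one way such a string could be converted. `parse_pdf_date` is a hypothetical helper, not Dorsal's implementation.

```python
import re
from datetime import datetime, timedelta, timezone

# Optional "D:" prefix, a four-digit year, then optional two-digit fields,
# then an optional UTC offset such as Z, +01'00' or -0800.
_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+\-])(?:(\d{2})'?(?:(\d{2})'?)?)?)?$"
)


def parse_pdf_date(value):
    """Convert a PDF date string to a datetime, or return None if it cannot be parsed."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None
    year, month, day, hour, minute, second, sign, tz_hours, tz_minutes = match.groups()
    try:
        # Missing month/day default to 1; missing time fields default to 0.
        parsed = datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0),
        )
        if sign is None:
            return parsed  # No offset given: keep the value timezone-naive.
        offset = timedelta(hours=int(tz_hours or 0), minutes=int(tz_minutes or 0))
        if sign == "-":
            offset = -offset
        return parsed.replace(tzinfo=timezone(offset))
    except ValueError:
        # Out-of-range fields, e.g. month 13 or an offset of 24 hours or more.
        return None
```

Under this assumption, the raw string `D:19961112030843` corresponds to the `creation_date` shown in the example output.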

dorsal.file.annotation_models.mediainfo.model.MediaInfoAnnotationModel

MediaInfoAnnotationModel

Bases: AnnotationModel

Extract metadata from media files using the pymediainfo library.

This model parses the output of MediaInfo (obtained as JSON) and organizes it into a structured dictionary with a main "General" track and lists for other track types (Video, Audio, Text, etc.).

Source code in venv/lib/python3.13/site-packages/dorsal/common/model.py
def __init__(self, file_path: str):
    """Initializes the model, setting the file_path."""
    self.file_path = file_path
    self.error: str | None = None
    self.name: str | None = None
    self.extension: str | None = None
    self.size: int | None = None
    self.media_type: str | None = None
    self.hash: str | None = None
    self.similarity_hash: str | None = None
    self.quick_hash: str | None = None

main


Extract, normalize, and structure metadata from the media file.

Returns:

dict[str, Any] | None: Dictionary of structured MediaInfo data if successful; None if the pymediainfo library is unavailable, the file cannot be parsed, or essential data is missing. self.error will be set in case of failure.
Source code in venv/lib/python3.13/site-packages/dorsal/file/annotation_models/mediainfo/model.py
def main(self) -> dict[str, Any] | None:
    """
    Extract, normalize, and structure metadata from the media file.

    Returns:
      * Dictionary of structured MediaInfo data if successful.
      * None if pymediainfo library is unavailable, file cannot be parsed, or essential data is missing.
        `self.error` will be set in case of failure.
    """
    if not PYMEDIAINFO_AVAILABLE:
        self.error = f"pymediainfo library is not installed; cannot process media file: '{self.file_path}'"
        logger.error(self.error)
        return None

    logger.debug("MediaInfoAnnotationModel: Starting parsing for '%s'", self.file_path)

    try:
        raw_mediainfo_json_str: str = MediaInfo.parse(filename=self.file_path, output="JSON")
        mediainfo_data = json.loads(raw_mediainfo_json_str)
        logger.debug("MediaInfo.parse and json.loads successful for '%s'", self.file_path)

    except FileNotFoundError:
        self.error = f"Media file not found at path: {self.file_path}"
        logger.error(self.error)
        return None
    except (RuntimeError, OSError) as err:
        self.error = f"pymediainfo failed to parse file '{self.file_path}': {err}"
        logger.exception("MediaInfoAnnotationModel: %s", self.error)
        return None
    except json.JSONDecodeError as err:
        self.error = f"Failed to decode JSON output from MediaInfo for file '{self.file_path}': {err}"
        logger.exception(
            "MediaInfoAnnotationModel: %s. Raw output snippet: %.200s",
            self.error,
            raw_mediainfo_json_str or "",
        )
        return None
    except Exception as err:
        self.error = f"An unexpected error occurred during MediaInfo parsing of '{self.file_path}': {err}"
        logger.exception("MediaInfoAnnotationModel: %s", self.error)
        return None

    try:
        track_list: list[dict[str, Any]] = mediainfo_data["media"]["track"]
        creating_library_data = mediainfo_data.get("creatingLibrary")
    except (KeyError, TypeError) as err:
        self.error = (
            f"MediaInfo JSON output for '{self.file_path}' missing expected structure ('media.track'): {err}"
        )
        logger.exception(
            "MediaInfoAnnotationModel: %s. Data snippet: %s",
            self.error,
            str(mediainfo_data)[:500],
        )
        return None

    normalized_track_list = self._normalize_track_list(track_list=track_list)
    grouped_tracks = self._extract_and_group_tracks(track_list=normalized_track_list)

    if grouped_tracks is None:
        return None

    general_track = grouped_tracks.pop("General")
    final_record = {**general_track, **grouped_tracks}

    final_record["creatingLibrary"] = creating_library_data
    if not creating_library_data:
        logger.debug(
            "MediaInfo output for '%s' did not contain 'creatingLibrary' information.",
            self.file_path,
        )

    logger.debug("MediaInfoAnnotationModel: Successfully processed file '%s'", self.file_path)
    return final_record
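
The final record is built by merging the "General" track into the top level and keeping the remaining tracks in lists keyed by track type. The helpers `_normalize_track_list` and `_extract_and_group_tracks` are not shown on this page, so the following is only a sketch of that grouping idea. It assumes each normalized track carries a `StreamKind` key, as the tracks in the example output do; `group_tracks` is a hypothetical name.

```python
from collections import defaultdict


def group_tracks(track_list):
    """Merge the first General track into the top level and group the rest by kind.

    Returns None if no General track is present.
    """
    general = None
    grouped = defaultdict(list)
    for track in track_list:
        kind = track.get("StreamKind")
        if kind == "General":
            if general is None:
                general = track  # Only the first General track is kept.
            continue
        if kind:
            grouped[kind].append(track)
    if general is None:
        return None
    # General fields become top-level keys; other kinds become lists of tracks.
    return {**general, **grouped}


record = group_tracks([
    {"StreamKind": "General", "Format": "MPEG-4"},
    {"StreamKind": "Video", "Format": "AVC"},
    {"StreamKind": "Audio", "Format": "AAC"},
])
```

Here `record["Format"]` is the container format from the General track, while `record["Video"]` and `record["Audio"]` are lists, since a file may hold several tracks of the same kind.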

Code Example

MediaInfoAnnotationModel extracts technical metadata from a wide variety of audio, video, and image files using the underlying MediaInfo library (via pymediainfo).

In the example below, the MediaInfoAnnotationModel is run for a single file:

from dorsal.file.annotation_models.mediainfo.model import MediaInfoAnnotationModel

model = MediaInfoAnnotationModel("./big_buck_bunny_1080p_h264.mov")  # Path to file on local system
model.main()
Output
{
    "Audio_Codec_List": "AAC LC",
    "AudioCount": 1,
    "Audio_Channels_Total": 6,
    "Audio_Format_List": "AAC LC",
    "Audio_Format_WithHint_List": "AAC LC",
    "Audio_Language_List": "English",
    "CodecID_Compatible": "qt  ",
    "CodecID": "qt  ",
    "CodecID_String": "qt   2005.03 (qt  )",
    "CodecID_Url": "http://www.apple.com/quicktime/download/standalone.html",
    "CodecID_Version": "2005.03",
    "Count": 359,
    "DataSize": 724711422,
    "Duration": 596.462,
    "Duration_String1": "9 min 56 s 462 ms",
    "Duration_String2": "9 min 56 s",
    "Duration_String3": "00:09:56.462",
    "Duration_String4": "00:09:56:11",
    "Duration_String5": "00:09:56.462 (00:09:56:11)",
    "Duration_String": "9 min 56 s",
    "Encoded_Date": "2008-05-27 18:40:35 UTC",
    "Encoded_Library": "Apple QuickTime 7.4.1",
    "Encoded_Library_Name": "Apple QuickTime",
    "Encoded_Library_String": "Apple QuickTime 7.4.1",
    "Encoded_Library_Version": "7.4.1",
    "FileExtension": "mov",
    "File_Modified_Date_Local": "2024-05-16 16:58:01",
    "File_Modified_Date": "2024-05-16 15:58:01 UTC",
    "FileNameExtension": "big_buck_bunny_1080p_h264.mov",
    "FileName": "big_buck_bunny_1080p_h264",
    "FileSize": "725106140",
    "FileSize_String1": "692 MiB",
    "FileSize_String2": "692 MiB",
    "FileSize_String3": "692 MiB",
    "FileSize_String4": "691.5 MiB",
    "FileSize_String": "692 MiB",
    "FooterSize": 0,
    "Format_Commercial": "MPEG-4",
    "Format_Extensions": "braw mov mp4 m4v m4a m4b m4p m4r 3ga 3gpa 3gpp 3gp 3gpp2 3g2 k3g jpm jpx mqv ismv isma ismt f4a f4b f4v",
    "Format": "MPEG-4",
    "Format_Profile": "QuickTime",
    "Format_String": "MPEG-4",
    "FrameCount": 14315,
    "FrameRate": 24.0,
    "FrameRate_String": "24.000 FPS",
    "HeaderSize": 394718,
    "InternetMediaType": "video/mp4",
    "IsStreamable": "Yes",
    "Other_Codec_List": "QuickTime TC",
    "OtherCount": 1,
    "Other_Format_List": "QuickTime TC",
    "Other_Format_WithHint_List": "QuickTime TC",
    "Other_Language_List": "English",
    "OverallBitRate": 9725429.0,
    "OverallBitRate_String": "9 725 kb/s",
    "StreamCount": 1,
    "StreamKindID": 0,
    "StreamKind": "General",
    "StreamKind_String": "General",
    "StreamSize": 395728,
    "StreamSize_Proportion": "0.00055",
    "StreamSize_String1": "386 KiB",
    "StreamSize_String2": "386 KiB",
    "StreamSize_String3": "386 KiB",
    "StreamSize_String4": "386.5 KiB",
    "StreamSize_String5": "386 KiB (0%)",
    "StreamSize_String": "386 KiB (0%)",
    "Tagged_Date": "2008-05-27 18:43:05 UTC",
    "Video_Codec_List": "AVC",
    "VideoCount": 1,
    "Video_Format_List": "AVC",
    "Video_Format_WithHint_List": "AVC",
    "Video_Language_List": "English",
    "extra": {
      "com_apple_quicktime_player_movie_audio_gain": "1.000",
      "com_apple_quicktime_player_movie_audio_treble": "0.000",
      "com_apple_quicktime_player_movie_audio_bass": "0.000",
      "com_apple_quicktime_player_movie_audio_balance": "0.000",
      "com_apple_quicktime_player_movie_audio_pitchshift": "0.000",
      "com_apple_quicktime_player_movie_audio_mute": "(Binary)",
      "com_apple_quicktime_player_movie_visual_brightness": "0.000",
      "com_apple_quicktime_player_movie_visual_color": "1.000",
      "com_apple_quicktime_player_movie_visual_tint": "0.000",
      "com_apple_quicktime_player_movie_visual_contrast": "1.000"
    },
    "Audio": [
      {
        "BitRate": 448000.0,
        "BitRate_Mode": "CBR",
        "BitRate_Mode_String": "Constant",
        "BitRate_String": "448 kb/s",
        "ChannelLayout": "C L R Ls Rs LFE",
        "ChannelPositions": "Front: L C R, Side: L R, LFE",
        "ChannelPositions_String2": "3/2/0.1",
        "Channels": 6,
        "Channels_String": "6 channels",
        "CodecID": "mp4a-40-2",
        "Compression_Mode": "Lossy",
        "Compression_Mode_String": "Lossy",
        "Count": 285,
        "Delay_DropFrame": "No",
        "Delay": 0.0,
        "Delay_Source": "Container",
        "Delay_Source_String": "Container",
        "Delay_String3": "00:00:00.000",
        "Delay_String5": "00:00:00.000",
        "Duration": 596.462,
        "Duration_String1": "9 min 56 s 462 ms",
        "Duration_String2": "9 min 56 s",
        "Duration_String3": "00:09:56.462",
        "Duration_String5": "00:09:56.462",
        "Duration_String": "9 min 56 s",
        "Encoded_Date": "2008-05-27 18:40:12 UTC",
        "Format_AdditionalFeatures": "LC",
        "Format_Commercial": "AAC",
        "Format_Info": "Advanced Audio Codec Low Complexity",
        "Format": "AAC",
        "Format_String": "AAC LC",
        "FrameCount": 27959,
        "FrameRate": 46.875,
        "FrameRate_String": "46.875 FPS (1024 SPF)",
        "ID": "3",
        "ID_String": "3",
        "Language": "en",
        "Language_String1": "English",
        "Language_String2": "en",
        "Language_String3": "eng",
        "Language_String4": "en",
        "Language_String": "English",
        "SamplesPerFrame": 1024.0,
        "SamplingCount": 28630176,
        "SamplingRate": 48000.0,
        "SamplingRate_String": "48.0 kHz",
        "Source_Duration": 596.48,
        "Source_Duration_String1": "9 min 56 s 480 ms",
        "Source_Duration_String2": "9 min 56 s",
        "Source_Duration_String3": "00:09:56.480",
        "Source_Duration_String5": "00:09:56.480",
        "Source_Duration_String": "9 min 56 s",
        "Source_FrameCount": 27960,
        "Source_StreamSize": 32627874,
        "Source_StreamSize_Proportion": "0.04500",
        "Source_StreamSize_String1": "31 MiB",
        "Source_StreamSize_String2": "31 MiB",
        "Source_StreamSize_String3": "31.1 MiB",
        "Source_StreamSize_String4": "31.12 MiB",
        "Source_StreamSize_String5": "31.1 MiB (4%)",
        "Source_StreamSize_String": "31.1 MiB (4%)",
        "StreamCount": 1,
        "StreamKindID": 0,
        "StreamKind": "Audio",
        "StreamKind_String": "Audio",
        "StreamOrder": "2",
        "StreamSize": 32626892,
        "StreamSize_Proportion": "0.04500",
        "StreamSize_String1": "31 MiB",
        "StreamSize_String2": "31 MiB",
        "StreamSize_String3": "31.1 MiB",
        "StreamSize_String4": "31.12 MiB",
        "StreamSize_String5": "31.1 MiB (4%)",
        "StreamSize_String": "31.1 MiB (4%)",
        "Tagged_Date": "2008-05-27 18:43:05 UTC",
        "Video_Delay": 0.0,
        "Video_Delay_String3": "00:00:00.000",
        "Video_Delay_String5": "00:00:00.000"
      }
    ],
    "Other": [
      {
        "Count": 195,
        "Duration": 596.458,
        "Duration_String1": "9 min 56 s 458 ms",
        "Duration_String2": "9 min 56 s",
        "Duration_String3": "00:09:56.458",
        "Duration_String4": "00:09:56:11",
        "Duration_String5": "00:09:56.458 (00:09:56:11)",
        "Duration_String": "9 min 56 s",
        "Format_Commercial": "QuickTime TC",
        "Format": "QuickTime TC",
        "Format_String": "QuickTime TC",
        "FrameCount": 14315,
        "FrameRate_Den": 1,
        "FrameRate": 24.0,
        "FrameRate_Num": 24,
        "FrameRate_String": "24.000 FPS",
        "ID": "2",
        "ID_String": "2",
        "Language": "en",
        "Language_String1": "English",
        "Language_String2": "en",
        "Language_String3": "eng",
        "Language_String4": "en",
        "Language_String": "English",
        "StreamCount": 1,
        "StreamKindID": 0,
        "StreamKind": "Other",
        "StreamKind_String": "Other",
        "StreamOrder": "1",
        "TimeCode_DropFrame": "No",
        "TimeCode_FirstFrame": "00:00:00:00",
        "TimeCode_LastFrame": "00:09:56:10",
        "TimeCode_Stripped": "Yes",
        "TimeCode_Stripped_String": "Yes",
        "Type": "Time code",
        "extra": {
          "Encoded_Date": "2008-04-21 20:24:31 UTC",
          "Tagged_Date": "2008-05-27 18:43:05 UTC"
        }
      }
    ],
    "Video": [
      {
        "BitDepth": 8,
        "BitDepth_String": "8 bits",
        "BitRate": 9282573.0,
        "BitRate_String": "9 283 kb/s",
        "BitsPixel_Frame": 0.187,
        "ChromaSubsampling": "4:2:0",
        "ChromaSubsampling_Position": "Type 2",
        "ChromaSubsampling_String": "4:2:0 (Type 2)",
        "CodecID_Info": "Advanced Video Coding",
        "CodecID": "avc1",
        "ColorSpace": "YUV",
        "colour_description_present": "Yes",
        "colour_description_present_Source": "Container / Stream",
        "colour_primaries": "BT.709",
        "colour_primaries_Source": "Container / Stream",
        "colour_range": "Limited",
        "colour_range_Source": "Stream",
        "Count": 391,
        "Delay_DropFrame": "No",
        "Delay": 0.0,
        "Delay_Settings": "DropFrame=No / 24HourMax=No / IsVisual=No",
        "Delay_Source": "Container",
        "Delay_Source_String": "Container",
        "Delay_String3": "00:00:00.000",
        "Delay_String4": "00:00:00:00",
        "Delay_String5": "00:00:00.000 (00:00:00:00)",
        "DisplayAspectRatio": 1.778,
        "DisplayAspectRatio_String": "16:9",
        "Duration": 596.458,
        "Duration_String1": "9 min 56 s 458 ms",
        "Duration_String2": "9 min 56 s",
        "Duration_String3": "00:09:56.458",
        "Duration_String4": "00:09:56:11",
        "Duration_String5": "00:09:56.458 (00:09:56:11)",
        "Duration_String": "9 min 56 s",
        "Encoded_Date": "2008-04-21 20:24:31 UTC",
        "Format_Commercial": "AVC",
        "Format_Info": "Advanced Video Codec",
        "Format_Level": "4.1",
        "Format": "AVC",
        "Format_Profile": "Main",
        "Format_Settings_CABAC": "No",
        "Format_Settings_CABAC_String": "No",
        "Format_Settings_GOP": "M=2, N=24",
        "Format_Settings": "2 Ref Frames",
        "Format_Settings_RefFrames": 2,
        "Format_Settings_RefFrames_String": "2 frames",
        "Format_Settings_SliceCount": 8,
        "Format_Settings_SliceCount_String": "8 slices per frame",
        "Format_String": "AVC",
        "Format_Url": "http://developers.videolan.org/x264.html",
        "FrameCount": 14315,
        "FrameRate_Den": 1,
        "FrameRate": 24.0,
        "FrameRate_Mode": "CFR",
        "FrameRate_Mode_String": "Constant",
        "FrameRate_Num": 24,
        "FrameRate_String": "24.000 FPS",
        "Height": 1080,
        "Height_String": "1 080 pixels",
        "ID": "1",
        "ID_String": "1",
        "InternetMediaType": "video/H264",
        "Language": "en",
        "Language_String1": "English",
        "Language_String2": "en",
        "Language_String3": "eng",
        "Language_String4": "en",
        "Language_String": "English",
        "matrix_coefficients": "BT.709",
        "matrix_coefficients_Source": "Container / Stream",
        "PixelAspectRatio": 1.0,
        "Rotation": "0.000",
        "Sampled_Height": 1080,
        "Sampled_Width": 1920,
        "ScanType": "Progressive",
        "ScanType_String": "Progressive",
        "Stored_Height": 1088,
        "StreamCount": 1,
        "StreamKindID": 0,
        "StreamKind": "Video",
        "StreamKind_String": "Video",
        "StreamOrder": "0",
        "StreamSize": 692083520,
        "StreamSize_Proportion": "0.95446",
        "StreamSize_String1": "660 MiB",
        "StreamSize_String2": "660 MiB",
        "StreamSize_String3": "660 MiB",
        "StreamSize_String4": "660.0 MiB",
        "StreamSize_String5": "660 MiB (95%)",
        "StreamSize_String": "660 MiB (95%)",
        "Tagged_Date": "2008-05-27 18:43:05 UTC",
        "transfer_characteristics": "BT.709",
        "transfer_characteristics_Source": "Container / Stream",
        "Width": 1920,
        "Width_String": "1 920 pixels",
        "extra": {
          "CodecConfigurationBox": "avcC"
        }
      }
    ],
    "creatingLibrary": {
      "name": "MediaInfoLib",
      "version": 24.12,
      "url": "https://mediaarea.net/MediaInfo"
    }
  }
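
The stream sections in this output ("Audio", "Video", "Other") are each a list of per-stream objects, so reading a value means picking a stream first and then a key. The following is a minimal sketch of that access pattern, not part of Dorsal: `summarize_streams` is a hypothetical helper, and it assumes you already hold the object containing the stream lists as a plain Python dict. The sample values are copied from the example output above and trimmed to the fields used.

```python
from typing import Any


def summarize_streams(record: dict[str, Any]) -> dict[str, Any]:
    """Pull a few headline fields out of a MediaInfo-style record.

    Hypothetical helper for illustration; assumes `record` holds the
    "Audio" / "Video" stream lists shown in the example output.
    """
    summary: dict[str, Any] = {}

    # Each stream kind maps to a list, so take the first entry if present.
    video = (record.get("Video") or [{}])[0]
    audio = (record.get("Audio") or [{}])[0]

    if video:
        summary["resolution"] = f'{video.get("Width")}x{video.get("Height")}'
        summary["video_format"] = video.get("Format")
        summary["frame_rate"] = video.get("FrameRate")
        summary["duration_seconds"] = video.get("Duration")
    if audio:
        summary["audio_format"] = audio.get("Format")
        summary["sampling_rate_hz"] = audio.get("SamplingRate")
    return summary


# Values copied from the example output, trimmed to the fields used here.
sample = {
    "Audio": [{"Format": "AAC", "SamplingRate": 48000.0}],
    "Video": [
        {
            "Format": "AVC",
            "Width": 1920,
            "Height": 1080,
            "FrameRate": 24.0,
            "Duration": 596.458,
        }
    ],
}

print(summarize_streams(sample))
```

Using `.get()` throughout keeps the helper tolerant of files that lack a stream kind, for example an audio-only file with no "Video" list.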

Supported Video Formats

| File Extension | Media Type |
|---|---|
| .mp4 | video/mp4 |
| .mkv | video/x-matroska |
| .avi | video/x-msvideo |
| .mov | video/quicktime |
| .webm | video/webm |
| .wmv | video/x-ms-wmv |
| .flv | video/x-flv |
| .mpeg | video/mpeg |
| .mpg | video/mpeg |
| .m4v | video/x-m4v |

...and many others. See: MediaInfo - Supported Formats

Supported Audio Formats

| File Extension | Media Type |
|---|---|
| .mp3 | audio/mpeg |
| .wav | audio/wav |
| .flac | audio/flac |
| .ogg | audio/ogg |
| .m4a | audio/mp4 |
| .wma | audio/x-ms-wma |
| .aac | audio/aac |

...and many others. See: MediaInfo - Supported Formats

Supported Image Formats

The MediaInfoAnnotationModel provides basic image metadata (dimensions, format, color depth). A specialized EXIFModel for deeper photo metadata is on the roadmap.

| File Extension | Media Type |
|---|---|
| .jpg | image/jpeg |
| .jpeg | image/jpeg |
| .png | image/png |
| .gif | image/gif |
| .bmp | image/bmp |
| .tiff | image/tiff |
| .webp | image/webp |
| .ico | image/vnd.microsoft.icon |

...and many others. See: MediaInfo - Supported Formats


Contribute

Is there a file type you'd like to see a Core Annotation Model for? Dorsal is an open-source project!

We encourage you to open a feature request on GitHub and tell us what you need!