
Working with Files

This guide covers processing files on your local computer to:

  • Extract core metadata
  • Add your own metadata (tags and annotations)
  • Synchronize metadata records with DorsalHub

LocalFile

When working with files on your local machine, use the LocalFile class.

LocalFile is a Python class that you can use to create, update, and manage a metadata record for a single file.

To scan a file, create a new LocalFile instance with the file path:

from dorsal import LocalFile

lf = LocalFile("C:/examples/PDFSPEC.pdf")

When you run this code, Dorsal executes its configurable Annotation Model Pipeline under the hood.

PDF Pipeline Example

For PDFSPEC.pdf, the pipeline can be broken down into 3 stages:

  1. Extract base metadata:

    • Base metadata fields (file hashes, size, name, and media type) are extracted and form an annotation record.
    • The base fields determine which other annotation models execute, based on their dependencies.

  2. Extract PDF metadata: A PDF-specific Annotation Model extracts PDF-specific metadata.

  3. Create File Record: The "base" and "pdf" records are used to form a new File Record.

The diagram below visualizes the pipeline:

graph LR
    Input[PDFSPEC.pdf] --> Stage1
    Stage1[[1: Extract Base Metadata]]
    Stage1 -->|application/pdf| Stage2[[2: Extract PDF Metadata]]


    Stage1 -->|Base Record| Stage3
    Stage2 -->|PDF Record| Stage3

    Stage3[[3: Create File Record]] --> Result([FileRecord])
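
The three stages can be sketched in plain Python. This is only an illustration of the flow in the diagram, not Dorsal's implementation: the function names, the registry, and the stub extractor are all hypothetical, and the media type is guessed from the file extension with the standard library rather than detected from content.

```python
import hashlib
import mimetypes
import os
import tempfile


def extract_pdf_stub(path):
    # Hypothetical stand-in for a PDF-specific annotation model.
    return {"note": "PDF-specific fields would be extracted here"}


# Hypothetical registry: media type -> (record name, extractor function).
MODELS_BY_MEDIA_TYPE = {"application/pdf": ("pdf", extract_pdf_stub)}


def extract_base(path):
    """Stage 1: collect hash, name, size and media type."""
    with open(path, "rb") as fh:
        digest = hashlib.sha256(fh.read()).hexdigest()
    media_type, _ = mimetypes.guess_type(path)  # extension-based guess only
    return {
        "hash": digest,
        "name": os.path.basename(path),
        "size": os.path.getsize(path),
        "media_type": media_type or "application/octet-stream",
    }


def run_pipeline(path):
    """Stages 1 to 3: base record, optional type-specific record, combined record."""
    base = extract_base(path)
    annotations = {"base": base}
    entry = MODELS_BY_MEDIA_TYPE.get(base["media_type"])
    if entry is not None:  # Stage 2 runs only when a model matches the media type
        name, extractor = entry
        annotations[name] = extractor(path)
    return {"hash": base["hash"], "annotations": annotations}  # Stage 3


# Demo on a throwaway file with a .pdf extension.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.pdf")
    with open(path, "wb") as fh:
        fh.write(b"%PDF-1.2 placeholder bytes")
    record = run_pipeline(path)

print(sorted(record["annotations"]))  # ['base', 'pdf']
```

The point of the sketch is the dependency: the second stage is selected by a value (the media type) that only exists once the first stage has run.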

Large files take longer to scan

  • Initializing a LocalFile calculates cryptographic hashes immediately. This process reads every byte of the file and is bound by your disk read speed.

  • Subsequent scans of the same file will be instant due to the Local Record Cache.

Note: If you are running Dorsal in WSL2 in Windows, please read: WSL2 Performance.
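
To see why hashing is bound by read speed, here is a minimal sketch of computing a SHA-256 digest by streaming a file in chunks with the standard library. The helper name `sha256_of_file` and the chunk size are assumptions made for illustration; Dorsal's own hashing code may differ.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it chunk by chunk."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Every byte passes through the hash, so runtime grows with file size.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo: the streamed digest matches hashing the same bytes in one call.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")
    payload = b"dorsal" * 500_000  # roughly 3 MB of data
    with open(path, "wb") as fh:
        fh.write(payload)
    streamed = sha256_of_file(path, chunk_size=64 * 1024)

print(streamed == hashlib.sha256(payload).hexdigest())  # True
```

Reading in chunks keeps memory use flat regardless of file size, but it does not reduce the amount of data that has to come off the disk.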

Accessing File Metadata

Base Metadata

LocalFile exposes some base metadata fields as top-level attributes:

  • hash: The file's SHA-256 hash e.g. "3383fb2ab568ca7019834d438f9a14b9d2ccaa2f37f319373848350005779368"
  • name: The file's name e.g. "PDFSPEC.pdf"
  • extension: The file's extension e.g. ".pdf"
  • size: The file's size in bytes e.g. 1512313
  • size_text: The file's size in human-readable text e.g. "1 MiB"
  • media_type: The media type e.g. "application/pdf"

You can access these fields as attributes on the LocalFile instance:

Accessing attributes on a LocalFile instance
from dorsal import LocalFile

lf = LocalFile("C:/examples/PDFSPEC.pdf")

# Accessing base file properties
print(f"File Name: {lf.name}")
print(f"File Size: {lf.size_text}")
print(f"Media Type: {lf.media_type}")
print(f"SHA-256: {lf.hash}")

Output:

File Name: PDFSPEC.pdf
File Size: 1 MiB
Media Type: application/pdf
SHA-256: 3383fb2ab568ca7019834d438f9a14b9d2ccaa2f37f319373848350005779368
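
The `size_text` value above ("1 MiB" for 1512313 bytes) is consistent with binary units rounded to a whole number. The helper below is a hypothetical sketch of that kind of formatting, not Dorsal's actual code; the rounding rule in particular is an assumption.

```python
def human_size(num_bytes):
    """Format a byte count using binary units, rounded to a whole number."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value drops below 1024,
        # or at the largest unit we know about.
        if value < 1024 or unit == units[-1]:
            return f"{round(value)} {unit}"
        value /= 1024


print(human_size(1512313))  # 1 MiB
print(human_size(2048))     # 2 KiB
```

Because of the rounding, `size_text` is for display only; use `size` when you need the exact byte count.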

Filetype Metadata

Filetype-specific metadata is extracted by one of the Core Annotation Models (in the example above, the PDF Annotation Model). You can read it either through its entry in the annotations object or through a named top-level attribute:

Show Core PDF Annotation Record
from dorsal import LocalFile

lf = LocalFile("C:/examples/PDFSPEC.pdf")

pdf_annotation_record = lf.pdf  # alias for `lf.annotations.file_pdf.record`

print(pdf_annotation_record.model_dump_json(indent=2))

Output:

{
  "author": "Tim Bienz, Richard Cohn, James R. Meehan",
  "title": "Portable Document Format Reference Manual (v 1.2)",
  "creator": "FrameMaker 5.1.1",
  "producer": "Acrobat Distiller 3.0 for Power Macintosh",
  "subject": "Description of the PDF file format",
  "keywords": "Acrobat PDF",
  "version": "1.2",
  "page_count": 394,
  "creation_date": "1996-11-12T03:08:43",
  "modified_date": "1996-11-12T07:58:15"
}

Pydantic Validation

The Pydantic data validation library is used extensively in Dorsal.

Dorsal uses Pydantic for type checking and enforcement, and to validate File Records.

The LocalFile class is effectively a wrapper for a single File Record, which is a Pydantic model, accessed via the LocalFile instance's model attribute.

In the example above, the Pydantic model_dump_json method is available because the pdf attribute is itself a fully validated Pydantic model.

Each Core Annotation Model has a top-level attribute on the LocalFile instance which makes the annotation available, e.g. pdf, epub, mediainfo, office.

Access Core Annotation Attributes
from dorsal import LocalFile

lf = LocalFile("C:/examples/PDFSPEC.pdf")

if lf.pdf is not None:
    print("PDF Annotation Found")
    print(f"PDF Title: {lf.pdf.title}")
    print(f"PDF Page Count: {lf.pdf.page_count}")
    print(f"PDF Creation Date: {lf.pdf.creation_date}")

Output:

PDF Annotation Found
PDF Title: Portable Document Format Reference Manual (v 1.2)
PDF Page Count: 394
PDF Creation Date: 1996-11-12 03:08:43
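
Note that the creation date prints with a space here, while the JSON output earlier shows a `T` between date and time. Assuming the field is held as a standard Python `datetime` (an assumption, though typical for Pydantic models), this is simply the difference between `str()` and ISO 8601 serialization:

```python
from datetime import datetime

created = datetime.fromisoformat("1996-11-12T03:08:43")

print(str(created))         # 1996-11-12 03:08:43
print(created.isoformat())  # 1996-11-12T03:08:43
```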

Validation Schema ID

  • Annotations, such as the one holding core PDF metadata, are all stored within the File Record organized by the ID of the Validation Schema they correspond to.

  • file/pdf is the ID of the Validation Schema for the core PDF metadata annotation.

  • You will learn more about Validation Schemas as we progress through the guide. For now all you need to know is that you can retrieve an annotation by calling the get_annotation method with its Validation Schema ID:

Access Annotation Record Attributes
from dorsal import LocalFile

lf = LocalFile("C:/examples/PDFSPEC.pdf")

# This is equivalent to `pdf_annotation_record = lf.pdf`
pdf_annotation_records = lf.get_annotations("file/pdf")  # The Schema ID for core PDF metadata
pdf_annotation_record = pdf_annotation_records[0]

if pdf_annotation_record is not None:
    print("PDF Annotation Found")
    print(f"PDF Title: {pdf_annotation_record.title}")
    print(f"PDF Page Count: {pdf_annotation_record.page_count}")
    print(f"PDF Creation Date: {pdf_annotation_record.creation_date}")

Output:

PDF Annotation Found
PDF Title: Portable Document Format Reference Manual (v 1.2)
PDF Page Count: 394
PDF Creation Date: 1996-11-12 03:08:43
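
One way to picture "annotations organized by Validation Schema ID" is a mapping from schema ID to a list of records. The class below is a hypothetical sketch for illustration only; it is not how Dorsal stores annotations internally.

```python
from collections import defaultdict


class AnnotationStore:
    """Hypothetical container: annotation records grouped by schema ID."""

    def __init__(self):
        self._by_schema = defaultdict(list)

    def add(self, schema_id, record):
        self._by_schema[schema_id].append(record)

    def get_annotations(self, schema_id):
        # Return a copy; an unknown schema ID yields an empty list.
        return list(self._by_schema.get(schema_id, []))


store = AnnotationStore()
store.add("file/pdf", {"title": "Portable Document Format Reference Manual (v 1.2)", "page_count": 394})

print(store.get_annotations("file/pdf")[0]["page_count"])  # 394
print(store.get_annotations("open/classification"))        # []
```

Returning a list per schema ID explains why the example above indexes the result with `[0]`: a file can carry more than one annotation for the same schema.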

Adding Metadata

A LocalFile instance has methods to manage and enrich different parts of the File Record.

In this section, we'll be using some of these methods to add additional metadata.

Adding Tags

Tags are simple key-value labels. You can add them locally using the add_public_tag and add_private_tag methods. These methods add a NewFileTag to the object's internal model.tags list.

Adding Tags
from dorsal import LocalFile

lf = LocalFile("C:/examples/PDFSPEC.pdf")

# Add a private tag
lf.add_private_tag(name="project_id", value=12345)

# Add a public tag
lf.add_public_tag(name="language", value="eng")

These tags are stored on the local object, allowing you to export them or push them to DorsalHub later.

Viewing Tags
for tag in lf.tags:
    print(f"- {tag.name}: {tag.value} (Private: {tag.private})")

This will output:

- project_id: 12345 (Private: True)
- language: eng (Private: False)

Tag Validation

Tags are validated by the DorsalHub API where possible.

  • Online: If you are authenticated, tags are automatically validated before being added locally.
  • Offline: If you are not authenticated (or Offline Mode is enabled) validation is skipped.

To learn more about tags, see the Tagging System article.

Adding Annotations

When a tag is not enough, use Annotations: structured metadata sub-records linked to the File Record.

Annotations conform to known validation schemas. A validation schema defines the shape of the annotation, and makes downstream processing easy and predictable.

Validation Schemas

Schemas provide structure and rules to annotations on DorsalHub.

A schema_id (e.g., "open/generic") refers to a named Validation Schema on DorsalHub.

A Validation Schema is the formal data specification that defines the structure for an annotation. Validation Schemas are valid JSON Schema documents.

While optional for offline work, any annotation pushed to DorsalHub must conform to a named schema.

Below is an example of an Annotation Record which conforms to the open/classification validation schema:

Language Classification Task Output
{
  "labels": [
    {
      "label": "eng",
      "score": 0.95
    }
  ],
  "vocabulary": [
    "eng",
    "fra"
  ]
}

This Annotation Record contains two top level fields:

  • labels: an array of objects containing values for both label and score
  • vocabulary: a list of all possible labels

These fields are defined in the open/classification schema.
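
To make the idea of "shape" concrete, here is a hand-written check for the two fields described above. It is only a sketch: it is not the real open/classification schema, which is a JSON Schema document on DorsalHub and may impose rules this function does not (and vice versa). The function name and error messages are invented for illustration.

```python
def check_classification(record):
    """Raise ValueError if `record` lacks the shape described in this guide."""
    labels = record.get("labels")
    if not isinstance(labels, list):
        raise ValueError("'labels' must be a list")
    for item in labels:
        if not isinstance(item, dict) or not isinstance(item.get("label"), str):
            raise ValueError("each entry in 'labels' needs a string 'label'")
        score = item.get("score")
        if score is not None and not isinstance(score, (int, float)):
            raise ValueError("'score' must be a number when present")
    vocabulary = record.get("vocabulary")
    if vocabulary is not None:
        if not isinstance(vocabulary, list) or not all(isinstance(v, str) for v in vocabulary):
            raise ValueError("'vocabulary' must be a list of strings")
    return record


good = {"labels": [{"label": "eng", "score": 0.95}], "vocabulary": ["eng", "fra"]}
check_classification(good)  # passes silently

try:
    check_classification({"labels": "eng"})
except ValueError as exc:
    print(exc)  # 'labels' must be a list
```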

Let's add it to our File Record.

Example 1: The Standard Approach

First, we will add the annotation using the add_private_annotation method on the LocalFile class.

This is the standard way to add annotation records of any kind.

  • We define our annotation record as a dictionary, making sure it matches the format expected by the open/classification validation schema.
  • The add_private_annotation method inserts the annotation on our LocalFile instance
  • The annotation is validated against the schema we name with the schema_id argument.

Add Private Annotation
from dorsal import LocalFile

lf = LocalFile("C:/examples/PDFSPEC.pdf")

# Define the annotation record
annotation_record = {
    "labels": [{"label": "eng", "score": 0.95}],
    "vocabulary": ["eng", "fra", "deu", "ara", "zho"],
}

# Tell the method which 'schema_id' we wish to validate against
lf.add_private_annotation(
    schema_id="open/classification",
    annotation_record=annotation_record,
    source="MyLanguageClassifier v1.2"  # Optionally provide 'source' text
)

Example 2: The Dedicated Method

We also have a choice of several high-level helper methods which will construct the record for us.

Instead of using add_private_annotation, we can use the add_classification method and pass the parts of the record as individual arguments:

Show Private Annotation
from dorsal import LocalFile

lf = LocalFile("C:/examples/PDFSPEC.pdf")

lf.add_classification(
    labels=[{"label": "eng", "score": 0.95}],
    vocabulary=["eng", "fra"],
    source="MyLanguageClassifier v1.2"
)

This approach often simplifies the task, as it handles the creation of the record, but the outcome is the same.

Regardless of which approach we choose, the annotation is validated and populated on the LocalFile immediately. It is available on the annotations attribute, or by calling the get_annotations method, which returns a list of Annotation objects with attribute access to the data we just added.

Display Private Annotation
from dorsal import LocalFile

lf = LocalFile("C:/examples/PDFSPEC.pdf")

lf.add_classification(
    labels=[{"label": "eng", "score": 0.95}],
    vocabulary=["eng", "fra"],
    source="MyLanguageClassifier v1.2"
)

annotations = lf.get_annotations("open/classification")
my_annotation = annotations[0]

# Like all annotations, it has a `record` and a `source` field.
my_annotation_record = my_annotation.record
my_annotation_source = my_annotation.source

print(f"File Hash: {my_annotation_record.file_hash}")
print(f"Labels: {my_annotation_record.labels}")
print(f"Vocabulary: {my_annotation_record.vocabulary}")
print(f"Source: {my_annotation_source.detail}")
print(f"Source Type: {my_annotation_source.type}")

Output:

File Hash: 3383fb2ab568ca7019834d438f9a14b9d2ccaa2f37f319373848350005779368
Labels: [{'label': 'eng', 'score': 0.95}]
Vocabulary: ['eng', 'fra']
Source: MyLanguageClassifier v1.2
Source Type: Manual

Let's look at the entire annotation record as a JSON object:

Display Private Annotation
from dorsal import LocalFile

lf = LocalFile("C:/examples/PDFSPEC.pdf")

lf.add_classification(
    labels=[{"label": "eng", "score": 0.95}],
    vocabulary=["eng", "fra"],
    source="MyLanguageClassifier v1.2",
    private=True  # default value
)

my_annotation = lf.get_annotation("open/classification")
print(my_annotation.model_dump_json(indent=2))

Output:

{
  "record": {
    "labels": [
      {
        "label": "eng",
        "score": 0.95
      }
    ],
    "vocabulary": [
      "eng",
      "fra"
    ]
  },
  "private": true,
  "source": {
    "type": "Manual",
    "detail": "MyLanguageClassifier v1.2"
  },
  "schema_version": "1.0.0"
}

  • record: JSON content, validated against the named schema (in this case, the open/classification schema)
  • private: Privacy of the annotation. Set when the annotation is added locally. Pass private=False to create a public annotation.
  • source: Metadata about the annotation. For Manual type annotations, the detail field maps directly to the source argument.
  • schema_version: The version of the schema which was used to validate this annotation.
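
The four fields above form an envelope around the record. As a rough sketch of that structure using only the standard library (the class names are hypothetical, and Dorsal's real models are Pydantic models rather than dataclasses):

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Source:
    type: str
    detail: str


@dataclass
class Annotation:
    record: dict          # the schema-validated content
    private: bool         # privacy flag set when the annotation is added
    source: Source        # where the annotation came from
    schema_version: str   # version of the schema used for validation


annotation = Annotation(
    record={"labels": [{"label": "eng", "score": 0.95}], "vocabulary": ["eng", "fra"]},
    private=True,
    source=Source(type="Manual", detail="MyLanguageClassifier v1.2"),
    schema_version="1.0.0",
)

# asdict() converts nested dataclasses too, so the result is JSON-serializable.
as_json = json.dumps(asdict(annotation), indent=2)
print(as_json)
```

Keeping the record separate from the envelope means the schema only has to describe the content, while privacy and provenance are handled the same way for every annotation type.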

Exporting the File Record

Each time we add a tag or annotation to the LocalFile object, we are modifying its embedded File Record.

Once we are done, the next step is to export the File Record to save our work.

Exporting Locally (Offline)

  • Use the save method to serialize a File Record:

    Saving to JSON
    from dorsal import LocalFile
    
    lf = LocalFile("C:/examples/PDFSPEC.pdf")
    lf.add_private_tag(name="project_id", value=12345)
    
    json_output = lf.save("record_export.json")
    
  • This process works both ways: a LocalFile can be (re)created from an exported File Record.

  • Use LocalFile.from_json to create a new LocalFile instance from a saved File Record:

    Loading from JSON
    from dorsal import LocalFile
    
    lf = LocalFile("C:/examples/PDFSPEC.pdf")
    lf.add_private_tag(name="project_id", value=12345)
    
    json_output = lf.save("record_export.json")
    
    new_lf = LocalFile.from_json("record_export.json")
    
    # The content of these is identical
    assert lf.to_json() == new_lf.to_json()
    
  • A LocalFile instance created in this way is functionally identical to one created by pointing it at the original file.

  • Note: LocalFile.from_json only works if the serialized record is a valid FileRecordStrict.

Indexing to DorsalHub (Online)

Authentication Required

This action requires authentication. See the Quick Start for details.

  • The LocalFile.push method uploads the entire File Record (LocalFile.model), as a validated FileRecordStrict object, to the DorsalHub API.
  • The file itself never leaves your machine. Only the structured metadata record is published.

    Indexing to DorsalHub
    from dorsal import LocalFile
    
    lf = LocalFile("C:/examples/PDFSPEC.pdf")
    lf.add_private_tag(name="project_id", value=12345)
    
    response = lf.push()
    

DorsalHub is Private by Default

Just like the dorsal file push command, calling LocalFile.push() is the same as calling LocalFile.push(private=True).

To make a record public, call lf.push(private=False).

➡️ Continue to: 2. Working with Remote File Records