Working with Files
This guide covers processing files on your local computer to:
- Extract core metadata
- Add your own metadata (tags and annotations)
- Synchronize metadata records with DorsalHub
LocalFile
When working with files on your local machine, use the LocalFile class.
LocalFile is a Python class that you can use to create, update, and manage a metadata record for a single file.
To scan a file, create a new LocalFile instance with the file path:
When you run this code, Dorsal executes its configurable Annotation Model Pipeline under the hood.
PDF Pipeline Example
For PDFSPEC.pdf, the pipeline can be broken down into 3 stages:
- Extract base metadata:
  - Base metadata fields (file hashes, size, name, media type) are extracted and form an annotation record.
  - The base fields determine which other annotation models execute, based on their dependencies.
- Extract PDF metadata: A PDF-specific Annotation Model extracts PDF-specific metadata.
- Create File Record: The "base" and "pdf" records are used to form a new File Record.
The diagram below visualizes the pipeline:
graph LR
Input[PDFSPEC.pdf] --> Stage1
Stage1[[1: Extract Base Metadata]]
Stage1 -->|application/pdf| Stage2[[2: Extract PDF Metadata]]
Stage1 -->|Base Record| Stage3
Stage2 -->|PDF Record| Stage3
Stage3[[3: Create File Record]] --> Result([FileRecord])
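The three stages can be sketched in plain Python. This is a minimal illustration using only the standard library, not Dorsal's implementation: the function names (extract_base, extract_pdf, build_file_record), the MODELS_BY_MEDIA_TYPE registry, and the record layout are all hypothetical, and the PDF stage only reads the version from the file header.

```python
import hashlib
import mimetypes
from pathlib import Path


def extract_base(path: Path) -> dict:
    """Stage 1: base metadata (hash, size, name, media type)."""
    data = path.read_bytes()
    media_type, _ = mimetypes.guess_type(path.name)
    return {
        "hash": hashlib.sha256(data).hexdigest(),
        "size": len(data),
        "name": path.name,
        "media_type": media_type or "application/octet-stream",
    }


def extract_pdf(path: Path) -> dict:
    """Stage 2 (hypothetical): read the version from a header such as b'%PDF-1.2'."""
    header = path.read_bytes()[:8]
    version = header[5:8].decode("ascii") if header.startswith(b"%PDF-") else None
    return {"version": version}


# Hypothetical registry: media type -> (annotation key, extractor function).
MODELS_BY_MEDIA_TYPE = {"application/pdf": ("pdf", extract_pdf)}


def build_file_record(path: Path) -> dict:
    """Stage 3: combine the base record and any filetype record."""
    base = extract_base(path)
    annotations = {"base": base}
    model = MODELS_BY_MEDIA_TYPE.get(base["media_type"])
    if model is not None:
        key, extractor = model
        annotations[key] = extractor(path)
    return {"hash": base["hash"], "annotations": annotations}
```

The point of the sketch is the dispatch step: the media type found in stage 1 decides whether a filetype-specific extractor runs at all, which mirrors the dependency described above.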
Large files take longer to scan
- Initializing a LocalFile calculates cryptographic hashes immediately. This process reads every byte of the file and is bound by your disk read speed.
- Subsequent scans of the same file will be instant due to the Local Record Cache.
Note: If you are running Dorsal in WSL2 in Windows, please read: WSL2 Performance.
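To see why hashing cost grows with file size, here is a minimal sketch of streaming a file through SHA-256 with the standard library. It is an illustration only: the helper name sha256_of_file and the 1 MiB chunk size are assumptions, not Dorsal internals.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so memory use stays constant."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Every byte is read exactly once, so runtime tracks file size and disk speed.
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory flat for multi-gigabyte files, but it cannot avoid the full read, which is why a cache of previously computed records makes repeat scans fast.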
Accessing File Metadata
Base Metadata
LocalFile exposes some base metadata fields as top-level attributes:
- hash: The file's SHA-256 hash, e.g. "3383fb2ab568ca7019834d438f9a14b9d2ccaa2f37f319373848350005779368"
- name: The file's name, e.g. "PDFSPEC.pdf"
- extension: The file's extension, e.g. ".pdf"
- size: The file's size in bytes, e.g. 1512313
- size_text: The file's size in human-readable text, e.g. "1 MiB"
- media_type: The media type, e.g. "application/pdf"
You can access these fields as attributes on the LocalFile instance:
Output:
File Name: PDFSPEC.pdf
File Size: 1 MiB
Media Type: application/pdf
SHA-256: 3383fb2ab568ca7019834d438f9a14b9d2ccaa2f37f319373848350005779368
Filetype Metadata
Filetype-specific metadata is extracted by one of the Core Annotation Models (in the example above, the PDF Annotation Model). You can read it either from its entry in the annotations object or through a named top-level attribute:
Output:
{
"author": "Tim Bienz, Richard Cohn, James R. Meehan",
"title": "Portable Document Format Reference Manual (v 1.2)",
"creator": "FrameMaker 5.1.1",
"producer": "Acrobat Distiller 3.0 for Power Macintosh",
"subject": "Description of the PDF file format",
"keywords": "Acrobat PDF",
"version": "1.2",
"page_count": 394,
"creation_date": "1996-11-12T03:08:43",
"modified_date": "1996-11-12T07:58:15"
}
Pydantic Validation
The Pydantic data validation library is used extensively in Dorsal.
Dorsal uses Pydantic for type checking and enforcement, and to validate File Records.
The LocalFile class is effectively a wrapper for a single File Record, which is a Pydantic model, accessed via the LocalFile instance's model attribute.
In the example above, the Pydantic model_dump_json method is available because the pdf attribute is itself a fully validated Pydantic model.
Each Core Annotation Model has a top-level attribute on the LocalFile instance that makes its annotation available, e.g. pdf, epub, mediainfo, office.
Output:
PDF Annotation Found
PDF Title: Portable Document Format Reference Manual (v 1.2)
PDF Page Count: 394
PDF Creation Date: 1996-11-12 03:08:43
Validation Schema ID
- Annotations, such as the one holding core PDF metadata, are all stored within the File Record, organized by the ID of the Validation Schema they correspond to.
- file/pdf is the ID of the Validation Schema for the core PDF metadata annotation.
- You will learn more about Validation Schemas as we progress through the guide. For now, all you need to know is that you can retrieve an annotation by calling the get_annotation method with its Validation Schema ID:
Output:
PDF Annotation Found
PDF Title: Portable Document Format Reference Manual (v 1.2)
PDF Page Count: 394
PDF Creation Date: 1996-11-12 03:08:43
Adding Metadata
A LocalFile instance has methods to manage and enrich different parts of the File Record.
In this section, we'll be using some of these methods to add additional metadata.
Adding Tags
Tags are simple key-value labels. You can add them locally using the add_public_tag and add_private_tag methods. These methods add a NewFileTag to the object's internal model.tags list.
This will output:
Tag Validation
Tags are validated by the DorsalHub API where possible.
- Online: If you are authenticated, tags are automatically validated before being added locally.
- Offline: If you are not authenticated (or Offline Mode is enabled) validation is skipped.
To learn more about tags, see the Tagging System article.
Adding Annotations
When you want to add more than a tag, use Annotations: structured metadata sub-records linked to the File Record.
Annotations conform to known validation schemas. A validation schema defines the shape of the annotation, and makes downstream processing easy and predictable.
Validation Schemas
Schemas provide structure and rules to annotations on DorsalHub.
A schema_id (e.g., "open/generic") refers to a named Validation Schema on DorsalHub.
A Validation Schema is the formal data specification that defines the structure for an annotation. Validation Schemas are valid JSON Schema documents.
Schemas are optional for offline work, but any annotation pushed to DorsalHub must conform to a named schema.
Below is an example of an Annotation Record which conforms to the open/classification validation schema:
{
"labels": [
{
"label": "eng",
"score": 0.95
}
],
"vocabulary": [
"eng",
"fra"
]
}
This Annotation Record contains two top-level fields:
- labels: an array of objects containing values for both label and score
- vocabulary: a list of all possible labels
These fields are defined in the open/classification schema.
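To make the shape concrete, the sketch below checks a record against the two fields just described. It is a hand-written stand-in, not the actual open/classification schema: the function name check_classification_record is hypothetical, and the rules that score must be numeric and that every label must appear in vocabulary are assumptions made for illustration.

```python
def check_classification_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    labels = record.get("labels")
    vocabulary = record.get("vocabulary")

    if not isinstance(labels, list):
        problems.append("labels must be a list")
        labels = []
    if vocabulary is not None and not isinstance(vocabulary, list):
        problems.append("vocabulary must be a list")
        vocabulary = None

    for i, item in enumerate(labels):
        if not isinstance(item, dict) or not isinstance(item.get("label"), str):
            problems.append(f"labels[{i}] needs a string 'label'")
            continue
        score = item.get("score")
        # bool is a subclass of int in Python, so reject it explicitly.
        if score is not None and (isinstance(score, bool) or not isinstance(score, (int, float))):
            problems.append(f"labels[{i}].score must be a number")
        # Assumed rule: a label should come from the declared vocabulary.
        if vocabulary is not None and item["label"] not in vocabulary:
            problems.append(f"labels[{i}].label is not in vocabulary")
    return problems


record = {"labels": [{"label": "eng", "score": 0.95}], "vocabulary": ["eng", "fra"]}
print(check_classification_record(record))  # []
```

A real Validation Schema is a JSON Schema document, so in practice these checks would be expressed declaratively rather than as hand-written conditionals.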
Let's add it to our File Record.
Option 1: The Standard Approach
First, we will add the annotation using the add_private_annotation method on the LocalFile class.
This is the standard way to add annotation records of any kind.
- We define our annotation record as a dictionary, making sure it matches the format expected by the open/classification validation schema.
- The add_private_annotation method inserts the annotation on our LocalFile instance.
- The annotation is validated against the schema we name with the schema_id argument.
Option 2: The Dedicated Method
We also have a choice of several high-level helper methods which will construct the record for us.
Instead of using add_private_annotation, we can use the add_classification method and pass the record's fields as individual arguments:
This approach often simplifies the task, as it handles the creation of the record, but the outcome is the same.
Regardless of which approach we choose, the annotation is validated and populated on the LocalFile immediately. It is available on the annotations attribute, or by calling the get_annotation method, which returns a list of Annotation objects with attribute access to the data we just added.
Output:
File Hash: 3383fb2ab568ca7019834d438f9a14b9d2ccaa2f37f319373848350005779368
Labels: [{'label': 'eng', 'score': 0.95}]
Vocabulary: ['eng', 'fra']
Source: MyLanguageClassifier v1.2
Source Type: Manual
Let's look at the entire annotation record as a JSON object:
Output:
{
"record": {
"labels": [
{
"label": "eng",
"score": 0.95
}
],
"vocabulary": [
"eng",
"fra"
]
},
"private": true,
"source": {
"type": "Manual",
"detail": "MyLanguageClassifier v1.2"
},
"schema_version": "1.0.0"
}
- record: JSON content, validated against the named schema (in this case, the open/classification schema).
- private: Privacy of the annotation. Set when the annotation is added locally. Use the private=False argument to create a public annotation.
- source: Metadata about the annotation. For Manual type annotations, the detail field maps directly to the source argument.
- schema_version: The version of the schema which was used to validate this annotation.
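The envelope described above can be mimicked with plain dictionaries. This sketch is not Dorsal code: wrap_annotation is a hypothetical helper that simply mirrors the field names shown in the JSON output, mapping its source argument to source.detail as described for Manual annotations.

```python
import json


def wrap_annotation(record: dict, source: str, private: bool = True,
                    schema_version: str = "1.0.0") -> dict:
    """Wrap a validated record in the envelope fields shown above (hypothetical helper)."""
    return {
        "record": record,
        "private": private,
        "source": {"type": "Manual", "detail": source},
        "schema_version": schema_version,
    }


annotation = wrap_annotation(
    {"labels": [{"label": "eng", "score": 0.95}], "vocabulary": ["eng", "fra"]},
    source="MyLanguageClassifier v1.2",
)
print(json.dumps(annotation, indent=2))
```

Keeping the schema-validated payload under record, separate from privacy and provenance, means the same payload can be re-wrapped as public or attributed to a different source without touching its contents.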
Exporting the File Record
Each time we add a tag or annotation to the LocalFile object, we are modifying its embedded File Record.
Once we are done, the next step is to export the record to save our work.
Exporting Locally (Offline)
- Use the save method to serialize a File Record:
- This process works both ways: a LocalFile can be (re)created from an exported File Record.
- Use LocalFile.from_json to create a new LocalFile instance from a saved File Record:
- A LocalFile instance created in this way is functionally identical to one created by pointing it at the original file.
- Note: LocalFile.from_json only works if the serialized record is a valid FileRecordStrict.
Indexing to DorsalHub (Online)
Authentication Required
This action requires authentication. See the Quick Start for details.
- The LocalFile.push method uploads the entire File Record (LocalFile.model, as a validated FileRecordStrict object) to the DorsalHub API.
- The file itself never leaves your machine. Only the structured metadata record is published.
DorsalHub is Private by Default
Just like the dorsal file push command, calling LocalFile.push() is the same as calling LocalFile.push(private=True).
To make a record public, call lf.push(private=False).