The Dorsal File Record
The Dorsal File Record (often shortened to File Record or simply record) is a highly structured object which contains file metadata.
A single File Record contains the metadata for one file.
When you view file information on DorsalHub, everything you see is generated from the File Record for that file. This may include user-contributed tags, such as language or country, as well as filetype-specific data, such as page_count for documents or codec for media files.
File Record
Throughout this documentation, the term File Record is used instead of Dorsal File Record, but the meaning is the same.
Creating File Records
File Records are a fundamental building block in Dorsal.
To generate a File Record from a file on your computer, you can run dorsal file scan or use the LocalFile class in Python:
- If you are using the Command Line Interface (CLI), the File Record will be displayed in the terminal. See CLI Guide: Files.
- If you are using the Python API, the LocalFile object represents a single File Record. See Python Guide: Files.
Dorsal contains many tools and utilities for generating and enriching File Records locally, while DorsalHub extends this capability by making collaboration possible.
A Tale of Two File Records
If you generate a local File Record for a file which already has a public record on DorsalHub, you may notice that the File Record on DorsalHub has annotations or tags which your locally generated one does not.
This is because the public File Record on DorsalHub may have additional annotations and tags added by other users.
Your locally generated File Record is based entirely on a file on your system, and the annotations are limited to those in the Annotation Model Pipeline you used to generate it.
Crucially, if you modify your local File Record (adding your own custom tags and annotations) and then push it to DorsalHub as a public record, then any public tags or annotations you added will become part of that public File Record.
File Record Structure
The simplest and most universal way to represent a File Record is as a JSON object. In Dorsal, however, you can also work with File Records as Python dictionaries, interactive LocalFile objects, or Pydantic models.
A File Record is composed of three main parts:
- Hashes: Long sequences of letters and numbers, used to identify the file.
- Annotations: Structured sub-records containing information relevant to the file.
- Tags: User-generated key-value pairs which describe some aspect of the file.
For the rest of this tutorial, we will inspect a File Record as a JSON object.
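As a rough illustration of this three-part layout, the sketch below parses an abridged File Record with Python's standard json module and groups its top-level keys. The helper split_record is hypothetical and not part of Dorsal, and treating every key that is "hash" or ends in "_hash" as a hash field is an assumption based on the field names used in this document.

```python
import json

# An abridged File Record, shortened from the examples in this document.
RECORD_JSON = """
{
  "hash": "3383fb2ab568ca7019834d438f9a14b9d2ccaa2f37f319373848350005779368",
  "annotations": {
    "file/base": {"record": {"name": "PDFSPEC.pdf"}, "source": {"type": "Model"}}
  },
  "tags": [{"name": "language", "value": "English", "private": true}]
}
"""


def split_record(record: dict) -> dict:
    """Group a File Record's top-level keys into hashes, annotations and tags."""
    # Assumption: hash fields are "hash" plus any key ending in "_hash".
    hashes = {k: v for k, v in record.items() if k == "hash" or k.endswith("_hash")}
    return {
        "hashes": hashes,
        "annotations": record.get("annotations", {}),
        "tags": record.get("tags", []),
    }


parts = split_record(json.loads(RECORD_JSON))
print(sorted(parts["hashes"]))      # names of the hash fields present
print(list(parts["annotations"]))   # names of the annotations present
print(len(parts["tags"]))           # number of tags
```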
1. Hashes
These sit in the top level of the File Record, and are used to identify the file.
The hashes are 64-character hexadecimal sequences, which look like this: 9abdfb32750a278d5ca550b876e94a72cd8eec82d0e506a127dfb94bd56ca4b2. The exception is similarity_hash, which has a different format.
| Field Name | Description | Note |
|---|---|---|
| hash | The SHA-256 hash of the file. This is the file's primary ID on DorsalHub. | |
| validation_hash | The BLAKE3 hash. This is used server-side for secondary validation checks. | |
| quick_hash | A sample-based hash used for quick lookups. (see: Concepts: Quick Hash) | Not available for files smaller than 32 MiB. |
| similarity_hash | The TLSH hash (a kind of similarity hash). | Optional. Requires the py-tlsh package to be installed. |
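To make the primary hash concrete, here is a minimal sketch of computing a file's SHA-256 digest with Python's standard hashlib module and checking that a string has the 64-character hexadecimal shape described above. This is not Dorsal's own implementation: sha256_of_file and looks_like_hex_hash are hypothetical helper names, and the other hashes (BLAKE3, quick hash, TLSH) are not covered here.

```python
import hashlib
import string


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are never loaded fully into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def looks_like_hex_hash(value: str) -> bool:
    """True if value is a 64-character hexadecimal string."""
    return len(value) == 64 and all(c in string.hexdigits for c in value)


# The example hash from this section passes the shape check.
print(looks_like_hex_hash(
    "9abdfb32750a278d5ca550b876e94a72cd8eec82d0e506a127dfb94bd56ca4b2"
))
```

If the sketch matches what Dorsal computes, sha256_of_file("PDFSPEC.pdf") would return the same value as the hash field of the record for that file.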
Example
Here is the top part of a locally-generated File Record representing PDFSPEC.pdf.
This contains the hashes which identify the file:
{
"hash": "3383fb2ab568ca7019834d438f9a14b9d2ccaa2f37f319373848350005779368",
"validation_hash": "9abdfb32750a278d5ca550b876e94a72cd8eec82d0e506a127dfb94bd56ca4b2",
"similarity_hash": "T13465D67BB4C61D6DF893CA46571C579B8B0D71533BAEA58604BDAF0AC6338029AC3F41",
...
}
quick_hash is not part of this record because the file's size is below 32 MiB.
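The size rule can be expressed in a few lines. The sketch below is only an illustration of the 32 MiB threshold stated in the table: the function name is hypothetical, and treating a file of exactly 32 MiB as eligible is an assumption drawn from the wording "smaller than 32 MiB".

```python
QUICK_HASH_MIN_SIZE = 32 * 1024 * 1024  # 32 MiB expressed in bytes


def expects_quick_hash(size_in_bytes: int) -> bool:
    """True if a file of this size is large enough to have a quick_hash."""
    # Assumption: the boundary is inclusive, i.e. exactly 32 MiB qualifies.
    return size_in_bytes >= QUICK_HASH_MIN_SIZE


# PDFSPEC.pdf is 1512313 bytes (see the size field in its file/base annotation).
print(expects_quick_hash(1512313))
```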
Note
While they can be used to look up a File Record, the validation_hash and quick_hash are not included in any server responses from DorsalHub when retrieving records.
2. Annotations
An annotation is a metadata sub-record which provides additional information relevant to the File Record.
They can be found in the annotations key in the File Record root.
annotations is a container (a JSON object) that holds named annotation sub-records.
Each annotation has two parts:
- record or Annotation Record: This provides information about each file in a structured format.
- source or Annotation Source: This provides information about the origin of the annotation, typically which model generated it.
Crucially, every Annotation Record conforms to a specific named validation schema.
If we scan PDFSPEC.pdf using either dorsal file scan or LocalFile in Python, a File Record is generated which contains two annotations: file/base and file/pdf.
You can see file/base and file/pdf as keys in the annotations field of the File Record below:
{
"hash": "...",
"validation_hash": "...",
"similarity_hash": "...",
"annotations": {
"file/base": { ... },
"file/pdf": { ... },
},
...
}
Let's look at these in more detail.
The "base" Annotation: file/base
Every File Record has a file/base annotation. This contains the most fundamental information about a file: its name, size, media_type and so on.
Example
{
"hash": "...",
...
"annotations": {
"file/base": {
"record": {
"hash": "3383fb2ab568ca7019834d438f9a14b9d2ccaa2f37f319373848350005779368",
"similarity_hash": "T13465D67BB4C61D6DF893CA46571C579B8B0D71533BAEA58604BDAF0AC6338029AC3F41",
"name": "PDFSPEC.pdf",
"extension": ".pdf",
"size": 1512313,
"media_type": "application/pdf",
"media_type_prefix": "application"
},
"source": {
"type": "Model",
"model": "dorsal/base",
"version": "1.0.0"
}
}
...
}
}
As with all annotations, the file/base annotation has two separate parts:
- record: For file/base annotations, this contains core file metadata and is validated against the file/base schema.
- source: Indicates that this annotation was generated by the dorsal/base model.
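As a small illustration of working with these two parts, the sketch below loads an abridged copy of the file/base annotation shown above and reads a few fields from each. It uses only the standard json module. The check that media_type_prefix equals the part of media_type before the slash is an inference from this one example, not a documented rule.

```python
import json

# Abridged copy of the file/base annotation from the example above.
FILE_BASE_JSON = """
{
  "record": {
    "hash": "3383fb2ab568ca7019834d438f9a14b9d2ccaa2f37f319373848350005779368",
    "name": "PDFSPEC.pdf",
    "extension": ".pdf",
    "size": 1512313,
    "media_type": "application/pdf",
    "media_type_prefix": "application"
  },
  "source": {"type": "Model", "model": "dorsal/base", "version": "1.0.0"}
}
"""

annotation = json.loads(FILE_BASE_JSON)
record, source = annotation["record"], annotation["source"]

# Assumption: media_type_prefix is the part of media_type before the slash.
derived_prefix = record["media_type"].split("/", 1)[0]

print(f"{record['name']}: {record['size']} bytes, {record['media_type']}")
print(f"generated by {source['model']} version {source['version']}")
print(derived_prefix == record["media_type_prefix"])
```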
File-Type Annotation: file/pdf
Dorsal comes with a number of Annotation Models for extracting key metadata from files of a particular type (e.g. Ebooks, PDFs, Media Files).
When you generate a File Record for a PDF, a PDF-specific Annotation Model will run, and its output will be saved as the file/pdf annotation.
Example
{
"hash": "...",
...
"annotations": {
"file/base": { ... },
"file/pdf": {
"record": {
"author": "Tim Bienz, Richard Cohn, James R. Meehan",
"title": "Portable Document Format Reference Manual (v 1.2)",
"creator": "FrameMaker 5.1.1",
"producer": "Acrobat Distiller 3.0 for Power Macintosh",
"subject": "Description of the PDF file format",
"keywords": "Acrobat PDF",
"version": "1.2",
"page_count": 394,
"creation_date": "1996-11-12T03:08:43",
"modified_date": "1996-11-12T07:58:15"
},
"source": {
"type": "Model",
"model": "dorsal/pdf",
"variant": "pypdfium2",
"version": "1.0.0"
}
}
}
}
- record: This contains file metadata specific to PDFs, and is validated against the file/pdf schema.
- source: This indicates that this annotation was generated by the dorsal/pdf Annotation Model.
Notice the pattern is the same: a schema-validated record and a source explaining where it came from.
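Because every annotation follows this record-plus-source pattern, one loop can handle all of them. The sketch below is a hypothetical helper (describe_annotations is not part of Dorsal) that works on a File Record loaded as a plain Python dictionary, using abridged data from the examples above.

```python
def describe_annotations(file_record: dict) -> list:
    """Return one summary line per annotation in a File Record dict."""
    lines = []
    for schema_name, annotation in file_record.get("annotations", {}).items():
        source = annotation.get("source", {})
        # List the record's field names alphabetically, next to where it came from.
        fields = ", ".join(sorted(annotation.get("record", {})))
        origin = source.get("model", "unknown")
        lines.append(f"{schema_name} <- {source.get('type')} ({origin}): {fields}")
    return lines


# Abridged from the file/base and file/pdf examples above.
example = {
    "annotations": {
        "file/base": {
            "record": {"name": "PDFSPEC.pdf", "size": 1512313},
            "source": {"type": "Model", "model": "dorsal/base", "version": "1.0.0"},
        },
        "file/pdf": {
            "record": {"page_count": 394, "version": "1.2"},
            "source": {"type": "Model", "model": "dorsal/pdf", "version": "1.0.0"},
        },
    }
}

for line in describe_annotations(example):
    print(line)
```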
Custom Annotations
You are not limited to the file-type models built into Dorsal. You can annotate File Records manually in the Python API using the LocalFile object. You can also include your own custom Annotation Models by modifying the pipeline config.
The example below shows a manually added custom annotation:
Example
{
"hash": "...",
...
"annotations": {
"file/base": { ... },
"file/pdf": { ... },
"open/classification": {
"record": {
"labels": [
{
"label": "eng",
"score": 0.95
}
],
"vocabulary": [
"eng",
"fra"
]
},
"private": true,
"source": {
"type": "Manual",
"detail": "Language Detection Model v0.1"
},
"schema_version": "1.0.0"
}
}
}
Unlike the default file-type annotations like file/base and file/pdf, custom annotations have one additional top-level field:
- private: This boolean (True/False) field indicates whether the annotation is private (only you can view it) or public (anyone can view it).
Like all other annotations, they contain record and source fields:
- record: This holds data which validates against the open/classification validation schema. In this case, it's a single language label "eng" with a confidence score and a vocabulary of possible labels.
- source: Indicates this was added manually (e.g. via a LocalFile object) and also has a "detail" field with more information.
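The shape of such an annotation can be sketched with plain Python dictionaries. The helper below is hypothetical and does not use the Dorsal API or the real open/classification schema: it only mirrors the fields visible in the example above. The two checks (labels must appear in the vocabulary, scores lie between 0 and 1) are assumptions made for illustration.

```python
def make_classification_annotation(label, score, vocabulary, detail, private=True):
    """Build a dict shaped like the open/classification example above."""
    # Assumption: every label should appear in the vocabulary.
    if label not in vocabulary:
        raise ValueError(f"label {label!r} is not in the vocabulary {vocabulary!r}")
    # Assumption: scores are confidences in the range 0 to 1.
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    return {
        "record": {
            "labels": [{"label": label, "score": score}],
            "vocabulary": list(vocabulary),
        },
        "private": private,
        "source": {"type": "Manual", "detail": detail},
        "schema_version": "1.0.0",
    }


annotation = make_classification_annotation(
    "eng", 0.95, ["eng", "fra"], "Language Detection Model v0.1"
)

# Attach it to a File Record held as a dictionary (other fields elided).
file_record = {"annotations": {}}
file_record["annotations"]["open/classification"] = annotation
print(sorted(annotation))
```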
To learn more about annotating File Records, see the Python Guide: Working with Files
Open Validation Schemas
All annotations published to DorsalHub must conform to a named validation schema.
There are a number of validation schemas available, including open/classification, as seen in the example above.
Annotation Stubs
When retrieving File Records via the API, all custom annotations are included as Annotation Stubs.
Annotation Stubs provide key information about the annotation, including the source data and a retrieval URL.
Crucially, Annotation Stubs do not contain the record content of the annotation. This prevents File Records becoming too bloated, as a single File Record may have dozens or even hundreds of annotations.
Below is an extract from a File Record retrieved from DorsalHub, showing the open/classification annotations as an array. An array is used because more than one annotation conforming to the same schema may be linked to a file.
Example:
{
"hash": "...",
...
"annotations": {
"file/base": { ... },
"file/pdf": { ... },
"open/classification": [
{
"id": "8171d2fe-e433-4ddc-a525-56aef19734c2",
"source": {
"type": "Manual",
"id": "Language Detection Model v0.1"
},
"user_no": 1000004,
"date_modified": "2025-11-03T11:13:57.165000Z",
"url": "/files/1418edf5dc3ebb2b0cb0451925541cba64eaa9896b6d8f6ae46f1372b99ac595/annotations/8171d2fe-e433-4ddc-a525-56aef19734c2"
}
]
}
}
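When processing a retrieved File Record, it can help to separate stubs from full annotations. The sketch below is a hypothetical helper working on a plain dictionary. It assumes, based only on the extract above, that stubs arrive as arrays while full annotations such as file/base are objects. It collects each stub's id and relative url, and makes no request to DorsalHub.

```python
def collect_stubs(file_record: dict) -> list:
    """Return (schema_name, annotation_id, url) for every Annotation Stub."""
    stubs = []
    for schema_name, value in file_record.get("annotations", {}).items():
        # Assumption: stubs are delivered as a list, full annotations as a dict.
        if isinstance(value, list):
            for stub in value:
                stubs.append((schema_name, stub["id"], stub["url"]))
    return stubs


# Abridged from the extract above.
retrieved = {
    "annotations": {
        "file/base": {"record": {"name": "PDFSPEC.pdf"}, "source": {"type": "Model"}},
        "open/classification": [
            {
                "id": "8171d2fe-e433-4ddc-a525-56aef19734c2",
                "source": {"type": "Manual", "id": "Language Detection Model v0.1"},
                "user_no": 1000004,
                "url": "/files/1418edf5dc3ebb2b0cb0451925541cba64eaa9896b6d8f6ae46f1372b99ac595/annotations/8171d2fe-e433-4ddc-a525-56aef19734c2",
            }
        ],
    }
}

for schema_name, annotation_id, url in collect_stubs(retrieved):
    print(schema_name, annotation_id, url)
```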
3. Tags
Tags are simple key-value labels you can attach to any File Record.
Tags can be used for quick labeling, filtering, and search.
On the File Record, the tags field is an array of individual tag objects. Each tag contains the following fields:
| Field | Example | Description |
|---|---|---|
| id | "69035139793ae72b07f05380" | A unique ID assigned by DorsalHub, used to manage (e.g., delete) the tag. Locally-added tags won't have an id. |
| name | "language" | The name of the tag. For private tags this can be anything, but for public tags it must be one of the Supported Tags. |
| value | "English" | The tag's value. This can be text, a number or a boolean (True/False). |
| private | true | When true, the tag is only visible to you. false means it's visible to anyone with access to the File Record. |
| hidden | false | This field is currently unused, but may be used in future to hide tags with negative vote aggregates. |
| upvotes | 1 | If two users add a public tag with the same name and value, the second user's tag "upvotes" the existing tag rather than creating a new one. |
| downvotes | 0 | This field is currently unused, but may be connected to a "downvote" button on the Web in future. |
| origin | "DorsalHub" | This value is different for tags you add on local files vs. ones you retrieve from the DorsalHub API. |
{
"hash": "...",
"annotations": { ... },
"tags": [
{
"id": "69035139793ae72b07f05380",
"name": "language",
"value": "English",
"private": true,
"hidden": false,
"upvotes": 0,
"downvotes": 0,
"origin": "DorsalHub"
},
{
"id": "6903b66b3dd03157960df671",
"name": "project_id",
"value": 12345,
"private": true,
"hidden": false,
"upvotes": 0,
"downvotes": 0,
"origin": "dorsal.LocalFile"
}
]
...
}
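Since tags is a plain array of objects, filtering it takes little code. The sketch below is a hypothetical helper (find_tags is not part of Dorsal) that selects tags by name and visibility from a File Record held as a dictionary, using abridged data from the example above.

```python
def find_tags(file_record: dict, name=None, private=None) -> list:
    """Return the tags matching the given name and/or private flag."""
    matches = []
    for tag in file_record.get("tags", []):
        # A filter set to None is ignored, so find_tags(record) returns every tag.
        if name is not None and tag["name"] != name:
            continue
        if private is not None and tag["private"] != private:
            continue
        matches.append(tag)
    return matches


# Abridged from the example above.
tagged_record = {
    "tags": [
        {"name": "language", "value": "English", "private": True, "origin": "DorsalHub"},
        {"name": "project_id", "value": 12345, "private": True, "origin": "dorsal.LocalFile"},
    ]
}

for tag in find_tags(tagged_record, name="language"):
    print(tag["name"], "=", tag["value"])
```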
Public vs. Private File Records
The metadata for a single file, identified by its SHA-256 hash, can exist in two distinct states on DorsalHub: as a Private File Record or a Public File Record. A user can maintain their own private record for a file that also has a public record.
Regardless of which record type is retrieved (public or private), the annotations and tags shown to the user are always a combination of:
1. All public metadata.
2. All of that user's own private metadata.
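The combination rule can be sketched as a simple filter. This is an illustration only, not DorsalHub's server logic: visible_items is a hypothetical name, and it assumes each tag or annotation carries a private flag and the user_no of its owner. The Annotation Stub example shows a user_no field, but its presence on every item is an assumption.

```python
def visible_items(items: list, viewer_user_no: int) -> list:
    """Keep all public items plus the viewer's own private items."""
    return [
        item
        for item in items
        if not item["private"] or item["user_no"] == viewer_user_no
    ]


# Hypothetical metadata items owned by two different users.
items = [
    {"name": "language", "private": False, "user_no": 1000004},
    {"name": "project_id", "private": True, "user_no": 1000004},
    {"name": "review_status", "private": True, "user_no": 1000009},
]

print([item["name"] for item in visible_items(items, viewer_user_no=1000004)])
```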
For more information on this distinction see: Public vs. Private Records