Welcome to the DorsalHub Documentation!

DorsalHub is your file metadata platform.

DorsalHub makes it easy to securely search and manage file metadata records from anywhere.

Built with developers in mind, DorsalHub is powered by Dorsal - a local-first metadata generation toolkit.

The Quick Start guide below will get you started with Dorsal.


Dorsal

What is Dorsal?

Dorsal is a Python library and command line tool for generating, validating, and managing structured file metadata.

Dorsal is:

  • Local First: Metadata extraction happens locally on your machine, not in the cloud.
  • Strictly Validated: All metadata records are checked against strict JSON Schemas and Pydantic models.
  • Extensible: Support your own file types and metadata annotation needs by integrating your own models.

Quick Start

This guide covers:

  1. Installing Dorsal
  2. Scanning your first file
  3. Authenticating with DorsalHub (Optional)
  4. Pushing metadata to DorsalHub (Requires Authentication)

Install Dorsal

Dorsal is available on PyPI as dorsalhub.

pip install dorsalhub

Use a Virtual Environment with pip

To avoid conflicts with other system packages, we recommend installing Dorsal in a fresh virtual environment.

1. Create the environment

python3 -m venv venv

2. Activate the environment

macOS / Linux:

source venv/bin/activate

Windows:

.\venv\Scripts\activate

Alternatively, install with uv, a fast Python package installer and resolver:

uv pip install dorsalhub

Once installation is complete, verify the install by running dorsal --version:

dorsal --version

Your output should resemble this, showing you the version of Dorsal which is installed:

Dorsal Version 0.3.0

Scan a File

At its heart, Dorsal is a toolkit for creating and managing structured metadata records from your files. It ships with offline metadata extractors for a number of file types, including PDFs, Office documents, video and audio files.

The quickest way to get started is in the terminal where you just installed Dorsal.

  1. Locate a file you'd like to scan, and copy its path.

  2. Use the dorsal file scan command with the path to that file:

    dorsal file scan "docs/PDFSPEC.pdf"
    

    When the scan completes, you should see something similar to this:

    📄 Scanning metadata for PDFSPEC.pdf
    ╭───────────────────────────────── File Record: PDFSPEC.pdf ─────────────────────────────────╮
    │                                                                                            │
    │    Hashes                                                                                  │
    │       SHA-256:  3383fb2ab568ca7019834d438f9a14b9d2ccaa2f37f319373848350005779368           │
    │        BLAKE3:  9abdfb32750a278d5ca550b876e94a72cd8eec82d0e506a127dfb94bd56ca4b2           │
    │          TLSH:  T13465D67BB4C61D6DF893CA46571C579B8B0D71533BAEA58604BDAF0AC6338029AC3F41   │
    │                                                                                            │
    │    File Info                                                                               │
    │     Full Path:  /dev/test/docs/PDFSPEC.pdf                                                 │
    │      Modified:  2025-04-09 15:09:05                                                        │
    │          Name:  PDFSPEC.pdf                                                                │
    │          Size:  1 MiB                                                                      │
    │    Media Type:  application/pdf                                                            │
    │                                                                                            │
    │    Tags                                                                                    │
    │        No tags found.                                                                      │
    │                                                                                            │
    │    Pdf Info                                                                                │
    │            author:  Tim Bienz, Richard Cohn, James R. Meehan                               │
    │             title:  Portable Document Format Reference Manual (v 1.2)                      │
    │           creator:  FrameMaker 5.1.1                                                       │
    │          producer:  Acrobat Distiller 3.0 for Power Macintosh                              │
    │           subject:  Description of the PDF file format                                     │
    │          keywords:  Acrobat PDF                                                            │
    │           version:  1.2                                                                    │
    │        page_count:  394                                                                    │
    │     creation_date:  1996-11-12T03:08:43                                                    │
    │     modified_date:  1996-11-12T07:58:15                                                    │
    │                                                                                            │
    │                                                                                            │
    ╰────────────────────────────────────────────────────────────────────────────────────────────╯
    

    This panel shows the core metadata fields for this record.

  3. You can export the record to JSON straight from the CLI by adding the --json flag:

    dorsal file scan "docs/PDFSPEC.pdf" --json
    

    This outputs the JSON to stdout, so you can redirect it to a file or pipe it to other tools:

    dorsal file scan "docs/PDFSPEC.pdf" --json > "example.json"
    

    The JSON output is a fully-validated File Record 👇

    Example File Record: PDFSPEC.pdf

    {
      "hash": "3383fb2ab568ca7019834d438f9a14b9d2ccaa2f37f319373848350005779368",
      "validation_hash": "9abdfb32750a278d5ca550b876e94a72cd8eec82d0e506a127dfb94bd56ca4b2",
      "annotations": {
        "file/base": {
          "record": {
            "hash": "3383fb2ab568ca7019834d438f9a14b9d2ccaa2f37f319373848350005779368",
            "name": "PDFSPEC.pdf",
            "extension": ".pdf",
            "size": 1512313,
            "media_type": "application/pdf",
            "media_type_prefix": "application"
          },
          "source": {
            "type": "Model",
            "model": "dorsal/base",
            "version": "1.0.0"
          }
        },
        "file/pdf": {
          "record": {
            "author": "Tim Bienz, Richard Cohn, James R. Meehan",
            "title": "Portable Document Format Reference Manual (v 1.2)",
            "creator": "FrameMaker 5.1.1",
            "producer": "Acrobat Distiller 3.0 for Power Macintosh",
            "subject": "Description of the PDF file format",
            "keywords": "Acrobat PDF",
            "version": "1.2",
            "page_count": 394,
            "creation_date": "1996-11-12T03:08:43",
            "modified_date": "1996-11-12T07:58:15"
          },
          "private": true,
          "source": {
            "type": "Model",
            "model": "dorsal/pdf",
            "version": "1.0.0",
            "variant": "pypdfium2"
          }
        }
      },
      "tags": [],
      "source": "disk",
      "local_attributes": {
        "date_modified": "2025-04-09 15:09:05.533199+01:00",
        "date_accessed": "2025-11-28 10:37:08.225267+00:00",
        "date_created": "2025-07-17 11:07:52.875623+01:00",
        "file_path": "/dev/test/docs/PDFSPEC.pdf",
        "file_size_bytes": 1512313,
        "file_permissions_mode": 33279,
        "inode": 3940649675394997,
        "number_of_links": 1
      },
      "local_filesystem": {
        "full_path": "/dev/test/docs/PDFSPEC.pdf",
        "date_created": "2025-07-17T11:07:52.875623+01:00",
        "date_modified": "2025-04-09T15:09:05.533199+01:00"
      }
    }
    
    Notice that the file/pdf key under annotations stores a separate object housing the PDF-specific fields.

    For more information on the dorsal file commands, see the full CLI Guide: Files
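
Because the --json output is plain JSON, any JSON parser can read it back. The sketch below uses only Python's standard library and an abridged copy of the example record above. The variable names are illustrative, and in practice you would load the text from a file such as example.json rather than from an inline string.

```python
import json

# Abridged copy of the example File Record shown above.
record_text = """
{
  "hash": "3383fb2ab568ca7019834d438f9a14b9d2ccaa2f37f319373848350005779368",
  "annotations": {
    "file/base": {
      "record": {"name": "PDFSPEC.pdf", "size": 1512313, "media_type": "application/pdf"}
    },
    "file/pdf": {
      "record": {"title": "Portable Document Format Reference Manual (v 1.2)", "page_count": 394}
    }
  },
  "tags": []
}
"""

record = json.loads(record_text)

# Each annotation sits under its own key and keeps its fields in "record".
base = record["annotations"]["file/base"]["record"]

# Not every file is a PDF, so read the file/pdf annotation defensively.
pdf = record["annotations"].get("file/pdf", {}).get("record", {})

print(base["name"], base["size"])
print(pdf.get("title"), pdf.get("page_count"))
```

Using .get for the file/pdf annotation keeps the same code usable for records of other file types, where that key would presumably be absent.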

  1. You can also scan files from Python. LocalFile is a Python class which you can use to create, update and manage a metadata record for a single file.

    To scan a file, create a new LocalFile instance with the file path:

    from dorsal import LocalFile
    
    lf = LocalFile("C:/examples/PDFSPEC.pdf")
    
  2. LocalFile exposes some base metadata fields as top-level attributes:

    • hash: The file's SHA-256 hash e.g. "3383fb2ab568ca7019834d438f9a14b9d2ccaa2f37f319373848350005779368"
    • name: The file's name e.g. "PDFSPEC.pdf"
    • extension: The file's extension e.g. ".pdf"
    • size: The file's size in bytes e.g. 1512313
    • size_text: The file's size in human-readable text e.g. "1 MiB"
    • media_type: The media type e.g. "application/pdf"

    You can access these fields as attributes on the LocalFile instance:

    Accessing attributes on a LocalFile instance
    from dorsal import LocalFile
    
    lf = LocalFile("C:/examples/PDFSPEC.pdf")
    
    # Accessing base file properties
    print(f"File Name: {lf.name}")
    print(f"File Size: {lf.size_text}")
    print(f"Media Type: {lf.media_type}")
    print(f"SHA-256: {lf.hash}")
    

    Output:

    File Name: PDFSPEC.pdf
    File Size: 1 MiB
    Media Type: application/pdf
    SHA-256: 3383fb2ab568ca7019834d438f9a14b9d2ccaa2f37f319373848350005779368
    
  3. You can visualize the entire record by calling the to_dict or to_json methods:

    Display the full file record in Python
    from dorsal import LocalFile
    
    lf = LocalFile("C:/examples/PDFSPEC.pdf")
    
    print(lf.to_json())
    

    The JSON printout is the same fully-validated File Record shown in the CLI example above, including the file/pdf key under annotations that houses the PDF-specific fields.

    For more information on the Python API and the LocalFile class, see the Python API Docs
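
The hash field is described above as the file's SHA-256 hash, so it can be cross-checked without Dorsal. The following is a minimal sketch using only the standard library. It assumes the hash is computed over the file's raw bytes; sha256_of_file is a hypothetical helper, not part of Dorsal.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files are never loaded whole.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_bytes(b"hello")
    sample_hash = sha256_of_file(sample)

print(sample_hash)
```

To check a real record, compare the result of sha256_of_file for your file's path against lf.hash, or against the "hash" value in the exported JSON.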


Authenticate (Optional)

While Dorsal is a capable offline tool, connecting it to DorsalHub unlocks its full potential.

  1. To authenticate, first generate an API Key on DorsalHub.

  2. Authenticate Dorsal:

    There are two ways to authenticate:

    • Use the dorsal auth login command in your terminal
    • Set the DORSAL_API_KEY environment variable

    To use the dorsal auth login command:

    • Run dorsal auth login:

      dorsal auth login
      
    • Paste your API key when prompted:

      API Key: ***********************
      🔑 Verifying key with DorsalHub...
      
      ╭──────────  Login Successful ─────────────╮
      │                                          │
      │  User ID:        1230321                 │
      │  Name:           yourname                │
      │  Email:          your.email@example.com  │
      │  Account Status: Member                  │
      │                                          │
      ╰──────────────────────────────────────────╯
      
    • Dorsal is now authenticated in both the Python API and Command Line Interface.

    • Your API Key is stored in Dorsal's global configuration file (e.g. /home/user/.dorsal/dorsal.toml).

    • For more information on the dorsal auth commands, see the CLI Guide: Authentication

    To use the environment variable instead, set DORSAL_API_KEY to your API Key.

    The command for setting environment variables varies by operating system and shell.

    • macOS / Linux:

      export DORSAL_API_KEY="YourAPIKey"
      
    • Windows (PowerShell):

      $env:DORSAL_API_KEY="YourAPIKey"
      
    • Windows (Command Prompt):

      set DORSAL_API_KEY=YourAPIKey
      

    Setting the DORSAL_API_KEY environment variable will authenticate Dorsal within the current terminal session.

    Once the environment variable is set, you can confirm you are logged in by running the dorsal auth whoami command:

    dorsal auth whoami
    

    This command prints your current logged-in status:

    Verifying session with DorsalHub...
    
    ╭───────────  Authenticated User ──────────╮
    │                                          │
    │  User ID:        1230321                 │
    │  Name:           yourname                │
    │  Email:          your.email@example.com  │
    │  Account Status: Member                  │
    │                                          │
    ╰──────────────────────────────────────────╯
    

    This confirms that Dorsal is now authenticated.

    Session Only

    Setting an environment variable this way authenticates Dorsal for your current terminal session only. You will need to set it again if you open a new terminal.

    For a persistent login, use dorsal auth login to save the API Key to the global config file.

    API Key Safety

    Your API Key is a secret credential, just like a password. Store it safely and never share it.

    For more information about API Keys, including safety tips, see API Keys.
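
When a script depends on the DORSAL_API_KEY environment variable, it can help to check for it up front without ever printing the key. A minimal sketch follows; api_key_is_set is a hypothetical helper, not part of Dorsal, and it only inspects the environment.

```python
import os


def api_key_is_set(environ=None):
    """Return True when DORSAL_API_KEY is present and not blank."""
    if environ is None:
        environ = os.environ
    return bool(environ.get("DORSAL_API_KEY", "").strip())


# Report the status only; the key itself is never echoed.
if api_key_is_set():
    print("DORSAL_API_KEY is set for this process.")
else:
    print("DORSAL_API_KEY is not set; run dorsal auth login or set the variable.")
```

A key saved with dorsal auth login lives in the global configuration file, so this check will report it as missing; dorsal auth whoami remains the way to confirm the actual login status.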


Push a Record to DorsalHub

Note

This step requires API-Key authentication.

  1. Use the dorsal file push command to create and securely publish a structured metadata record to DorsalHub.

    dorsal file push "docs/PDFSPEC.pdf"
    

    When the push completes, you should see something like:

    📡 Preparing to push metadata for PDFSPEC.pdf as a private record...
    ╭─────────────────────────────  Push Complete ──────────────────────────────────╮
    │ The file record was successfully pushed to DorsalHub.                         │
    │                                                                               │
    │ SHA256 Hash: 3383fb2ab568ca7019834d438f9a14b9d2ccaa2f37f319373848350005779368 │
    ╰───────────────────────────────────────────────────────────────────────────────╯
    

    Note that while the metadata record is pushed to DorsalHub, the file itself never leaves your machine.

    DorsalHub is Private by Default

    When you run dorsal file push "docs/PDFSPEC.pdf", you are telling the server to create a private record about that file.

    Private records are only visible to you.

    To make a public record, you should add the --public argument to the command:

    dorsal file push "docs/PDFSPEC.pdf" --public
    
  2. View it Online

    Head over to your DorsalHub Dashboard to see the newly indexed file and its extracted metadata.

  1. From Python, the LocalFile.push method uploads the entire File Record, LocalFile.model (as a validated FileRecordStrict object), to the DorsalHub API.

    Indexing to DorsalHub
    from dorsal import LocalFile
    
    lf = LocalFile("C:/examples/PDFSPEC.pdf")
    response = lf.push()
    

    Note that while the metadata record is pushed to DorsalHub, the file itself never leaves your machine.

    DorsalHub is Private by Default

    When you call LocalFile.push, you are telling the server to create a private record about that file.

    Private records are only visible to you.

    To make a public record, you should add the public=True argument to the push method:

    Indexing publicly to DorsalHub
    from dorsal import LocalFile
    
    lf = LocalFile("C:/examples/PDFSPEC.pdf")
    lf.push(public=True)
    
  2. View it Online

    Head over to your DorsalHub Dashboard to see the newly indexed file and its extracted metadata.
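
To push records for several files, the same two calls can be wrapped in a loop. The sketch below is one possible shape, not part of Dorsal: push_many and its file_factory parameter are hypothetical names, and because the exceptions Dorsal may raise are not documented here, failures are caught broadly and collected.

```python
def push_many(paths, file_factory=None, public=False):
    """Push a record for each path and return (pushed, failed) lists."""
    if file_factory is None:
        # Imported lazily so the helper can be exercised without Dorsal installed.
        from dorsal import LocalFile
        file_factory = LocalFile

    pushed, failed = [], []
    for path in paths:
        try:
            lf = file_factory(path)
            # Only the two call forms shown above are used.
            if public:
                lf.push(public=True)
            else:
                lf.push()
            pushed.append(path)
        except Exception as exc:  # Dorsal's exception types are not assumed here
            failed.append((path, exc))
    return pushed, failed
```

Calling push_many(["docs/PDFSPEC.pdf"]) would then create private records by default, matching the behaviour described above, while public=True opts in to public records for the whole batch.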


Custom Extractors

Dorsal is for more than just core file metadata. You can create custom Annotation Models in Python to extract specific data from your files.

Annotation Models can be as simple or as complex as you like. For example, a simple model to count words in a text file:

Simple Word Counting Annotation Model
from dorsal import AnnotationModel
from dorsal.file.helpers import build_generic_record

class WordCount(AnnotationModel):
    def main(self):
        # Count whitespace-separated words in the file being scanned.
        with open(self.file_path, 'r') as f:
            count = len(f.read().split())

        # Wrap the count in a generic annotation record.
        return build_generic_record(
            description="Word Count",
            data={"count": count}
        )

This model can be registered to your local Model Pipeline to run automatically every time you scan a text file.
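
The counting step in WordCount.main relies on str.split() with no arguments, which splits on any run of whitespace, including newlines. The standalone sketch below isolates that logic so it can be tried without Dorsal; count_words is a hypothetical helper, and the explicit UTF-8 encoding is an assumption about the input file.

```python
import tempfile
from pathlib import Path


def count_words(path):
    """Count whitespace-separated words, mirroring WordCount.main above."""
    with open(path, "r", encoding="utf-8") as f:
        return len(f.read().split())


# Demonstration on a throwaway file with irregular spacing and newlines.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("Dorsal is\n local-first   metadata tooling\n", encoding="utf-8")
    word_count = count_words(sample)

print(word_count)  # → 5
```

Note that hyphenated terms such as "local-first" count as a single word under this definition.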

Check out the tutorial: Introduction to Annotation Models


Next Steps

You've indexed your first file! Here's where to go next:

  • ⌨️ Learn the CLI


    Learn how to manage files, add tags, create collections, and more, directly from your terminal.

    See the CLI Guide

  • 🐍 Learn the Python API


    Integrate Dorsal into your applications for custom metadata workflows, analysis, and automation.

    See the Python Guide

  • 🖥️ Explore DorsalHub


    Get oriented with the DorsalHub website. View and organize your indexed files from your dashboard.

    Go to the main website

  • 🧑‍💻 I want to contribute...


    Dorsal is open source, and provided under the Apache 2.0 license. Report an issue, or suggest new features on our GitHub repository.

    View Source on GitHub