Local File Record Cache

Dorsal uses a local SQLite database to cache File Records.

This means scanning a file will be faster the second time, as the record is retrieved from the cache.

The cache is a single file, cache.db, located in the .dorsal configuration directory in your home directory (e.g. /home/yourname/.dorsal/cache.db).

How it Works

  • When you scan a file, Dorsal checks the cache using a composite key of:

    1. File Path (Absolute path)
    2. Date Modified (mtime from the filesystem)
  • If a record matches both, the cached result is returned.

  • If the file has been modified (different mtime) or moved (different path), Dorsal treats it as a new file, runs the full extraction pipeline, and then writes the result back to the cache.

  • Subsequent scans of the same file will then retrieve the record from the cache.
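The lookup described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not Dorsal's actual schema or code: the `records` table, the `open_cache` and `scan` helpers, and the `extract` callback are all hypothetical names chosen for the example.

```python
import json
import os
import sqlite3


def open_cache(db_path=":memory:"):
    """Open (or create) a cache database with a hypothetical schema."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        " path TEXT NOT NULL,"
        " mtime_ns INTEGER NOT NULL,"
        " record TEXT NOT NULL,"
        " PRIMARY KEY (path, mtime_ns))"
    )
    return conn


def scan(conn, file_path, extract):
    """Return (record, was_cached) for a file, keyed on absolute path + mtime."""
    path = os.path.abspath(file_path)
    mtime_ns = os.stat(path).st_mtime_ns  # integer nanoseconds, compares exactly

    row = conn.execute(
        "SELECT record FROM records WHERE path = ? AND mtime_ns = ?",
        (path, mtime_ns),
    ).fetchone()
    if row is not None:
        # Both parts of the composite key matched: return the cached record.
        return json.loads(row[0]), True

    # Cache miss (new, moved, or modified file): run the extraction and store it.
    record = extract(path)
    conn.execute(
        "INSERT OR REPLACE INTO records (path, mtime_ns, record) VALUES (?, ?, ?)",
        (path, mtime_ns, json.dumps(record)),
    )
    conn.commit()
    return record, False
```

Because the key includes the modification time, editing a file changes the key and the old row simply stops matching; nothing has to be invalidated explicitly.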

Bypassing or Overwriting the Cache

Sometimes you want to force a re-scan, for example, if you have updated an Annotation Model or want to debug an extractor.

When you force a re-scan, you can choose either to skip the cache completely or to write the new result back to it:

  • Use the --skip-cache flag to bypass the cache completely.

  • This forces a scan and doesn't touch the cache:

dorsal file scan "./documents/mydocument.pdf" --skip-cache

  • Use the --overwrite-cache flag to both force a scan (skip reading the cache) and write the result back to the cache.

  • This is a full refresh for that file record in the cache.

dorsal file scan "./documents/mydocument.pdf" --overwrite-cache

  • Set use_cache=False in supported classes or functions to bypass the cache completely.

  • This forces a scan, and doesn't touch the cache:

from dorsal import LocalFile

# Forces a full scan, even if the record is in the cache
lf = LocalFile("./documents/mydocument.pdf", use_cache=False)

  • Set overwrite_cache=True in supported classes or functions to both force a scan (skip reading the cache) and write the result back to the cache.

  • This is a full refresh for that file record in the cache.

from dorsal import LocalFile

# Forces a full scan and writes the fresh record back to the cache
lf = LocalFile("./documents/mydocument.pdf", overwrite_cache=True)
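The difference between a normal scan, a skipped cache, and an overwritten cache can be sketched with a plain dictionary standing in for the cache. This is an illustration of the semantics described above, not Dorsal's implementation; `scan_with_cache` and `extract` are hypothetical names, and the behavior when both options are combined is an assumption.

```python
def scan_with_cache(cache, key, extract, use_cache=True, overwrite_cache=False):
    """Return (record, source), where source is "cache" or "scan"."""
    # Assumption: overwrite_cache=True skips the read but still writes,
    # even if use_cache=False is passed at the same time.
    may_read = use_cache and not overwrite_cache
    may_write = use_cache or overwrite_cache

    if may_read and key in cache:
        return cache[key], "cache"

    record = extract(key)  # full scan
    if may_write:
        cache[key] = record  # write the fresh result back
    return record, "scan"
```

With a stale entry already in `cache`, the default call returns the stale record, `use_cache=False` returns a fresh record and leaves the stale entry in place, and `overwrite_cache=True` returns a fresh record and replaces the entry.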

Configuration

You can control the cache behavior globally using the Dorsal configuration file or by setting environment variables.

Precedence Order:

  1. Runtime Arguments (--skip-cache, use_cache=False)
  2. Environment Variables
  3. Configuration File (dorsal.toml)
  4. Defaults (enabled=true)
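One way to picture this precedence is a resolver that checks each source in turn and stops at the first one that supplies a value. The sketch below is illustrative only: `resolve_cache_enabled` is a hypothetical helper, the `config` argument is assumed to be the already-parsed contents of dorsal.toml, and treating every value other than "false" or "0" as true is an assumption.

```python
import os


def resolve_cache_enabled(runtime=None, environ=None, config=None):
    """Resolve the effective 'cache enabled' setting, highest priority first."""
    # 1. Runtime argument (e.g. use_cache=False) wins when given.
    if runtime is not None:
        return runtime

    # 2. Environment variable.
    environ = os.environ if environ is None else environ
    raw = environ.get("DORSAL_CACHE_ENABLED")
    if raw is not None:
        # Assumption: only "false" and "0" (case-insensitive) disable the cache.
        return raw.strip().lower() not in ("false", "0")

    # 3. Configuration file, passed in as a parsed dict.
    section = (config or {}).get("cache", {})
    if "enabled" in section:
        return bool(section["enabled"])

    # 4. Default.
    return True
```

For example, `resolve_cache_enabled(environ={"DORSAL_CACHE_ENABLED": "0"}, config={"cache": {"enabled": True}})` returns `False`, because the environment variable outranks the config file.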

Environment Variables

| Variable | Type | Default | Description |
|---|---|---|---|
| DORSAL_CACHE_ENABLED | bool | true | Set to false or 0 to disable all reading/writing to the cache. |
| DORSAL_CACHE_COMPRESSION | bool | true | Set to false to store uncompressed JSON instead of zlib-compressed blobs. |
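To make the compression setting concrete, here is a rough sketch of the two storage forms using Python's standard json and zlib modules. The `encode_record` and `decode_record` helpers are hypothetical, and the exact serialization Dorsal uses may differ.

```python
import json
import zlib


def encode_record(record, compression=True):
    """Serialize a record to bytes, optionally zlib-compressed."""
    raw = json.dumps(record, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw) if compression else raw


def decode_record(blob, compression=True):
    """Reverse encode_record; the flag must match how the blob was written."""
    raw = zlib.decompress(blob) if compression else blob
    return json.loads(raw.decode("utf-8"))
```

Compressed blobs are usually smaller for repetitive records, while uncompressed JSON stays readable when the database is inspected directly.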

Config File

The same settings can be placed in your dorsal.toml config file:

[cache]
enabled = true
compression = true

Managing the Cache

To clear, prune, or build the cache, use the CLI tools.