The Quick Hash

The Quick Hash is a specialized, sample-based hash function designed for rapidly identifying files.

While a standard cryptographic hash (such as SHA-256) must read every byte of a file, a Quick Hash is computed from only a small, deterministic sample of the file's content.

Because file hashing is generally I/O-bound, generating a Quick Hash for a large file is usually much faster than calculating a full secure hash.

How It Works

The Quick Hash is generated through a multi-step sampling process designed to be both fast and deterministic (meaning the same file will always produce the same Quick Hash).

File Size Validation: The process only runs on files within a specific size range. By default, the file must be at least 32 MiB and no larger than 1 PiB.
Deterministic Seeding: The sampling process is seeded with a number generated from the file's exact size in bytes. Identical files necessarily have the same size, so the exact same sample locations are chosen for them every time.
Chunk Sampling: The hasher selects a number of 1 MiB chunks to read. This number scales with the file's size, from a minimum of 8 chunks up to a maximum of 1,024.
Final Hash: The hasher seeks directly to each sample location, reads the 1 MiB chunk of data, and feeds it into a SHA-256 hash instance. The final Quick Hash is the digest of all combined samples.
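The four steps above can be sketched in Python as follows. This is a minimal illustration, not Dorsal's implementation: the text does not specify how the seed is derived from the size, how the chunk count scales, or how offsets are drawn, so the choices below (seeding with the size itself, one chunk per 32 MiB, uniformly random offsets, returning None for out-of-range files) are assumptions, and the function name sketch_quick_hash is hypothetical. Its output should not be expected to match a real Quick Hash.

```python
from __future__ import annotations

import hashlib
import os
import random

MIB = 1024 * 1024
MIN_SIZE = 32 * MIB           # default lower bound stated in the text
MAX_SIZE = 1024 ** 5          # 1 PiB, default upper bound stated in the text
CHUNK_SIZE = 1 * MIB          # each sample is one 1 MiB chunk
MIN_CHUNKS, MAX_CHUNKS = 8, 1024


def chunk_count(size: int) -> int:
    """ASSUMPTION: one chunk per 32 MiB of file, clamped to [8, 1024]."""
    return max(MIN_CHUNKS, min(MAX_CHUNKS, size // (32 * MIB)))


def sample_offsets(size: int) -> list[int]:
    """Draw sorted chunk offsets from a PRNG seeded with the file size."""
    rng = random.Random(size)  # ASSUMPTION: the size itself is the seed
    last_start = size - CHUNK_SIZE
    # Offsets may overlap in this sketch; the real sampler may avoid that.
    return sorted(rng.randint(0, last_start) for _ in range(chunk_count(size)))


def sketch_quick_hash(path: str) -> str | None:
    """Hash a size-seeded sample of 1 MiB chunks with SHA-256."""
    size = os.path.getsize(path)
    if not MIN_SIZE <= size <= MAX_SIZE:
        return None  # ASSUMPTION: out-of-range files yield no hash
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for offset in sample_offsets(size):
            f.seek(offset)
            digest.update(f.read(CHUNK_SIZE))
    return digest.hexdigest()
```

Whatever the exact scaling rule, the stated cap of 1,024 chunks of 1 MiB means at most 1 GiB is read per file, regardless of how large the file is.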

Use Cases and Limitations

When to Use a Quick Hash

Finding Duplicates: For a fast "first pass" on a massive directory to identify potential duplicate files.
Large File Triage: Quickly identifying large video files, disk images, or scientific datasets where a full hash would be too slow.
As an Identifier: It is available as the quick_hash field in the Dorsal File Record.
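A "first pass" duplicate search might look like the sketch below: group files by Quick Hash, then confirm each candidate group with a full SHA-256. The helper names find_duplicates and full_sha256 are hypothetical, and the quick hash function is passed in as a parameter so the example does not depend on any particular library behavior. Treating a None result as "no Quick Hash available" is an assumption.

```python
import hashlib
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional


def full_sha256(path: str) -> str:
    """Hash every byte of the file, streaming 1 MiB at a time."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(block)
    return digest.hexdigest()


def find_duplicates(
    paths: Iterable[str],
    quick_hash_fn: Callable[[str], Optional[str]],
) -> List[List[str]]:
    """Return groups of paths whose full contents are identical."""
    # Pass 1: cheap grouping by quick hash.
    candidates: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        quick = quick_hash_fn(path)
        if quick is not None:  # ASSUMPTION: None means "no quick hash"
            candidates[quick].append(path)

    # Pass 2: confirm each candidate group with a full hash, because
    # a matching quick hash alone does not prove the files are equal.
    confirmed: List[List[str]] = []
    for group in candidates.values():
        if len(group) < 2:
            continue
        by_full: Dict[str, List[str]] = defaultdict(list)
        for path in group:
            by_full[full_sha256(path)].append(path)
        confirmed.extend(g for g in by_full.values() if len(g) > 1)
    return confirmed
```

With Dorsal installed, quick_hash_fn could presumably wrap get_quick_hash(file_path=...). Only files that share a Quick Hash are ever read in full, which is where the time saving comes from.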

Important Limitations

The Quick Hash should not be used in place of a secure hash.

Risk of Collisions: Because it does not read the entire file, it is possible for two different files to have the same Quick Hash. This makes it unsuitable as a globally unique identifier.
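The collision risk can be shown with a toy, in-memory version of sampled hashing. The chunk size, chunk count, and seeding scheme below are scaled-down assumptions chosen for illustration, and the function names are hypothetical. Two byte strings of the same length that differ only in a byte no sample covers produce the same sampled digest, while their full SHA-256 digests differ.

```python
import hashlib
import random
from typing import List


def toy_offsets(length: int, chunk_size: int = 4, chunks: int = 3) -> List[int]:
    """Size-seeded sample offsets: the same length gives the same offsets."""
    rng = random.Random(length)
    return sorted(rng.randint(0, length - chunk_size) for _ in range(chunks))


def sampled_digest(data: bytes, chunk_size: int = 4) -> str:
    """Toy sampled hash: SHA-256 over only the sampled chunks of `data`."""
    digest = hashlib.sha256()
    for offset in toy_offsets(len(data), chunk_size):
        digest.update(data[offset:offset + chunk_size])
    return digest.hexdigest()


def unsampled_index(length: int, chunk_size: int = 4) -> int:
    """Return the first byte position that no sample covers."""
    covered = {
        i
        for offset in toy_offsets(length, chunk_size)
        for i in range(offset, offset + chunk_size)
    }
    return next(i for i in range(length) if i not in covered)


original = bytes(64)                      # 64 zero bytes
mutable = bytearray(original)
mutable[unsampled_index(len(original))] ^= 0xFF  # flip one unsampled byte
altered = bytes(mutable)

# The files differ, the sampled digests match, the full digests do not.
assert original != altered
assert sampled_digest(original) == sampled_digest(altered)
assert hashlib.sha256(original).digest() != hashlib.sha256(altered).digest()
```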
Not for Similarity: The Quick Hash is not a similarity hash. For finding similar files, Dorsal supports the TLSH hash (available via dorsal file hash --tlsh).

Generating a Quick Hash

CLI

You can generate a Quick Hash for a single file using the dorsal file hash command with the --quick flag.

For a large file (at least 32 MiB):

dorsal file hash /path/to/large-video-file.mkv --quick

Python

You can generate a Quick Hash for a single file by using the get_quick_hash function:

from dorsal.file import get_quick_hash

quick_hash = get_quick_hash(file_path="/path/to/large-video-file.mkv")

print(quick_hash)

The output is a 64-character hexadecimal string, for example:

c3d0b67d8f1e5a8d3e9c1f2a3b4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2d
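The length follows from SHA-256 itself: the digest is 32 bytes, which is 64 characters when written in hexadecimal. A small format check, useful before storing or comparing values, could look like this. The helper name looks_like_quick_hash is hypothetical, and requiring lowercase is an assumption based on the example above.

```python
import hashlib
import re


def looks_like_quick_hash(value: str) -> bool:
    """True if `value` is exactly 64 lowercase hexadecimal characters."""
    # ASSUMPTION: lowercase hex, as in the example shown in this document.
    return re.fullmatch(r"[0-9a-f]{64}", value) is not None


example = "c3d0b67d8f1e5a8d3e9c1f2a3b4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2d"

# A SHA-256 digest is 32 bytes, so its hex form has 64 characters.
assert hashlib.sha256().digest_size * 2 == 64
assert looks_like_quick_hash(example)
assert not looks_like_quick_hash("not-a-hash")
```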