Validation Schemas

A Validation Schema is the formal data specification that defines the structure for an annotation.

When you add an annotation to a file (for example, by calling LocalFile.add_private_annotation()), you must provide a schema_id (e.g., open/generic). DorsalHub uses this schema_id to look up the named validation schema and validate your data against it.

Annotation validation is optional (but recommended) for offline work. To push annotations to DorsalHub, however, they must conform to a named schema.

If your annotation does not match the rules in the specification (e.g., it's missing a required field or has an incorrect data type), the API will reject it.

This ensures all annotation data is reliable and correctly structured.
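
To make the two failure cases above concrete (a missing required field and an incorrect data type), here is a minimal sketch of the kind of check a validation schema performs. It is not DorsalHub's implementation: EXAMPLE_SCHEMA, its field names, and the validate() helper are hypothetical and exist only for illustration.

```python
# Hypothetical, simplified schema. This is NOT the real definition of any
# DorsalHub schema; the field names are made up for illustration.
EXAMPLE_SCHEMA = {
    "required": ["label"],
    "types": {"label": str, "score": float},
}


def validate(record: dict, schema: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    # Rule 1: every required field must be present.
    for field in schema["required"]:
        if field not in record:
            errors.append(f"missing required field: {field}")
    # Rule 2: every present field with a declared type must match that type.
    for field, expected in schema["types"].items():
        if field in record and not isinstance(record[field], expected):
            errors.append(
                f"field {field!r} should be {expected.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return errors


print(validate({"label": "cat", "score": 0.9}, EXAMPLE_SCHEMA))  # no problems
print(validate({"score": "high"}, EXAMPLE_SCHEMA))  # two problems reported
```

The first record passes. The second is rejected twice over: it lacks the required label field, and score holds a string where a float is expected.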

Open Validation Schemas

The Open Validation Schemas are a collection of schemas which are open to everyone. They adhere to the following principles:

  • Open Contribution: Any authenticated DorsalHub user is welcome to publish annotation records that validate against these schemas.
  • Community-Driven: We welcome contributions! You can visit the GitHub repository to report issues, propose new schemas, or suggest improvements.
  • Stable & Versioned: Schemas are versioned, and changes are made carefully to ensure backward compatibility wherever possible.

There are currently 9 schemas:

Schema                     Use case
open/audio-transcription   Storing transcribed text with support for timed segments, speaker identification, and non-verbal events.
open/classification        Applying labels to a file, with support for confidence scores, vocabularies, and score explanations.
open/document-extraction   Storing extracted document layout data, including text blocks, lines, bounding boxes, and polygons.
open/embedding             Storing feature vectors and identifying the algorithm used to generate them.
open/entity-extraction     Storing extracted entities for NER or slot-filling, with support for vocabularies and geometric locations.
open/generic               Storing flat, arbitrary key-value data in a flexible catch-all schema.
open/geolocation           Storing a strict GeoJSON Feature object (Point, Polygon, etc.) associated with a file.
open/llm-output            Storing LLM prompts and responses, including generation parameters, provenance, and evaluation scores.
open/object-detection      Identifying objects using boxes or polygons, with support for hierarchical relationships and vocabularies.
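
As an illustration of the kind of data open/geolocation describes, the sketch below builds a GeoJSON Feature as defined by RFC 7946, which requires the members type, geometry, and properties. Only the GeoJSON structure is taken from that standard. Whether the schema expects the Feature at the top level of the annotation or under some key is not stated here, so treat the layout, the sample coordinates, and the looks_like_feature() helper as assumptions.

```python
import json

# A GeoJSON Feature wrapping a Point geometry.
# GeoJSON coordinates are ordered [longitude, latitude].
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-0.1276, 51.5072]},
    "properties": {"name": "London"},  # illustrative properties only
}


def looks_like_feature(obj: dict) -> bool:
    """Shallow check for the three members RFC 7946 requires of a Feature."""
    return (
        obj.get("type") == "Feature"
        and "geometry" in obj
        and "properties" in obj
    )


print(looks_like_feature(feature))
print(json.dumps(feature))
```

A bare geometry such as {"type": "Point", "coordinates": [0, 0]} fails this check because it is not wrapped in a Feature, which is the distinction the word "strict" in the table points at.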

Private Organization Schemas

Members of a DorsalHub Organization may have access to private schemas (e.g., my-company/invoice-data) that are visible to, and usable by, only members of that organization.