Standard Transcription JSON (STJ) Format Specification

Version: 0.5 Date: 2024-10-24

Introduction

The Standard Transcription JSON (STJ) format is a proposed standard for representing transcribed audio and video data in a structured, machine-readable JSON format. It aims to provide a comprehensive and flexible framework that is a superset of existing transcription and subtitle formats such as SRT, WebVTT, TTML, SSA/ASS, and others.

The STJ format includes detailed transcription segments with associated metadata such as speaker information, timestamps, confidence scores, language codes, and styling options. It also allows for optional metadata about the transcription process, source input, and the transcriber application.

File Extension: .stj.json
MIME Type: application/vnd.stj+json Character Encoding: UTF-8

Objectives

Interoperability: Enable seamless data exchange between different transcription services and applications.
Superset of Existing Formats: Incorporate features from common formats (SRT, WebVTT, TTML, etc.) to ensure compatibility and extensibility.
Extensibility: Allow for future enhancements without breaking compatibility.
Clarity: Provide a clear and well-documented structure for transcription data.
Utility: Include useful metadata to support a wide range of use cases.
Best Practices Compliance: Adhere to state-of-the-art best practices in metadata representation and documentation standards.

Specification

The STJ files must include a version field within the metadata section to indicate the specification version they comply with. This facilitates compatibility and proper validation across different implementations.

Version History

For a detailed list of changes between versions, please see the CHANGELOG.md file.

Overview

The STJ file is a JSON object containing two main sections:

"metadata": Contains information about the transcription process, source input, and other relevant details.
"transcript": Contains the actual transcription data, including speaker information, segments, and optional styling.

{
  "metadata": { ... },
  "transcript": { ... }
}

Mandatory vs. Optional Fields

Mandatory Fields: Essential for basic functionality and compatibility.
Optional Fields: Provide additional information and features but are not required for basic use.

Metadata Section

The "metadata" object includes optional and required fields providing context about the transcription.

Fields

transcriber (mandatory): Information about the transcriber application or service.
name (string, mandatory): Name of the transcriber application.
version (string, mandatory): Version of the transcriber application.
created_at (string, mandatory): ISO 8601 timestamp indicating when the transcription was created.
version (mandatory): Specifies the STJ specification version the file adheres to.
Format: Semantic versioning (e.g., "0.5.0")
Pattern: Must follow the regex pattern ^\d+\.\d+\.\d+$ to ensure semantic versioning.
source (optional): Information about the source of the audio/video.
uri (string, optional): The URI of the source media.
- MUST conform to the URI Format Requirements specified in the Field Definitions and Constraints section.
duration (number, optional): Duration of the media in seconds.
languages (array of strings, optional): List of languages present in the source media, ordered by prevalence.
languages (array of strings, optional): List of languages present in the transcription, ordered by prevalence.
confidence_threshold (number, optional): Confidence threshold used during transcription (0.0 - 1.0).
extensions (object, optional): A key-value map for any additional metadata.
MUST conform to the Extensions Field Requirements specified in the Field Definitions and Constraints section.

Clarification on `languages` Fields

The STJ format includes two languages fields within the metadata section to distinguish between the languages present in the source media and those represented in the transcription.

metadata.source.languages (array of strings, optional):
Definition: Languages expected or detected in the source media.
Purpose: Indicates the original languages spoken in the audio/video content.
Use Case: Useful for applications that need to know what languages are present in the source, perhaps for transcription, translation, or language detection purposes.
metadata.languages (array of strings, optional):
Definition: Languages present in the transcription.
Purpose: Indicates the languages included in the transcription data within the STJ file.
Use Case: Essential for applications processing the transcription to know what languages they need to handle. This list may differ from metadata.source.languages if the transcription excludes some source languages or includes translations into new languages.

Example

"metadata": {
  "transcriber": {
    "name": "YAWT",
    "version": "0.4.0"
  },
  "created_at": "2023-10-20T12:00:00Z",
  "version": "0.5.0",
  "source": {
    "uri": "https://example.com/multilingual_media.mp4",
    "duration": 3600.5,
    "languages": ["en", "es"]  // Source languages: English and Spanish
  },
  "languages": ["fr"],          // Transcription language: French
  "confidence_threshold": 0.6,
  "extensions": {
    "project_info": {
      "project": "International Conference",
      "client": "Global Events Inc."
    }
  }
}

In this example, the source media contains English and Spanish, but the transcription has been translated into French.

Transcript Section

The "transcript" object contains the transcription data, including speaker information, segments, and optional styling.

Fields

speakers (array, optional): List of speaker objects.
styles (array, optional): List of style definitions for formatting and positioning.
segments (array, mandatory): List of transcription segments.

Speakers

Each speaker object includes:

id (string, mandatory): Unique identifier for the speaker.
MUST conform to the Speaker ID Requirements specified in the Field Definitions and Constraints section.
name (string, optional): Display name of the speaker.
extensions (object, optional): Any additional information about the speaker.

Example

"speakers": [
  { "id": "Speaker1", "name": "Dr. Smith" },
  { "id": "Speaker2", "name": "Señora García" },
  { "id": "Speaker3", "name": "Monsieur Dupont" },
  { "id": "Speaker4", "name": "Unknown" } // Anonymous speaker
]

Styles

Each style object defines text presentation rules that can be referenced by segments. Basic formatting features are defined in a format-agnostic way, while advanced features can be implemented using extensions.

Required Fields

id (string, mandatory): Unique identifier for the style
MUST be non-empty
MUST be unique within the styles array

Optional Fields

text (object, optional): Text appearance
color (string, optional): Text color in #RRGGBB format
background (string, optional): Background color in #RRGGBB format
bold (boolean, optional): Bold text
italic (boolean, optional): Italic text
underline (boolean, optional): Underlined text
size (string, optional): Size as percentage, e.g. "120%"
display (object, optional): Visual presentation settings
align (string, optional): Text alignment
- Allowed values: "left", "center", "right"
- Default: "left"
vertical (string, optional): Vertical position
- Allowed values: "top", "middle", "bottom"
- Default: "bottom"
position (object, optional): Precise positioning
- x (string, optional): Horizontal position as percentage
- y (string, optional): Vertical position as percentage
extensions (object, optional): Additional styling information
MAY contain format-specific styling properties organized under namespaces.
Format-specific properties SHOULD be placed within appropriately named namespaces in extensions.
Examples:
- Under the custom_webvtt namespace:
json "extensions": { "custom_webvtt": { "line": "auto", "position": "50%", "size": "100%" } }
- Under the custom_ssa namespace:
json "extensions": { "custom_ssa": { "effect": "karaoke", "outline": 2, "shadow": 1 } }

Examples

Basic style:

{
  "id": "speaker_1",
  "text": {
    "color": "#2E4053",
    "bold": true,
    "size": "110%"
  }
}

Style with positioning:

{
  "id": "caption_style",
  "text": {
    "color": "#FFFFFF",
    "background": "#000000"
  },
  "display": {
    "align": "center",
    "vertical": "bottom",
    "position": {
      "x": "50%",
      "y": "90%"
    }
  }
}

Style with format-specific features:

{
  "id": "advanced_style",
  "text": {
    "color": "#FFFFFF"
  },
  "extensions": {
    "custom_ssa": {
      "effect": "karaoke",
      "outline": 2,
      "shadow": 1
    }
  }
}

Segments

Each segment object includes:

start (number, mandatory): Start time of the segment in seconds.
end (number, mandatory): End time of the segment in seconds.
text (string, mandatory): Transcribed text of the segment.
speaker_id (string, optional): The id of the speaker from the speakers list.
confidence (number, optional): Confidence score for the segment (0.0 - 1.0).
language (string, optional): Language code for the segment (ISO 639-1 or ISO 639-3).
style_id (string, optional): The id of the style from the styles list.
words (array, optional): List of word-level details.
start (number, mandatory): Start time of the word in seconds.
end (number, mandatory): End time of the word in seconds.
text (string, mandatory): The word text.
confidence (number, optional): Confidence score for the word (0.0 - 1.0).
word_timing_mode (string, optional): Indicates the completeness of word-level timing data within the segment.
extensions (object, optional): Any additional information about the segment.

Example

"segments": [
  {
    "start": 0.0,
    "end": 5.0,
    "text": "Bonjour tout le monde.",
    "speaker_id": "Speaker1",
    "confidence": 0.95,
    "language": "fr",
    "style_id": "Style1",
    "word_timing_mode": "complete",
    "words": [
      { "start": 0.0, "end": 1.0, "text": "Bonjour" },
      { "start": 1.0, "end": 2.0, "text": "tout" },
      { "start": 2.0, "end": 3.0, "text": "le" },
      { "start": 3.0, "end": 4.0, "text": "monde." }
    ]
  },
  {
    "start": 5.1,
    "end": 10.0,
    "text": "Gracias por estar aquí hoy.",
    "speaker_id": "Speaker2",
    "confidence": 0.93,
    "language": "es",
    "word_timing_mode": "partial",
    "words": [
      { "start": 5.1, "end": 5.5, "text": "Gracias" }
      // Remaining words are not included
    ]
  },
  {
    "start": 10.1,
    "end": 15.0,
    "text": "Hello everyone, and welcome.",
    "speaker_id": "Speaker3",
    "confidence": 0.92,
    "language": "en",
    "word_timing_mode": "none"
    // No words array provided
  }
]

In this example:

The first segment has complete word-level data (word_timing_mode: "complete").
The second segment has partial word-level data (word_timing_mode: "partial").
The third segment has no word-level data (word_timing_mode: "none" or omitted).

Handling Multiple Languages

Global Language Lists:
metadata.source.languages (array of strings, optional):
- Purpose: Lists the languages detected or expected in the source media.
- Usage: Helps in understanding the linguistic content of the source, which is vital for transcription services, translators, and language processing tools.
metadata.languages (array of strings, optional):
- Purpose: Lists the languages present in the transcription data.
- Usage: Indicates which languages are included in the STJ file. This list may differ from metadata.source.languages if the transcription excludes some source languages or includes translations.
Segment-Level Language:
Each segment specifies its language using the language field.
Useful for:
- Multilingual Transcriptions: When the transcription includes multiple languages.
- Translations: When segments have been translated into different languages.

Example Scenario: Translated Transcription

Imagine a video where presenters speak in English and Spanish, and the transcription has been translated entirely into French and German.

"metadata": {
  "transcriber": {
    "name": "YAWT",
    "version": "0.4.0"
  },
  "created_at": "2023-10-20T12:00:00Z",
  "version": "0.5.0",
  "source": {
    "uri": "https://example.com/event.mp4",
    "duration": 5400.0,
    "languages": ["en", "es"]
  },
  "languages": ["fr", "de"],
  "extensions": { ... }
},
"transcript": {
  "segments": [
    {
      "start": 0.0,
      "end": 5.0,
      "text": "Bonjour à tous.",
      "speaker_id": "Speaker1",
      "confidence": 0.95,
      "language": "fr"
    },
    {
      "start": 5.1,
      "end": 10.0,
      "text": "Willkommen alle zusammen.",
      "speaker_id": "Speaker2",
      "confidence": 0.94,
      "language": "de"
    }
    // More segments...
  ]
}

In this example:

The source media languages are English ("en") and Spanish ("es").
The transcription languages are French ("fr") and German ("de").
Each segment indicates the language of the transcribed text.

Optional vs. Mandatory Fields Summary

Mandatory Fields:
metadata.transcriber.name
metadata.transcriber.version
metadata.created_at
metadata.version
transcript.segments (array)
transcript.segments[].start
transcript.segments[].end
transcript.segments[].text
Optional Fields:
All other fields, including speakers, styles, speaker_id, confidence, language, style_id, words, word_timing_mode, etc.

Field Definitions and Constraints

Time Format Requirements

All time values in the STJ format (start and end fields) must follow these requirements:

Format Specifications

Must be represented as non-negative decimal numbers
Must have a precision of up to 3 decimal places (millisecond precision)
Must not exceed 6 significant digits before the decimal point
Values must be in the range [0.000, 999999.999]
Leading zeros before the decimal point are allowed but not required
Trailing zeros after the decimal point are allowed but not required
The decimal point must be present if there are decimal places
Scientific notation is not allowed

Basic Constraints

For any segment or word:
start must not be greater than end
Both start and end must be present and valid according to format specifications
For zero-duration items (start equals end):
Must include is_zero_duration: true
For segments:
- Must not contain a words array
- Must not specify a word_timing_mode
The is_zero_duration field:
Must be true if and only if start equals end
Must be false or omitted for items where start does not equal end
Must not be included with value false when start equals end

Examples of Valid Time Values

0 (zero seconds)
0.0 (zero seconds)
0.000 (zero seconds with full precision)
1.5 (one and a half seconds)
10.100 (ten and one hundred milliseconds)
999999.999 (maximum allowed value)

Examples of Invalid Time Values

-1.0 (negative values not allowed)
1.5e3 (scientific notation not allowed)
1000000.0 (exceeds maximum value)
1.2345 (exceeds maximum precision)
1,5 (incorrect decimal separator)

Character Encoding Requirements

Basic Requirements

Files MUST be encoded in UTF-8
The UTF-8 Byte Order Mark (BOM) is optional
JSON string values MUST follow RFC 8259 encoding rules
The full Unicode character set MUST be supported

String Content Requirements

All string values MUST:
Be valid UTF-8 encoded text
Properly escape control characters (U+0000 through U+001F)
Properly handle surrogate pairs for supplementary plane characters
Forward slash (/) characters MAY be escaped but escaping is not required
Unicode normalization:
All string values SHOULD be normalized to Unicode Normalization Form C (NFC)
Applications MUST preserve the normalization form of input text
Applications MAY normalize text for comparison or search operations

Confidence Scores

Confidence scores are floating-point numbers between 0.0 (no confidence) and 1.0 (full confidence). They are optional but recommended.

Language Codes

ISO 639-1 (two-letter codes) is the primary standard and MUST be used when the language has an ISO 639-1 code.

Example: Use "en" for English, "fr" for French, "es" for Spanish

ISO 639-3 (three-letter codes) MUST only be used for languages that do not have an ISO 639-1 code.

Example: Use "yue" for Cantonese (no ISO 639-1 code), but use "zh" for Mandarin Chinese (has ISO 639-1 code)

Consistency Requirements

A single STJ file MUST NOT mix ISO 639-1 and ISO 639-3 codes for the same language
All references to a specific language within a file MUST use the same code consistently
When a language has both ISO 639-1 and ISO 639-3 codes, the ISO 639-1 code MUST be used

Application Requirements

Applications MUST:

Process both ISO 639-1 and ISO 639-3 codes
Validate that:
ISO 639-1 codes are used when available
ISO 639-3 codes are only used for languages without ISO 639-1 codes
Language codes are used consistently throughout the file
Reject files that:
Use ISO 639-3 codes for languages that have ISO 639-1 codes
Mix different standards for the same language
Contain invalid language codes

URI Format Requirements

Purpose

Defines the format and constraints for the uri field in the metadata.source object.

Format Specifications

Type: String representing a Uniform Resource Identifier (URI) as defined in RFC 3986.
Allowed Schemes:
Recommended:
- http
- https
- file
Optional:
- Other schemes (e.g., ftp, s3, rtsp) MAY be used if appropriate.
Absolute URIs:
The uri SHOULD be an absolute URI, including the scheme component.
Examples:
- "http://example.com/media/video.mp4"
- "https://example.com/media/audio.mp3"
- "file:///C:/Media/video.mp4" (Windows)
- "file:///home/user/media/audio.mp3" (Unix-like systems)
Relative URIs:
Relative URIs or file paths SHOULD NOT be used.
If a relative URI is provided, consuming applications MUST resolve it relative to a known base URI.
Note: Relative URIs can lead to ambiguity and are discouraged.

Validation Rules

The uri MUST conform to the syntax defined in RFC 3986.
Implementations SHOULD validate the URI format and report errors if invalid.
Scheme Support:
Required Support:
- Implementations MUST support http and https schemes.
Optional Support:
- Support for other schemes is OPTIONAL and may vary between implementations.

Security Considerations

Privacy:
Be cautious when including URIs that may reveal sensitive information, such as local file paths or internal network addresses.
Consider omitting the uri or sanitizing it if privacy is a concern.
Security Risks:
Applications consuming STJ files SHOULD handle URIs carefully to avoid security risks such as directory traversal or accessing unauthorized resources.

Examples

HTTP URI:

json "uri": "http://example.com/media/video.mp4"

HTTPS URI:

json "uri": "https://example.com/media/audio.mp3"

File URI (Windows Path):

json "uri": "file:///C:/Media/video.mp4"

File URI (Unix Path):

json "uri": "file:///home/user/media/audio.mp3"

S3 URI (Optional Scheme):

json "uri": "s3://bucket-name/path/to/object"

Speaker IDs

Format Specifications

Type: String
Allowed Characters: Letters (A-Z, a-z), digits (0-9), underscores (_), and hyphens (-).
Length Constraints:
Minimum length: 1 character
Maximum length: 64 characters
Uniqueness:
Speaker IDs MUST be unique within the speakers list.
speaker_id references in segments MUST match an id in the speakers list.
Case Sensitivity:
Speaker IDs are case-sensitive; Speaker1 and speaker1 are considered different IDs.
Format Recommendations:
Use meaningful identifiers when possible, e.g., Speaker_JohnDoe.
For anonymous speakers, use generic IDs like Speaker1, Speaker2, etc.

Representing Anonymous Speakers

When the Speaker is Unknown or Anonymous:
Use a consistent placeholder ID, such as Speaker1, Speaker2, etc.
The name field MAY be omitted or set to a placeholder like "Unknown" or "Anonymous".
Consistency:
Maintain consistent IDs for anonymous speakers throughout the transcript to differentiate between different speakers.
If speaker diarization is uncertain, it is acceptable to assign the same speaker_id to multiple segments where the speaker is believed to be the same.

Examples

Known Speaker:

json { "id": "Speaker_JohnDoe", "name": "John Doe" }

Anonymous Speaker:

json { "id": "Speaker1", "name": "Unknown" }

Multiple Anonymous Speakers:

json "speakers": [ { "id": "Speaker1", "name": "Unknown" }, { "id": "Speaker2", "name": "Unknown" }, { "id": "Speaker3", "name": "Unknown" } ]

Validation Rules

ID Format Validation:
IDs MUST only contain allowed characters.
IDs MUST meet the length constraints.
Uniqueness Validation:
IDs in the speakers list MUST be unique.
Duplicate IDs MUST result in a validation error.
Reference Validation:
All speaker_id references in segments MUST match an id in the speakers list.
Invalid references MUST result in a validation error.

Implementation Notes

Applications SHOULD provide meaningful error messages when validation fails due to speaker ID issues.
When Generating STJ Files:
Ensure that speaker IDs conform to the specified format requirements.
Assign consistent IDs to anonymous speakers to maintain differentiation.

Style IDs

If style_id is used, it must match an id in the styles list.

Text Fields

text fields should be in plain text format. Special formatting or markup should be handled via the styles mechanism.

Word Timing Mode Field

Purpose

Indicates the completeness of word-level timing data within the segment.

Allowed Values

"complete": All words in the segment have timing data
"partial": Only some words have timing data
"none": No word-level timing data is provided

Default Behavior

When omitted and words array is present with complete coverage: treated as "complete"
When omitted and words array is absent: treated as "none"
When omitted and words array is present but incomplete: invalid - must explicitly specify "partial"

Constraints

For "complete": All words must have timing data
For "partial": Some words must have timing data
For "none": Must not include words array

Extensions Field Requirements

Purpose

The extensions field allows for the inclusion of custom, application-specific metadata and format-specific properties without affecting compatibility with other implementations.

Structure

The extensions field, if present, MUST be a JSON object.
Each key in extensions MUST represent a namespace and MUST be a non-empty string.
The value corresponding to each namespace MUST be a JSON object containing key-value pairs specific to that namespace.

Namespaces

Namespace Naming:
Namespaces SHOULD be concise and reflect the application, format, or organization.
Examples include myapp, companyname, customformat.
Reserved Namespaces:
The following namespaces are RESERVED for future use by the STJ specification and MUST NOT be used for custom data:
- stj (reserved for STJ specification extensions)
- webvtt (reserved for WebVTT format mappings)
- ttml (reserved for TTML format mappings)
- ssa (reserved for SSA/ASS format mappings)
- srt (reserved for SubRip format mappings)
- dfxp (reserved for DFXP/Timed Text format mappings)
- smptett (reserved for SMPTE-TT format mappings)
Developer Guidance:
Developers who need to include format-specific properties before official definitions are available may:
- Use a custom namespace that clearly indicates its provisional nature, such as custom_webvtt or experimental_ttml.
- Be prepared to migrate their data to the official namespace once the STJ specification provides the definitions.

Constraints

Applications MUST ignore any namespaces in extensions that they do not recognize.
The extensions field SHOULD NOT include essential data required for basic functionality.
Nested objects and arrays ARE ALLOWED within each namespace.
Keys within namespaces MUST NOT duplicate or conflict with standard fields of the containing object.
Reserved Namespaces Validation:
Namespaces listed as RESERVED in the specification MUST NOT be used by applications for custom data.
Applications MUST report an error if a reserved namespace is used.

Examples

In a segment object:

json "extensions": { "myapp": { "custom_property": "value", "analysis_data": { "sentiment_score": 0.85, "keywords": ["innovation", "technology"] } }, "analytics": { "emotion": "happy", "confidence": 0.9 } }

In a style object with format-specific properties:

json { "id": "caption_style", "text": { "color": "#FFFFFF", "background": "#000000" }, "display": { "align": "center", "vertical": "bottom" }, "extensions": { "custom_webvtt": { "line": "auto", "position": "50%", "size": "100%" }, "myapp": { "custom_style_property": "value" } } }

In a metadata object:

json "extensions": { "project_info": { "project": "International Conference", "client": "Global Events Inc." }, "notes": { "review_status": "approved", "reviewer": "John Doe" } }

Note: Standard fields defined in the STJ specification MUST NOT be duplicated within any namespace in extensions. For example, including a key "start" within a namespace is prohibited if it conflicts with the mandatory "start" field of the segment.

Implementation Requirements

Time Value Processing

Implementations MUST:

Parse time values with up to 3 decimal places
Preserve the precision of input values up to 3 decimal places
Round any input with more than 3 decimal places to 3 decimal places using IEEE 754 round-to-nearest-even
Validate all time values according to the Time Format Requirements section

Implementations MUST reject files that contain any of the following:

Negative time values
Values exceeding 999999.999 seconds
Time values using scientific notation
Overlapping segments

Validation Requirements

Segment-Level Validation

Required Fields:
start and end times MUST conform to the Time Format Requirements section
text MUST be present and non-empty
References:
speaker_id, if present, MUST match an id in the speakers list
style_id, if present, MUST match an id in the styles list
Segment Ordering:
Segments MUST be ordered by their start times in ascending order
For segments with identical start times, they MUST be ordered by their end times in ascending order
Segment Overlap:
Segments MUST NOT overlap in time
For any two segments S1 and S2 where S1 appears before S2 in the segments array:
- S1.end MUST be less than or equal to S2.start
- Examples of valid segment ordering:
- Adjacent segments: S1(0.0, 1.0), S2(1.0, 2.0)
- Segments with gap: S1(0.0, 1.0), S2(2.0, 3.0)
- Examples of invalid segment ordering:
- Overlapping segments: S1(0.0, 2.0), S2(1.0, 3.0)
- Out of order segments: S1(1.0, 2.0), S2(0.0, 3.0)
Zero-Duration Segments:
MUST follow the zero-duration requirements defined in the Time Format Requirements section
Zero-duration segments MAY share the same timestamp

Word-Level Validation

When words array is present:
Each word object must have text, start, and end
All time values must conform to the Time Format Requirements section
Word timing constraints:
- Word times must be within the parent segment's time range
- Words must be ordered by start time
- Word timings must not overlap
Word Timing Mode Requirements:
When "complete" (or omitted with complete coverage):
- The concatenation of all text fields in words must match the segment's text, except for differences in whitespace or punctuation
When "partial":
- The text fields in words must be a subset of the words in the segment's text, in the same order
When "none" (or omitted without words array):
- Must not contain words array
Zero-Duration Words:
Must follow the zero-duration requirements defined in the Time Format Requirements section

General Validation

URI Validation Requirements:
The uri field in metadata.source MUST conform to the URI Format Requirements specified in the Field Definitions and Constraints section.
Implementations SHOULD validate the URI format according to RFC 3986.
Invalid URIs SHOULD result in a validation error or warning.
Scheme Support:
- Implementations MUST support http and https schemes.
- Support for other schemes is OPTIONAL.
Relative URIs:
- Relative URIs SHOULD NOT be used.
- If present, they MUST be resolved relative to a known base URI by the consuming application.
Language Code Requirements:
All language codes must be valid ISO 639 codes
Language codes must be consistent with the requirements defined in the Language Codes section
Confidence Score Requirements:
Confidence scores, if present, must be within the range [0.0, 1.0]
Reference Requirements:
Speaker IDs:
- All speaker_id references in segments MUST correspond to valid id entries in the speakers list.
- All id values in the speakers list MUST conform to the Speaker ID Requirements specified in the Field Definitions and Constraints section.
- All IDs in the speakers array MUST be unique.
- IDs MUST only contain allowed characters and meet length constraints.
Style IDs:
- All style_id references must correspond to valid entries in the styles list.
- All IDs in the styles array MUST be unique.
Character Encoding Requirements:
All text content must be valid UTF-8
All JSON string values must follow RFC 8259 encoding rules
Control characters must be properly escaped
Surrogate pairs must be properly formed
BOM must be handled correctly if present

Extensions Field Validation

Structure Validation:
The extensions field, if present, MUST be a JSON object.
Namespaces MUST be strings and MUST NOT be empty.
Values corresponding to namespaces MUST be JSON objects.
Reserved Namespaces Validation:
Namespaces listed as RESERVED in the specification MUST NOT be used by applications for custom data.
Applications MUST report an error if a reserved namespace is used.
Content Validation:
Applications MUST ignore any namespaces or keys within extensions that they do not recognize.
Values within namespaces MAY be validated based on application-specific requirements.
Conflict Resolution:
If a key within a namespace in extensions conflicts with a standard field, the standard field's value MUST take precedence.
Applications MUST report an error if a conflict is detected.

Style Processing

Implementations:

MAY support none, some, or all style properties
MUST ignore style properties they don't support
MUST document which style properties they support
SHOULD provide reasonable fallback behavior for unsupported properties

When converting STJ to other formats, implementations:

SHOULD document how STJ style properties map to the target format's capabilities
MAY omit style properties that cannot be represented in the target format

Advanced styling features (such as animations, karaoke effects, or complex positioning) SHOULD be implemented using format-specific properties within the extensions field under a namespace corresponding to the format.

Representing Confidence

Confidence scores provide an indication of the reliability of the transcribed text. They can be used to:

Highlight low-confidence segments for manual review.
Filter out words or segments below a certain confidence threshold.
Provide visual cues in transcription editors or viewers.

Including both segment-level and word-level confidence scores allows applications to present detailed insights into transcription accuracy.

Comparison with Existing Formats

The STJ format is designed to be a superset of common transcription and subtitle formats, incorporating their features and extending them where necessary.

SRT (SubRip)

Sequence Numbers: Not used in STJ, as sequence is implied by array order.
Timestamps: STJ uses precise start and end times in seconds.
Text: Supported via text field.
Styling: Limited in SRT; STJ supports styling via styles and style_id.

WebVTT

Text Formatting: Basic formatting supported via core style properties
Positioning: Basic positioning supported via display properties
Advanced Features: Can be represented via properties within the custom_webvtt namespace in extensions

TTML (Timed Text Markup Language)

Basic Styling: Supported via core style properties
Advanced Features: Can be represented via properties within the custom_ttml namespace in extensions
Multiple Languages: Supported via segment-level language codes

SSA/ASS

Basic Styling: Supported via core style properties
Advanced Features: Can be represented via properties within the custom_ssa namespace in extensions

Usage in Applications

The STJ format is designed to be easily parsed and utilized by a variety of applications, such as:

Transcription Editors: Tools can load STJ files to display transcriptions with speaker labels, timestamps, and styling.
Subtitle Generators: Applications can convert STJ segments into subtitle formats like SRT or WebVTT.
Speech Analytics: Analyze transcriptions for sentiment, keyword extraction, or topic modeling.
Quality Assurance: Reviewers can focus on low-confidence segments for correction.
Multilingual Support: Applications can handle multilingual transcriptions by leveraging per-segment language data.

Extensibility and Customization

Additional Metadata: Use the extensions fields in both metadata and individual objects to include custom data without affecting compatibility.
Versioning: Include a version field in metadata if needed for future format updates.
Custom Fields: Applications can add custom data within appropriately named namespaces in the extensions field to include application-specific data without affecting compatibility.

Adherence to Best Practices

The STJ format follows best practices for data interchange formats, drawing inspiration from established standards like:

IETF RFC 8259: The STJ format adheres to the JSON standard as specified in RFC 8259.
ISO 639 Language Codes: Uses standard language codes to ensure consistency.
Dublin Core Metadata Initiative (DCMI): The metadata fields are designed to align with DCMI principles where applicable.
Naming Conventions: Field names are concise and use lowercase letters with underscores for readability.
Extensibility: The format allows for future expansion without breaking existing implementations.

Final Remarks

The STJ format aims to be a comprehensive and flexible standard for transcription data representation. By incorporating features from existing formats and adhering to best practices, it strives to meet the needs of a wide range of applications and facilitate better interoperability in the field of speech transcription and subtitles.

Note: This specification is open for suggestions and improvements. Contributions from the community are welcome to refine and enhance the STJ format.

Contact: For feedback or contributions, please reach out via The STJ Repository.

Standard Transcription JSON (STJ) Format Specification

Introduction

Objectives

Specification

Version History

Overview

Mandatory vs. Optional Fields

Metadata Section

Fields

Clarification on languages Fields

Example

Transcript Section

Fields

Speakers

Example

Styles

Required Fields

Optional Fields

Examples

Segments

Example

Handling Multiple Languages

Example Scenario: Translated Transcription

Optional vs. Mandatory Fields Summary

Field Definitions and Constraints

Time Format Requirements

Format Specifications

Basic Constraints

Examples of Valid Time Values

Examples of Invalid Time Values

Character Encoding Requirements

Basic Requirements

String Content Requirements

Confidence Scores

Language Codes

Consistency Requirements

Application Requirements

URI Format Requirements

Purpose

Format Specifications

Validation Rules

Security Considerations

Examples

Speaker IDs

Format Specifications

Representing Anonymous Speakers

Examples

Validation Rules

Implementation Notes

Style IDs

Text Fields

Word Timing Mode Field

Purpose

Allowed Values

Default Behavior

Constraints

Extensions Field Requirements

Purpose

Structure

Namespaces

Constraints

Examples

Implementation Requirements

Time Value Processing

Validation Requirements

Segment-Level Validation

Word-Level Validation

General Validation

Extensions Field Validation

Style Processing

Representing Confidence

Comparison with Existing Formats

SRT (SubRip)

WebVTT

TTML (Timed Text Markup Language)

SSA/ASS

Usage in Applications

Extensibility and Customization

Adherence to Best Practices

Final Remarks

Clarification on `languages` Fields