Standard Transcription JSON (STJ) Format Specification
Version: 0.4
Date: 2024-10-22
Introduction
The Standard Transcription JSON (STJ) format is a proposed standard for representing transcribed audio and video data in a structured, machine-readable JSON format. It aims to provide a comprehensive and flexible framework that is a superset of existing transcription and subtitle formats such as SRT, WebVTT, TTML, SSA/ASS, and others.
The STJ format includes detailed transcription segments with associated metadata such as speaker information, timestamps, confidence scores, language codes, and styling options. It also allows for optional metadata about the transcription process, source input, and the transcriber application.
File Extension: .stj.json
MIME Type: application/vnd.stj+json
Objectives
- Interoperability: Enable seamless data exchange between different transcription services and applications.
- Superset of Existing Formats: Incorporate features from common formats (SRT, WebVTT, TTML, etc.) to ensure compatibility and extensibility.
- Extensibility: Allow for future enhancements without breaking compatibility.
- Clarity: Provide a clear and well-documented structure for transcription data.
- Utility: Include useful metadata to support a wide range of use cases.
- Best Practices Compliance: Adhere to state-of-the-art best practices in metadata representation and documentation standards.
Specification
Version History
Version 0.4 Changes:
- Added `word_timing_mode` field in segments to indicate the completeness of word-level timing data.
- Clarified the relationship between segment-level text and word-level details, accounting for `word_timing_mode`.
- Specified validation requirements for all parts of the JSON, including segments, words, speakers, styles, and additional fields.
- Provided additional examples demonstrating the use of `word_timing_mode`.
Overview
The STJ file is a JSON object containing two main sections:
"metadata": Contains information about the transcription process, source input, and other relevant details."transcript": Contains the actual transcription data, including speaker information, segments, and optional styling.
{
"metadata": { ... },
"transcript": { ... }
}
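As a minimal sketch of consuming this structure (assuming Python and the standard-library `json` module; the document content is a trimmed version of the examples later in this specification):

```python
import json

# A minimal STJ document as a string; json.loads yields the two top-level sections.
stj_text = """
{
  "metadata": {
    "transcriber": {"name": "YAWT", "version": "0.4.0"},
    "created_at": "2023-10-20T12:00:00Z"
  },
  "transcript": {
    "segments": [
      {"start": 0.0, "end": 5.0, "text": "Bonjour tout le monde."}
    ]
  }
}
"""

doc = json.loads(stj_text)
metadata, transcript = doc["metadata"], doc["transcript"]
print(metadata["transcriber"]["name"])  # → YAWT
print(len(transcript["segments"]))      # → 1
```

In practice the same dictionary would come from reading a `.stj.json` file from disk.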
Mandatory vs. Optional Fields
- Mandatory Fields: Essential for basic functionality and compatibility.
- Optional Fields: Provide additional information and features but are not required for basic use.
Metadata Section
The "metadata" object includes optional and required fields providing context about the transcription.
Fields
- transcriber (mandatory): Information about the transcriber application or service.
  - name (string, mandatory): Name of the transcriber application.
  - version (string, mandatory): Version of the transcriber application.
- created_at (string, mandatory): ISO 8601 timestamp indicating when the transcription was created.
- source (optional): Information about the source of the audio/video.
  - uri (string, optional): The URI or file path of the source media.
  - duration (number, optional): Duration of the media in seconds.
  - languages (array of strings, optional): List of languages present in the source media, ordered by prevalence.
- languages (array of strings, optional): List of languages present in the transcription, ordered by prevalence.
- confidence_threshold (number, optional): Confidence threshold used during transcription (0.0 - 1.0).
- additional_info (object, optional): A key-value map for any additional metadata.
Clarification on languages Fields
The STJ format includes two languages fields within the metadata section to distinguish between the languages present in the source media and those represented in the transcription.
- `metadata.source.languages` (array of strings, optional):
  - Definition: Languages expected or detected in the source media.
  - Purpose: Indicates the original languages spoken in the audio/video content.
  - Use Case: Useful for applications that need to know what languages are present in the source, perhaps for transcription, translation, or language detection purposes.
- `metadata.languages` (array of strings, optional):
  - Definition: Languages present in the transcription.
  - Purpose: Indicates the languages included in the transcription data within the STJ file.
  - Use Case: Essential for applications processing the transcription to know what languages they need to handle. This list may differ from `metadata.source.languages` if the transcription excludes some source languages or includes translations into new languages.
Example
"metadata": {
"transcriber": {
"name": "YAWT",
"version": "0.4.0"
},
"created_at": "2023-10-20T12:00:00Z",
"source": {
"uri": "https://example.com/multilingual_media.mp4",
"duration": 3600.5,
"languages": ["en", "es"] // Source languages: English and Spanish
},
"languages": ["fr"], // Transcription language: French
"confidence_threshold": 0.6,
"additional_info": {
"project": "International Conference",
"client": "Global Events Inc."
}
}
In this example, the source media contains English and Spanish, but the transcription has been translated into French.
Transcript Section
The "transcript" object contains the transcription data, including speaker information, segments, and optional styling.
Fields
- speakers (array, optional): List of speaker objects.
- styles (array, optional): List of style definitions for formatting and positioning.
- segments (array, mandatory): List of transcription segments.
Speakers
Each speaker object includes:
- id (string, mandatory): Unique identifier for the speaker.
- name (string, optional): Display name of the speaker.
- additional_info (object, optional): Any additional information about the speaker.
Example
"speakers": [
{ "id": "Speaker1", "name": "Dr. Smith" },
{ "id": "Speaker2", "name": "Señora García" },
{ "id": "Speaker3", "name": "Monsieur Dupont" }
]
Styles
Each style object includes:
- id (string, mandatory): Unique identifier for the style.
- formatting (object, optional): Text formatting options (e.g., bold, italic).
- positioning (object, optional): On-screen positioning options.
- additional_info (object, optional): Any additional information about the style.
Example
"styles": [
{
"id": "Style1",
"formatting": {
"bold": true,
"italic": false,
"underline": false,
"color": "#FFFFFF",
"background_color": "#000000"
},
"positioning": {
"align": "center",
"line": "auto",
"position": "50%",
"size": "100%"
}
}
]
Segments
Each segment object includes:
- start (number, mandatory): Start time of the segment in seconds.
- end (number, mandatory): End time of the segment in seconds.
- text (string, mandatory): Transcribed text of the segment.
- speaker_id (string, optional): The `id` of the speaker from the `speakers` list.
- confidence (number, optional): Confidence score for the segment (0.0 - 1.0).
- language (string, optional): Language code for the segment (ISO 639-1 or ISO 639-3).
- style_id (string, optional): The `id` of the style from the `styles` list.
- words (array, optional): List of word-level details.
  - start (number, mandatory): Start time of the word in seconds.
  - end (number, mandatory): End time of the word in seconds.
  - text (string, mandatory): The word text.
  - confidence (number, optional): Confidence score for the word (0.0 - 1.0).
- word_timing_mode (string, optional): Indicates the completeness of word-level timing data within the segment.
- additional_info (object, optional): Any additional information about the segment.
Example
"segments": [
{
"start": 0.0,
"end": 5.0,
"text": "Bonjour tout le monde.",
"speaker_id": "Speaker1",
"confidence": 0.95,
"language": "fr",
"style_id": "Style1",
"word_timing_mode": "complete",
"words": [
{ "start": 0.0, "end": 1.0, "text": "Bonjour" },
{ "start": 1.0, "end": 2.0, "text": "tout" },
{ "start": 2.0, "end": 3.0, "text": "le" },
{ "start": 3.0, "end": 4.0, "text": "monde." }
]
},
{
"start": 5.1,
"end": 10.0,
"text": "Gracias por estar aquí hoy.",
"speaker_id": "Speaker2",
"confidence": 0.93,
"language": "es",
"word_timing_mode": "partial",
"words": [
{ "start": 5.1, "end": 5.5, "text": "Gracias" }
// Remaining words are not included
]
},
{
"start": 10.1,
"end": 15.0,
"text": "Hello everyone, and welcome.",
"speaker_id": "Speaker3",
"confidence": 0.92,
"language": "en",
"word_timing_mode": "none"
// No words array provided
}
]
In this example:
- The first segment has complete word-level data (`word_timing_mode`: `"complete"`).
- The second segment has partial word-level data (`word_timing_mode`: `"partial"`).
- The third segment has no word-level data (`word_timing_mode`: `"none"` or omitted).
Handling Multiple Languages
- Global Language Lists:
  - `metadata.source.languages` (array of strings, optional):
    - Purpose: Lists the languages detected or expected in the source media.
    - Usage: Helps in understanding the linguistic content of the source, which is vital for transcription services, translators, and language processing tools.
  - `metadata.languages` (array of strings, optional):
    - Purpose: Lists the languages present in the transcription data.
    - Usage: Indicates which languages are included in the STJ file. This list may differ from `metadata.source.languages` if the transcription excludes some source languages or includes translations.
- Segment-Level Language:
  - Each segment specifies its language using the `language` field.
  - Useful for:
    - Multilingual Transcriptions: When the transcription includes multiple languages.
    - Translations: When segments have been translated into different languages.
Example Scenario: Translated Transcription
Imagine a video where presenters speak in English and Spanish, and the transcription has been translated entirely into French and German.
"metadata": {
"transcriber": {
"name": "YAWT",
"version": "0.4.0"
},
"created_at": "2023-10-20T12:00:00Z",
"source": {
"uri": "https://example.com/event.mp4",
"duration": 5400.0,
"languages": ["en", "es"]
},
"languages": ["fr", "de"],
"additional_info": { ... }
},
"transcript": {
"segments": [
{
"start": 0.0,
"end": 5.0,
"text": "Bonjour à tous.",
"speaker_id": "Speaker1",
"confidence": 0.95,
"language": "fr"
},
{
"start": 5.1,
"end": 10.0,
"text": "Willkommen alle zusammen.",
"speaker_id": "Speaker2",
"confidence": 0.94,
"language": "de"
}
// More segments...
]
}
In this example:
- The source media languages are English (`"en"`) and Spanish (`"es"`).
- The transcription languages are French (`"fr"`) and German (`"de"`).
- Each segment indicates the language of the transcribed text.
Optional vs. Mandatory Fields Summary
- Mandatory Fields:
  - `metadata.transcriber.name`
  - `metadata.transcriber.version`
  - `metadata.created_at`
  - `transcript.segments` (array)
  - `transcript.segments[].start`
  - `transcript.segments[].end`
  - `transcript.segments[].text`
- Optional Fields:
  - All other fields, including `speakers`, `styles`, `speaker_id`, `confidence`, `language`, `style_id`, `words`, `word_timing_mode`, etc.
Field Definitions and Constraints
- Time Fields:
  - All time-related fields (`start`, `end`) are in seconds and can have fractional values to represent milliseconds.
  - Constraints:
    - `start` must not be greater than `end`.
    - Segments should not overlap in time.
    - For zero-duration words or segments (`start` equals `end`), include the appropriate duration field (`word_duration` or `segment_duration`) set to `"zero"`.
- Confidence Scores:
  - Confidence scores are floating-point numbers between `0.0` (no confidence) and `1.0` (full confidence). They are optional but recommended.
- Language Codes:
  - Use ISO 639-1 (two-letter codes) or ISO 639-3 (three-letter codes) for language representation.
- Speaker IDs:
  - If `speaker_id` is used, it must match an `id` in the `speakers` list.
- Style IDs:
  - If `style_id` is used, it must match an `id` in the `styles` list.
- Text Fields:
  - `text` fields should be in plain text format. Special formatting or markup should be handled via the `styles` mechanism.
- `word_timing_mode` Field:
  - Purpose: Indicates the completeness of word-level timing data within the segment.
  - Allowed Values:
    - `"complete"`: All words in the segment have timing data.
    - `"partial"`: Only some words have timing data.
    - `"none"`: No word-level timing data is provided.
  - Constraints:
    - If the `words` array is present and covers all words, `word_timing_mode` may be omitted (defaulting to `"complete"`).
    - If the `words` array is present but does not cover all words, `word_timing_mode` must be set to `"partial"`.
    - If the `words` array is absent, `word_timing_mode` should be `"none"` or omitted.
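These defaulting constraints can be resolved in a few lines. A sketch (hypothetical helper name) that returns the effective mode for a segment, rejecting unknown values:

```python
def effective_word_timing_mode(segment):
    """Resolve the effective word_timing_mode for a segment dict (a sketch).

    An explicit value wins; otherwise a present, non-empty words array
    defaults to "complete" and an absent one to "none".
    """
    mode = segment.get("word_timing_mode")
    if mode is not None:
        if mode not in ("complete", "partial", "none"):
            raise ValueError(f"invalid word_timing_mode: {mode!r}")
        return mode
    return "complete" if segment.get("words") else "none"

print(effective_word_timing_mode(
    {"text": "Hi.", "words": [{"start": 0.0, "end": 1.0, "text": "Hi."}]}))  # → complete
print(effective_word_timing_mode({"text": "Hi."}))  # → none
```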
Validation Requirements
Segment-Level Validation
- Required Fields:
  - `start` and `end` times are present and `start` ≤ `end`.
  - `text` is present and non-empty.
- References:
  - `speaker_id`, if present, must match an `id` in the `speakers` list.
  - `style_id`, if present, must match an `id` in the `styles` list.
- Timing:
  - Segments should not overlap in time.
- Zero-Length Segments:
  - If `start` equals `end`, include `segment_duration` set to `"zero"` in `additional_info`.
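A validator for these segment-level rules might look like the following sketch (hypothetical function name; it collects violations rather than raising, and is not exhaustive):

```python
def validate_segment(segment, speaker_ids, style_ids):
    """Collect segment-level rule violations as human-readable strings."""
    errors = []
    if "start" not in segment or "end" not in segment:
        errors.append("start and end are required")
    elif segment["start"] > segment["end"]:
        errors.append("start must not be greater than end")
    elif segment["start"] == segment["end"]:
        # Zero-length segments must be flagged explicitly in additional_info.
        if segment.get("additional_info", {}).get("segment_duration") != "zero":
            errors.append('zero-length segment must set additional_info.segment_duration to "zero"')
    if not segment.get("text"):
        errors.append("text must be present and non-empty")
    if "speaker_id" in segment and segment["speaker_id"] not in speaker_ids:
        errors.append(f"unknown speaker_id {segment['speaker_id']!r}")
    if "style_id" in segment and segment["style_id"] not in style_ids:
        errors.append(f"unknown style_id {segment['style_id']!r}")
    return errors

bad = {"start": 5.0, "end": 4.0, "text": "", "speaker_id": "Speaker9"}
print(validate_segment(bad, {"Speaker1"}, set()))  # three violations
```

Overlap between consecutive segments is a file-level property and is better checked once over the whole `segments` array.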
Word-Level Validation
- When the `words` array is present:
  - Each word object must have `text`, `start`, and `end`.
  - Word `start` and `end` times must be within the segment's `start` and `end` times.
  - Words should be ordered by `start` time.
  - Word timings should not overlap.
- `word_timing_mode` Field:
  - When `word_timing_mode` is `"complete"` or omitted:
    - The concatenation of all `text` fields in `words` must match the segment's `text`, except for differences in whitespace or punctuation.
  - When `word_timing_mode` is `"partial"`:
    - The `text` fields in `words` must be a subset of the words in the segment's `text`, in the same order.
- Zero-Length Words:
  - If a word's `start` equals `end`, include `word_duration` set to `"zero"` in `additional_info`.
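The text-matching rules above can be checked by comparing normalized tokens. A sketch (hypothetical helper names; normalization here lowercases and strips punctuation, one reasonable reading of "except for differences in whitespace or punctuation"):

```python
import re

def _tokens(text):
    """Lowercase word tokens with punctuation stripped, for lenient comparison."""
    return re.findall(r"\w+", text.lower())

def check_word_texts(segment):
    """Check the words/text relationship: equality for "complete", ordered
    subsequence for "partial". Returns True when the segment is consistent."""
    words = segment.get("words")
    if not words:
        return True
    mode = segment.get("word_timing_mode", "complete")
    seg_tokens = _tokens(segment["text"])
    word_tokens = [tok for w in words for tok in _tokens(w["text"])]
    if mode == "complete":
        return word_tokens == seg_tokens
    # partial: every word token must appear in the segment, in order.
    it = iter(seg_tokens)
    return all(tok in it for tok in word_tokens)

seg = {"text": "Gracias por estar aquí hoy.",
       "word_timing_mode": "partial",
       "words": [{"text": "Gracias", "start": 5.1, "end": 5.5}]}
print(check_word_texts(seg))  # → True
```

The `tok in it` idiom consumes the iterator, which is what enforces the "in the same order" requirement for `"partial"`.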
Overall Consistency
- Language Codes:
- All language codes must be valid ISO 639 codes.
- Confidence Scores:
- Confidence scores, if present, must be within the range [0.0, 1.0].
- References:
  - All `speaker_id` and `style_id` references must correspond to valid entries in the `speakers` and `styles` lists, respectively.
- Unique IDs:
  - All IDs used in `speakers` and `styles` must be unique within their respective arrays.
Additional Validation
- Time Fields:
  - All time fields (`start`, `end`) must be non-negative numbers.
- Segment Ordering:
  - Segments should be ordered by their `start` times.
- No Overlapping Segments:
  - Segments should not overlap in time.
- Additional Info Fields:
  - The `additional_info` field, if used, should be an object containing key-value pairs.
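The file-level checks (unique IDs, ordering, overlap, non-negative times) operate on the whole `transcript` object. A sketch, again with a hypothetical function name and non-exhaustive coverage:

```python
def check_transcript_consistency(transcript):
    """Collect file-level violations: duplicate IDs, negative times,
    out-of-order segments, and overlapping segments."""
    errors = []
    for key in ("speakers", "styles"):
        ids = [item["id"] for item in transcript.get(key, [])]
        if len(ids) != len(set(ids)):
            errors.append(f"duplicate ids in {key}")
    segments = transcript.get("segments", [])
    for i, seg in enumerate(segments):
        if seg["start"] < 0 or seg["end"] < 0:
            errors.append(f"segment {i}: negative time")
        if i and seg["start"] < segments[i - 1]["start"]:
            errors.append(f"segment {i}: not ordered by start time")
        if i and seg["start"] < segments[i - 1]["end"]:
            errors.append(f"segment {i}: overlaps previous segment")
    return errors

transcript = {
    "speakers": [{"id": "S1"}, {"id": "S1"}],
    "segments": [{"start": 0.0, "end": 5.0, "text": "a"},
                 {"start": 4.5, "end": 6.0, "text": "b"}],
}
print(check_transcript_consistency(transcript))  # duplicate ids + one overlap
```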
Representing Confidence
Confidence scores provide an indication of the reliability of the transcribed text. They can be used to:
- Highlight low-confidence segments for manual review.
- Filter out words or segments below a certain confidence threshold.
- Provide visual cues in transcription editors or viewers.
Including both segment-level and word-level confidence scores allows applications to present detailed insights into transcription accuracy.
Comparison with Existing Formats
The STJ format is designed to be a superset of common transcription and subtitle formats, incorporating their features and extending them where necessary.
SRT (SubRip)
- Sequence Numbers: Not used in STJ, as sequence is implied by array order.
- Timestamps: STJ uses precise start and end times in seconds.
- Text: Supported via the `text` field.
- Styling: Limited in SRT; STJ supports styling via `styles` and `style_id`.
WebVTT
- Text Formatting: STJ can represent formatting via `styles`.
- Positioning: Supported via `styles.positioning`.
- Cue Settings: Can be represented in `styles`.
TTML (Timed Text Markup Language)
- Complex Styling: STJ can represent complex styles and formatting.
- Layout and Regions: Can be mapped using `styles` and positioning options.
- Multiple Languages: STJ supports per-segment language codes.
SSA/ASS
- Advanced Styling: STJ supports advanced styling through `styles`.
- Karaoke Effects: Not directly supported, but can be extended via `additional_info`.
Usage in Applications
The STJ format is designed to be easily parsed and utilized by a variety of applications, such as:
- Transcription Editors: Tools can load STJ files to display transcriptions with speaker labels, timestamps, and styling.
- Subtitle Generators: Applications can convert STJ segments into subtitle formats like SRT or WebVTT.
- Speech Analytics: Analyze transcriptions for sentiment, keyword extraction, or topic modeling.
- Quality Assurance: Reviewers can focus on low-confidence segments for correction.
- Multilingual Support: Applications can handle multilingual transcriptions by leveraging per-segment language data.
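As an illustration of the subtitle-generator use case, the sketch below converts STJ segments to SRT text. The helper name is hypothetical; it maps seconds to SRT's `HH:MM:SS,mmm` timestamps and numbers cues by array order, since STJ carries no sequence numbers:

```python
def to_srt(segments):
    """Render a list of STJ segment dicts as SRT subtitle text (a sketch)."""
    def stamp(seconds):
        # SRT timestamps are HH:MM:SS,mmm with millisecond precision.
        ms = round(seconds * 1000)
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{stamp(seg['start'])} --> {stamp(seg['end'])}\n{seg['text']}")
    return "\n\n".join(blocks) + "\n"

print(to_srt([{"start": 0.0, "end": 5.0, "text": "Bonjour tout le monde."}]))
```

Speaker labels, styling, and word-level timing are simply dropped here; a fuller converter could prepend `speaker_id` names or map `styles` onto SRT's limited inline tags.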
Extensibility and Customization
- Additional Metadata: Use the `additional_info` fields in both `metadata` and individual objects to include custom data without affecting compatibility.
- Versioning: Include a `version` field in `metadata` if needed for future format updates.
- Custom Fields: Applications can add custom fields prefixed with `x_` to include application-specific data without affecting compatibility.
Adherence to Best Practices
The STJ format follows best practices for data interchange formats, drawing inspiration from established standards like:
- IETF RFC 8259: The STJ format adheres to the JSON standard as specified in RFC 8259.
- ISO 639 Language Codes: Uses standard language codes to ensure consistency.
- Dublin Core Metadata Initiative (DCMI): The metadata fields are designed to align with DCMI principles where applicable.
- Naming Conventions: Field names are concise and use lowercase letters with underscores for readability.
- Extensibility: The format allows for future expansion without breaking existing implementations.
Final Remarks
The STJ format aims to be a comprehensive and flexible standard for transcription data representation. By incorporating features from existing formats and adhering to best practices, it strives to meet the needs of a wide range of applications and facilitate better interoperability in the field of speech transcription and subtitles.
Note: This specification is open for suggestions and improvements. Contributions from the community are welcome to refine and enhance the STJ format.
Contact: For feedback or contributions, please reach out via The STJ Repository.