Standard Transcription JSON (STJ) Format Specification
Version: 0.5 Date: 2024-10-24
Introduction
The Standard Transcription JSON (STJ) format is a proposed standard for representing transcribed audio and video data in a structured, machine-readable JSON format. It aims to provide a comprehensive and flexible framework that is a superset of existing transcription and subtitle formats such as SRT, WebVTT, TTML, SSA/ASS, and others.
The STJ format includes detailed transcription segments with associated metadata such as speaker information, timestamps, confidence scores, language codes, and styling options. It also allows for optional metadata about the transcription process, source input, and the transcriber application.
File Extension: .stj.json
MIME Type: application/vnd.stj+json
Character Encoding: UTF-8
Objectives
- Interoperability: Enable seamless data exchange between different transcription services and applications.
- Superset of Existing Formats: Incorporate features from common formats (SRT, WebVTT, TTML, etc.) to ensure compatibility and extensibility.
- Extensibility: Allow for future enhancements without breaking compatibility.
- Clarity: Provide a clear and well-documented structure for transcription data.
- Utility: Include useful metadata to support a wide range of use cases.
- Best Practices Compliance: Adhere to state-of-the-art best practices in metadata representation and documentation standards.
Specification
The STJ files must include a version field within the metadata section to indicate the specification version they comply with. This facilitates compatibility and proper validation across different implementations.
Version History
For a detailed list of changes between versions, please see the CHANGELOG.md file.
Overview
The STJ file is a JSON object containing two main sections:
"metadata": Contains information about the transcription process, source input, and other relevant details."transcript": Contains the actual transcription data, including speaker information, segments, and optional styling.
{
"metadata": { ... },
"transcript": { ... }
}
Mandatory vs. Optional Fields
- Mandatory Fields: Essential for basic functionality and compatibility.
- Optional Fields: Provide additional information and features but are not required for basic use.
Metadata Section
The "metadata" object includes optional and required fields providing context about the transcription.
Fields
- transcriber (mandatory): Information about the transcriber application or service.
- name (string, mandatory): Name of the transcriber application.
- version (string, mandatory): Version of the transcriber application.
- created_at (string, mandatory): ISO 8601 timestamp indicating when the transcription was created.
- version (mandatory): Specifies the STJ specification version the file adheres to.
- Format: Semantic versioning (e.g.,
"0.5.0") - Pattern: Must follow the regex pattern
^\d+\.\d+\.\d+$to ensure semantic versioning. - source (optional): Information about the source of the audio/video.
- uri (string, optional): The URI of the source media.
- MUST conform to the URI Format Requirements specified in the Field Definitions and Constraints section.
- duration (number, optional): Duration of the media in seconds.
- languages (array of strings, optional): List of languages present in the source media, ordered by prevalence.
- languages (array of strings, optional): List of languages present in the transcription, ordered by prevalence.
- confidence_threshold (number, optional): Confidence threshold used during transcription (0.0 - 1.0).
- extensions (object, optional): A key-value map for any additional metadata.
- MUST conform to the Extensions Field Requirements specified in the Field Definitions and Constraints section.
Clarification on languages Fields
The STJ format includes two languages fields within the metadata section to distinguish between the languages present in the source media and those represented in the transcription.
metadata.source.languages(array of strings, optional):- Definition: Languages expected or detected in the source media.
- Purpose: Indicates the original languages spoken in the audio/video content.
-
Use Case: Useful for applications that need to know what languages are present in the source, perhaps for transcription, translation, or language detection purposes.
-
metadata.languages(array of strings, optional): - Definition: Languages present in the transcription.
- Purpose: Indicates the languages included in the transcription data within the STJ file.
- Use Case: Essential for applications processing the transcription to know what languages they need to handle. This list may differ from
metadata.source.languagesif the transcription excludes some source languages or includes translations into new languages.
Example
"metadata": {
"transcriber": {
"name": "YAWT",
"version": "0.4.0"
},
"created_at": "2023-10-20T12:00:00Z",
"version": "0.5.0",
"source": {
"uri": "https://example.com/multilingual_media.mp4",
"duration": 3600.5,
"languages": ["en", "es"] // Source languages: English and Spanish
},
"languages": ["fr"], // Transcription language: French
"confidence_threshold": 0.6,
"extensions": {
"project_info": {
"project": "International Conference",
"client": "Global Events Inc."
}
}
}
In this example, the source media contains English and Spanish, but the transcription has been translated into French.
Transcript Section
The "transcript" object contains the transcription data, including speaker information, segments, and optional styling.
Fields
- speakers (array, optional): List of speaker objects.
- styles (array, optional): List of style definitions for formatting and positioning.
- segments (array, mandatory): List of transcription segments.
Speakers
Each speaker object includes:
- id (string, mandatory): Unique identifier for the speaker.
- MUST conform to the Speaker ID Requirements specified in the Field Definitions and Constraints section.
- name (string, optional): Display name of the speaker.
- extensions (object, optional): Any additional information about the speaker.
Example
"speakers": [
{ "id": "Speaker1", "name": "Dr. Smith" },
{ "id": "Speaker2", "name": "Señora García" },
{ "id": "Speaker3", "name": "Monsieur Dupont" },
{ "id": "Speaker4", "name": "Unknown" } // Anonymous speaker
]
Styles
Each style object defines text presentation rules that can be referenced by segments. Basic formatting features are defined in a format-agnostic way, while advanced features can be implemented using extensions.
Required Fields
- id (string, mandatory): Unique identifier for the style
- MUST be non-empty
- MUST be unique within the styles array
Optional Fields
- text (object, optional): Text appearance
- color (string, optional): Text color in #RRGGBB format
- background (string, optional): Background color in #RRGGBB format
- bold (boolean, optional): Bold text
- italic (boolean, optional): Italic text
- underline (boolean, optional): Underlined text
-
size (string, optional): Size as percentage, e.g. "120%"
-
display (object, optional): Visual presentation settings
- align (string, optional): Text alignment
- Allowed values:
"left","center","right" - Default:
"left"
- Allowed values:
- vertical (string, optional): Vertical position
- Allowed values:
"top","middle","bottom" - Default:
"bottom"
- Allowed values:
-
position (object, optional): Precise positioning
- x (string, optional): Horizontal position as percentage
- y (string, optional): Vertical position as percentage
-
extensions (object, optional): Additional styling information
- MAY contain format-specific styling properties organized under namespaces.
- Format-specific properties SHOULD be placed within appropriately named namespaces in
extensions. -
Examples:
- Under the
custom_webvttnamespace:
json "extensions": { "custom_webvtt": { "line": "auto", "position": "50%", "size": "100%" } }- Under the
custom_ssanamespace:
json "extensions": { "custom_ssa": { "effect": "karaoke", "outline": 2, "shadow": 1 } } - Under the
Examples
Basic style:
{
"id": "speaker_1",
"text": {
"color": "#2E4053",
"bold": true,
"size": "110%"
}
}
Style with positioning:
{
"id": "caption_style",
"text": {
"color": "#FFFFFF",
"background": "#000000"
},
"display": {
"align": "center",
"vertical": "bottom",
"position": {
"x": "50%",
"y": "90%"
}
}
}
Style with format-specific features:
{
"id": "advanced_style",
"text": {
"color": "#FFFFFF"
},
"extensions": {
"custom_ssa": {
"effect": "karaoke",
"outline": 2,
"shadow": 1
}
}
}
Segments
Each segment object includes:
- start (number, mandatory): Start time of the segment in seconds.
- end (number, mandatory): End time of the segment in seconds.
- text (string, mandatory): Transcribed text of the segment.
- speaker_id (string, optional): The
idof the speaker from thespeakerslist. - confidence (number, optional): Confidence score for the segment (0.0 - 1.0).
- language (string, optional): Language code for the segment (ISO 639-1 or ISO 639-3).
- style_id (string, optional): The
idof the style from thestyleslist. - words (array, optional): List of word-level details.
- start (number, mandatory): Start time of the word in seconds.
- end (number, mandatory): End time of the word in seconds.
- text (string, mandatory): The word text.
- confidence (number, optional): Confidence score for the word (0.0 - 1.0).
- word_timing_mode (string, optional): Indicates the completeness of word-level timing data within the segment.
- extensions (object, optional): Any additional information about the segment.
Example
"segments": [
{
"start": 0.0,
"end": 5.0,
"text": "Bonjour tout le monde.",
"speaker_id": "Speaker1",
"confidence": 0.95,
"language": "fr",
"style_id": "Style1",
"word_timing_mode": "complete",
"words": [
{ "start": 0.0, "end": 1.0, "text": "Bonjour" },
{ "start": 1.0, "end": 2.0, "text": "tout" },
{ "start": 2.0, "end": 3.0, "text": "le" },
{ "start": 3.0, "end": 4.0, "text": "monde." }
]
},
{
"start": 5.1,
"end": 10.0,
"text": "Gracias por estar aquí hoy.",
"speaker_id": "Speaker2",
"confidence": 0.93,
"language": "es",
"word_timing_mode": "partial",
"words": [
{ "start": 5.1, "end": 5.5, "text": "Gracias" }
// Remaining words are not included
]
},
{
"start": 10.1,
"end": 15.0,
"text": "Hello everyone, and welcome.",
"speaker_id": "Speaker3",
"confidence": 0.92,
"language": "en",
"word_timing_mode": "none"
// No words array provided
}
]
In this example:
- The first segment has complete word-level data (
word_timing_mode:"complete"). - The second segment has partial word-level data (
word_timing_mode:"partial"). - The third segment has no word-level data (
word_timing_mode:"none"or omitted).
Handling Multiple Languages
-
Global Language Lists:
-
metadata.source.languages(array of strings, optional):- Purpose: Lists the languages detected or expected in the source media.
- Usage: Helps in understanding the linguistic content of the source, which is vital for transcription services, translators, and language processing tools.
-
metadata.languages(array of strings, optional):- Purpose: Lists the languages present in the transcription data.
- Usage: Indicates which languages are included in the STJ file. This list may differ from
metadata.source.languagesif the transcription excludes some source languages or includes translations.
-
Segment-Level Language:
-
Each segment specifies its language using the
languagefield. - Useful for:
- Multilingual Transcriptions: When the transcription includes multiple languages.
- Translations: When segments have been translated into different languages.
Example Scenario: Translated Transcription
Imagine a video where presenters speak in English and Spanish, and the transcription has been translated entirely into French and German.
"metadata": {
"transcriber": {
"name": "YAWT",
"version": "0.4.0"
},
"created_at": "2023-10-20T12:00:00Z",
"version": "0.5.0",
"source": {
"uri": "https://example.com/event.mp4",
"duration": 5400.0,
"languages": ["en", "es"]
},
"languages": ["fr", "de"],
"extensions": { ... }
},
"transcript": {
"segments": [
{
"start": 0.0,
"end": 5.0,
"text": "Bonjour à tous.",
"speaker_id": "Speaker1",
"confidence": 0.95,
"language": "fr"
},
{
"start": 5.1,
"end": 10.0,
"text": "Willkommen alle zusammen.",
"speaker_id": "Speaker2",
"confidence": 0.94,
"language": "de"
}
// More segments...
]
}
In this example:
- The source media languages are English (
"en") and Spanish ("es"). - The transcription languages are French (
"fr") and German ("de"). - Each segment indicates the language of the transcribed text.
Optional vs. Mandatory Fields Summary
- Mandatory Fields:
metadata.transcriber.namemetadata.transcriber.versionmetadata.created_atmetadata.versiontranscript.segments(array)transcript.segments[].starttranscript.segments[].end-
transcript.segments[].text -
Optional Fields:
- All other fields, including
speakers,styles,speaker_id,confidence,language,style_id,words,word_timing_mode, etc.
Field Definitions and Constraints
Time Format Requirements
All time values in the STJ format (start and end fields) must follow these requirements:
Format Specifications
- Must be represented as non-negative decimal numbers
- Must have a precision of up to 3 decimal places (millisecond precision)
- Must not exceed 6 significant digits before the decimal point
- Values must be in the range [0.000, 999999.999]
- Leading zeros before the decimal point are allowed but not required
- Trailing zeros after the decimal point are allowed but not required
- The decimal point must be present if there are decimal places
- Scientific notation is not allowed
Basic Constraints
- For any segment or word:
startmust not be greater thanend- Both
startandendmust be present and valid according to format specifications - For zero-duration items (
startequalsend): - Must include
is_zero_duration:true - For segments:
- Must not contain a
wordsarray - Must not specify a
word_timing_mode
- Must not contain a
- The
is_zero_durationfield: - Must be
trueif and only ifstartequalsend - Must be
falseor omitted for items wherestartdoes not equalend - Must not be included with value
falsewhenstartequalsend
Examples of Valid Time Values
0(zero seconds)0.0(zero seconds)0.000(zero seconds with full precision)1.5(one and a half seconds)10.100(ten and one hundred milliseconds)999999.999(maximum allowed value)
Examples of Invalid Time Values
-1.0(negative values not allowed)1.5e3(scientific notation not allowed)1000000.0(exceeds maximum value)1.2345(exceeds maximum precision)1,5(incorrect decimal separator)
Character Encoding Requirements
Basic Requirements
- Files MUST be encoded in UTF-8
- The UTF-8 Byte Order Mark (BOM) is optional
- JSON string values MUST follow RFC 8259 encoding rules
- The full Unicode character set MUST be supported
String Content Requirements
- All string values MUST:
- Be valid UTF-8 encoded text
- Properly escape control characters (U+0000 through U+001F)
- Properly handle surrogate pairs for supplementary plane characters
- Forward slash (
/) characters MAY be escaped but escaping is not required - Unicode normalization:
- All string values SHOULD be normalized to Unicode Normalization Form C (NFC)
- Applications MUST preserve the normalization form of input text
- Applications MAY normalize text for comparison or search operations
Confidence Scores
Confidence scores are floating-point numbers between 0.0 (no confidence) and 1.0 (full confidence). They are optional but recommended.
Language Codes
ISO 639-1 (two-letter codes) is the primary standard and MUST be used when the language has an ISO 639-1 code.
- Example: Use "en" for English, "fr" for French, "es" for Spanish
ISO 639-3 (three-letter codes) MUST only be used for languages that do not have an ISO 639-1 code.
- Example: Use "yue" for Cantonese (no ISO 639-1 code), but use "zh" for Mandarin Chinese (has ISO 639-1 code)
Consistency Requirements
- A single STJ file MUST NOT mix ISO 639-1 and ISO 639-3 codes for the same language
- All references to a specific language within a file MUST use the same code consistently
- When a language has both ISO 639-1 and ISO 639-3 codes, the ISO 639-1 code MUST be used
Application Requirements
Applications MUST:
- Process both ISO 639-1 and ISO 639-3 codes
- Validate that:
- ISO 639-1 codes are used when available
- ISO 639-3 codes are only used for languages without ISO 639-1 codes
- Language codes are used consistently throughout the file
- Reject files that:
- Use ISO 639-3 codes for languages that have ISO 639-1 codes
- Mix different standards for the same language
- Contain invalid language codes
URI Format Requirements
Purpose
Defines the format and constraints for the uri field in the metadata.source object.
Format Specifications
- Type: String representing a Uniform Resource Identifier (URI) as defined in RFC 3986.
- Allowed Schemes:
- Recommended:
httphttpsfile
- Optional:
- Other schemes (e.g.,
ftp,s3,rtsp) MAY be used if appropriate.
- Other schemes (e.g.,
- Absolute URIs:
- The
uriSHOULD be an absolute URI, including the scheme component. - Examples:
"http://example.com/media/video.mp4""https://example.com/media/audio.mp3""file:///C:/Media/video.mp4"(Windows)"file:///home/user/media/audio.mp3"(Unix-like systems)
- Relative URIs:
- Relative URIs or file paths SHOULD NOT be used.
- If a relative URI is provided, consuming applications MUST resolve it relative to a known base URI.
- Note: Relative URIs can lead to ambiguity and are discouraged.
Validation Rules
- The
uriMUST conform to the syntax defined in RFC 3986. - Implementations SHOULD validate the URI format and report errors if invalid.
- Scheme Support:
- Required Support:
- Implementations MUST support
httpandhttpsschemes.
- Implementations MUST support
- Optional Support:
- Support for other schemes is OPTIONAL and may vary between implementations.
Security Considerations
- Privacy:
- Be cautious when including URIs that may reveal sensitive information, such as local file paths or internal network addresses.
- Consider omitting the
urior sanitizing it if privacy is a concern. - Security Risks:
- Applications consuming STJ files SHOULD handle URIs carefully to avoid security risks such as directory traversal or accessing unauthorized resources.
Examples
- HTTP URI:
json
"uri": "http://example.com/media/video.mp4"
- HTTPS URI:
json
"uri": "https://example.com/media/audio.mp3"
- File URI (Windows Path):
json
"uri": "file:///C:/Media/video.mp4"
- File URI (Unix Path):
json
"uri": "file:///home/user/media/audio.mp3"
- S3 URI (Optional Scheme):
json
"uri": "s3://bucket-name/path/to/object"
Speaker IDs
Format Specifications
- Type: String
- Allowed Characters: Letters (A-Z, a-z), digits (0-9), underscores (_), and hyphens (-).
- Length Constraints:
- Minimum length: 1 character
- Maximum length: 64 characters
- Uniqueness:
- Speaker IDs MUST be unique within the
speakerslist. speaker_idreferences in segments MUST match anidin thespeakerslist.- Case Sensitivity:
- Speaker IDs are case-sensitive;
Speaker1andspeaker1are considered different IDs. - Format Recommendations:
- Use meaningful identifiers when possible, e.g.,
Speaker_JohnDoe. - For anonymous speakers, use generic IDs like
Speaker1,Speaker2, etc.
Representing Anonymous Speakers
- When the Speaker is Unknown or Anonymous:
- Use a consistent placeholder ID, such as
Speaker1,Speaker2, etc. - The
namefield MAY be omitted or set to a placeholder like"Unknown"or"Anonymous". - Consistency:
- Maintain consistent IDs for anonymous speakers throughout the transcript to differentiate between different speakers.
- If speaker diarization is uncertain, it is acceptable to assign the same
speaker_idto multiple segments where the speaker is believed to be the same.
Examples
- Known Speaker:
json
{
"id": "Speaker_JohnDoe",
"name": "John Doe"
}
- Anonymous Speaker:
json
{
"id": "Speaker1",
"name": "Unknown"
}
- Multiple Anonymous Speakers:
json
"speakers": [
{ "id": "Speaker1", "name": "Unknown" },
{ "id": "Speaker2", "name": "Unknown" },
{ "id": "Speaker3", "name": "Unknown" }
]
Validation Rules
- ID Format Validation:
- IDs MUST only contain allowed characters.
- IDs MUST meet the length constraints.
- Uniqueness Validation:
- IDs in the
speakerslist MUST be unique. - Duplicate IDs MUST result in a validation error.
- Reference Validation:
- All
speaker_idreferences in segments MUST match anidin thespeakerslist. - Invalid references MUST result in a validation error.
Implementation Notes
- Applications SHOULD provide meaningful error messages when validation fails due to speaker ID issues.
- When Generating STJ Files:
- Ensure that speaker IDs conform to the specified format requirements.
- Assign consistent IDs to anonymous speakers to maintain differentiation.
Style IDs
If style_id is used, it must match an id in the styles list.
Text Fields
text fields should be in plain text format. Special formatting or markup should be handled via the styles mechanism.
Word Timing Mode Field
Purpose
Indicates the completeness of word-level timing data within the segment.
Allowed Values
"complete": All words in the segment have timing data"partial": Only some words have timing data"none": No word-level timing data is provided
Default Behavior
- When omitted and
wordsarray is present with complete coverage: treated as"complete" - When omitted and
wordsarray is absent: treated as"none" - When omitted and
wordsarray is present but incomplete: invalid - must explicitly specify"partial"
Constraints
- For
"complete": All words must have timing data - For
"partial": Some words must have timing data - For
"none": Must not includewordsarray
Extensions Field Requirements
Purpose
The extensions field allows for the inclusion of custom, application-specific metadata and format-specific properties without affecting compatibility with other implementations.
Structure
- The
extensionsfield, if present, MUST be a JSON object. - Each key in
extensionsMUST represent a namespace and MUST be a non-empty string. - The value corresponding to each namespace MUST be a JSON object containing key-value pairs specific to that namespace.
Namespaces
- Namespace Naming:
- Namespaces SHOULD be concise and reflect the application, format, or organization.
- Examples include
myapp,companyname,customformat. - Reserved Namespaces:
- The following namespaces are RESERVED for future use by the STJ specification and MUST NOT be used for custom data:
stj(reserved for STJ specification extensions)webvtt(reserved for WebVTT format mappings)ttml(reserved for TTML format mappings)ssa(reserved for SSA/ASS format mappings)srt(reserved for SubRip format mappings)dfxp(reserved for DFXP/Timed Text format mappings)smptett(reserved for SMPTE-TT format mappings)
- Developer Guidance:
- Developers who need to include format-specific properties before official definitions are available may:
- Use a custom namespace that clearly indicates its provisional nature, such as
custom_webvttorexperimental_ttml. - Be prepared to migrate their data to the official namespace once the STJ specification provides the definitions.
- Use a custom namespace that clearly indicates its provisional nature, such as
Constraints
- Applications MUST ignore any namespaces in
extensionsthat they do not recognize. - The
extensionsfield SHOULD NOT include essential data required for basic functionality. - Nested objects and arrays ARE ALLOWED within each namespace.
- Keys within namespaces MUST NOT duplicate or conflict with standard fields of the containing object.
- Reserved Namespaces Validation:
- Namespaces listed as RESERVED in the specification MUST NOT be used by applications for custom data.
- Applications MUST report an error if a reserved namespace is used.
Examples
- In a
segmentobject:
json
"extensions": {
"myapp": {
"custom_property": "value",
"analysis_data": {
"sentiment_score": 0.85,
"keywords": ["innovation", "technology"]
}
},
"analytics": {
"emotion": "happy",
"confidence": 0.9
}
}
- In a
styleobject with format-specific properties:
json
{
"id": "caption_style",
"text": {
"color": "#FFFFFF",
"background": "#000000"
},
"display": {
"align": "center",
"vertical": "bottom"
},
"extensions": {
"custom_webvtt": {
"line": "auto",
"position": "50%",
"size": "100%"
},
"myapp": {
"custom_style_property": "value"
}
}
}
- In a
metadataobject:
json
"extensions": {
"project_info": {
"project": "International Conference",
"client": "Global Events Inc."
},
"notes": {
"review_status": "approved",
"reviewer": "John Doe"
}
}
Note: Standard fields defined in the STJ specification MUST NOT be duplicated within any namespace in extensions. For example, including a key "start" within a namespace is prohibited if it conflicts with the mandatory "start" field of the segment.
Implementation Requirements
Time Value Processing
Implementations MUST:
- Parse time values with up to 3 decimal places
- Preserve the precision of input values up to 3 decimal places
- Round any input with more than 3 decimal places to 3 decimal places using IEEE 754 round-to-nearest-even
- Validate all time values according to the Time Format Requirements section
Implementations MUST reject files that contain any of the following:
- Negative time values
- Values exceeding 999999.999 seconds
- Time values using scientific notation
- Overlapping segments
Validation Requirements
Segment-Level Validation
- Required Fields:
startandendtimes MUST conform to the Time Format Requirements sectiontextMUST be present and non-empty- References:
speaker_id, if present, MUST match anidin thespeakersliststyle_id, if present, MUST match anidin thestyleslist- Segment Ordering:
- Segments MUST be ordered by their
starttimes in ascending order - For segments with identical start times, they MUST be ordered by their end times in ascending order
- Segment Overlap:
- Segments MUST NOT overlap in time
- For any two segments S1 and S2 where S1 appears before S2 in the segments array:
- S1.end MUST be less than or equal to S2.start
- Examples of valid segment ordering:
- Adjacent segments: S1(0.0, 1.0), S2(1.0, 2.0)
- Segments with gap: S1(0.0, 1.0), S2(2.0, 3.0)
- Examples of invalid segment ordering:
- Overlapping segments: S1(0.0, 2.0), S2(1.0, 3.0)
- Out of order segments: S1(1.0, 2.0), S2(0.0, 3.0)
- Zero-Duration Segments:
- MUST follow the zero-duration requirements defined in the Time Format Requirements section
- Zero-duration segments MAY share the same timestamp
Word-Level Validation
- When
wordsarray is present: - Each word object must have
text,start, andend - All time values must conform to the Time Format Requirements section
- Word timing constraints:
- Word times must be within the parent segment's time range
- Words must be ordered by
starttime - Word timings must not overlap
- Word Timing Mode Requirements:
- When
"complete"(or omitted with complete coverage):- The concatenation of all
textfields inwordsmust match the segment'stext, except for differences in whitespace or punctuation
- The concatenation of all
- When
"partial":- The
textfields inwordsmust be a subset of the words in the segment'stext, in the same order
- The
- When
"none"(or omitted withoutwordsarray):- Must not contain
wordsarray
- Must not contain
- Zero-Duration Words:
- Must follow the zero-duration requirements defined in the Time Format Requirements section
General Validation
- URI Validation Requirements:
- The
urifield inmetadata.sourceMUST conform to the URI Format Requirements specified in the Field Definitions and Constraints section. - Implementations SHOULD validate the URI format according to RFC 3986.
- Invalid URIs SHOULD result in a validation error or warning.
- Scheme Support:
- Implementations MUST support
httpandhttpsschemes. - Support for other schemes is OPTIONAL.
- Implementations MUST support
-
Relative URIs:
- Relative URIs SHOULD NOT be used.
- If present, they MUST be resolved relative to a known base URI by the consuming application.
-
Language Code Requirements:
- All language codes must be valid ISO 639 codes
-
Language codes must be consistent with the requirements defined in the Language Codes section
-
Confidence Score Requirements:
-
Confidence scores, if present, must be within the range [0.0, 1.0]
-
Reference Requirements:
- Speaker IDs:
- All
speaker_idreferences in segments MUST correspond to valididentries in thespeakerslist. - All
idvalues in thespeakerslist MUST conform to the Speaker ID Requirements specified in the Field Definitions and Constraints section. - All IDs in the
speakersarray MUST be unique. - IDs MUST only contain allowed characters and meet length constraints.
- All
-
Style IDs:
- All
style_idreferences must correspond to valid entries in thestyleslist. - All IDs in the
stylesarray MUST be unique.
- All
-
Character Encoding Requirements:
- All text content must be valid UTF-8
- All JSON string values must follow RFC 8259 encoding rules
- Control characters must be properly escaped
- Surrogate pairs must be properly formed
- BOM must be handled correctly if present
Extensions Field Validation
- Structure Validation:
- The
extensionsfield, if present, MUST be a JSON object. - Namespaces MUST be strings and MUST NOT be empty.
-
Values corresponding to namespaces MUST be JSON objects.
-
Reserved Namespaces Validation:
- Namespaces listed as RESERVED in the specification MUST NOT be used by applications for custom data.
-
Applications MUST report an error if a reserved namespace is used.
-
Content Validation:
- Applications MUST ignore any namespaces or keys within
extensionsthat they do not recognize. -
Values within namespaces MAY be validated based on application-specific requirements.
-
Conflict Resolution:
- If a key within a namespace in
extensionsconflicts with a standard field, the standard field's value MUST take precedence. - Applications MUST report an error if a conflict is detected.
Style Processing
Implementations:
- MAY support none, some, or all style properties
- MUST ignore style properties they don't support
- MUST document which style properties they support
- SHOULD provide reasonable fallback behavior for unsupported properties
When converting STJ to other formats, implementations:
- SHOULD document how STJ style properties map to the target format's capabilities
- MAY omit style properties that cannot be represented in the target format
Advanced styling features (such as animations, karaoke effects, or complex positioning) SHOULD be implemented using format-specific properties within the extensions field under a namespace corresponding to the format.
Representing Confidence
Confidence scores provide an indication of the reliability of the transcribed text. They can be used to:
- Highlight low-confidence segments for manual review.
- Filter out words or segments below a certain confidence threshold.
- Provide visual cues in transcription editors or viewers.
Including both segment-level and word-level confidence scores allows applications to present detailed insights into transcription accuracy.
Comparison with Existing Formats
The STJ format is designed to be a superset of common transcription and subtitle formats, incorporating their features and extending them where necessary.
SRT (SubRip)
- Sequence Numbers: Not used in STJ, as sequence is implied by array order.
- Timestamps: STJ uses precise start and end times in seconds.
- Text: Supported via
textfield. - Styling: Limited in SRT; STJ supports styling via
stylesandstyle_id.
WebVTT
- Text Formatting: Basic formatting supported via core style properties
- Positioning: Basic positioning supported via display properties
- Advanced Features: Can be represented via properties within the
custom_webvttnamespace inextensions
TTML (Timed Text Markup Language)
- Basic Styling: Supported via core style properties
- Advanced Features: Can be represented via properties within the
custom_ttmlnamespace inextensions - Multiple Languages: Supported via segment-level language codes
SSA/ASS
- Basic Styling: Supported via core style properties
- Advanced Features: Can be represented via properties within the
custom_ssanamespace inextensions
Usage in Applications
The STJ format is designed to be easily parsed and utilized by a variety of applications, such as:
- Transcription Editors: Tools can load STJ files to display transcriptions with speaker labels, timestamps, and styling.
- Subtitle Generators: Applications can convert STJ segments into subtitle formats like SRT or WebVTT.
- Speech Analytics: Analyze transcriptions for sentiment, keyword extraction, or topic modeling.
- Quality Assurance: Reviewers can focus on low-confidence segments for correction.
- Multilingual Support: Applications can handle multilingual transcriptions by leveraging per-segment language data.
Extensibility and Customization
- Additional Metadata: Use the
extensionsfields in bothmetadataand individual objects to include custom data without affecting compatibility. - Versioning: Include a
versionfield inmetadataif needed for future format updates. - Custom Fields: Applications can add custom data within appropriately named namespaces in the
extensionsfield to include application-specific data without affecting compatibility.
Adherence to Best Practices
The STJ format follows best practices for data interchange formats, drawing inspiration from established standards like:
- IETF RFC 8259: The STJ format adheres to the JSON standard as specified in RFC 8259.
- ISO 639 Language Codes: Uses standard language codes to ensure consistency.
- Dublin Core Metadata Initiative (DCMI): The metadata fields are designed to align with DCMI principles where applicable.
- Naming Conventions: Field names are concise and use lowercase letters with underscores for readability.
- Extensibility: The format allows for future expansion without breaking existing implementations.
Final Remarks
The STJ format aims to be a comprehensive and flexible standard for transcription data representation. By incorporating features from existing formats and adhering to best practices, it strives to meet the needs of a wide range of applications and facilitate better interoperability in the field of speech transcription and subtitles.
Note: This specification is open for suggestions and improvements. Contributions from the community are welcome to refine and enhance the STJ format.
Contact: For feedback or contributions, please reach out via The STJ Repository.