Best Practices for Using STJ Format
Introduction
The Standard Transcription JSON (STJ) format is designed to be flexible and comprehensive. Following best practices ensures consistency, interoperability, and ease of use across different applications and services.
General Guidelines
Consistent Field Usage
- Mandatory Fields: Always include all mandatory fields as specified in the STJ specification.
- Optional Fields: Use optional fields where appropriate to enhance the richness of the data.
Data Types and Formats
- Timestamps: Represent all time fields (
start,end) in seconds as floating-point numbers. - Language Codes: Use standard ISO 639-1 or ISO 639-3 language codes.
- Creation Timestamp: Use the ISO 8601 format for the
created_atfield (e.g.,"2023-10-20T12:00:00Z").
Naming Conventions
- Field Names: Use lowercase letters with underscores (
_) for multi-word field names. - Identifiers: Use unique strings for
idfields inspeakersandstyles.
Handling Languages
Distinction Between Source and Transcription Languages
- Source Languages (
metadata.source.languages): - Represent the languages present in the original media.
-
Useful for understanding the content and planning transcription or translation processes.
-
Transcription Languages (
metadata.languages): - Represent the languages included in the transcription.
- May differ from source languages if translations are included.
Per-Segment Language Annotation
- Specify Language per Segment:
- Use the
languagefield in each segment to indicate the language of the transcribed text. - Crucial for multilingual transcriptions and translations.
Examples
- Transcription with Translation:
- Source media contains English and Spanish.
- Transcription includes translations into French and German.
metadata.source.languages:["en", "es"]metadata.languages:["fr", "de"]
Transcription Segments
Time Alignment
- Non-Overlapping Segments: Ensure that segments do not overlap in time.
- Sequential Order: Segments should be ordered sequentially based on their
starttimes.
Speaker Identification
- Consistent Speaker IDs: Use consistent
speaker_idvalues throughout the transcript. - Unknown Speakers: If the speaker is unknown, omit the
speaker_idfield or use a placeholder like"speaker_id": "Unknown".
Language Annotation
- Per-Segment Language: Specify the
languagefield for each segment in multilingual transcripts. - Default Language: If the entire transcript is in a single language, specifying
languagein each segment is optional.
Words Array
- Detailed Word Data: Include the
wordsarray in segments when word-level timing or confidence is needed. - Minimal Segments: If word-level details are not required, omit the
wordsarray to reduce file size.
Handling word_timing_mode
- Complete Word Data:
- When
word_timing_modeis"complete", ensure that all words in thetextfield are represented in thewordsarray. - Partial Word Data:
- When
word_timing_modeis"partial", include as many words as have timing data, and ensure they are in the correct order. - No Word Data:
- When there is no word-level timing data, you may omit the
wordsarray or setword_timing_modeto"none".
Styling
- Use of Styles: Define styles in the
stylesarray and reference them in segments usingstyle_id. - Default Styles: If no styling is needed, omit the
stylessection.
Confidence Scores
- Including Confidence: Provide
confidencescores for segments and words when available. - Interpreting Confidence: Use the
confidence_thresholdfrom metadata to determine if segments need review.
Additional Information
- Custom Data: Use the
additional_infofield to include application-specific data. - Namespacing: For custom fields, use a prefix (e.g.,
x_custom_field) to avoid conflicts with future specification updates.
Metadata
Accurate Source Information
- Source URI: Provide the
uriinmetadata.sourceif the source media is accessible. - Duration: Ensure
durationmatches the actual length of the media.
Creation Timestamp
- ISO 8601 Format: Use the ISO 8601 format for the
created_atfield (e.g.,"2023-10-20T12:00:00Z").
Validation
- Schema Validation: Validate STJ files against the provided JSON schema to ensure compliance.
- Automated Testing: Incorporate validation into automated testing workflows.
Interoperability
- Compatibility: When converting to other formats (e.g., SRT, VTT), ensure that essential information is preserved.
- Extensions: Avoid breaking changes when extending the format for specific needs.
Performance Considerations
- File Size: Be mindful of file size, especially when including detailed word-level data.
- Efficient Parsing: Structure data to facilitate efficient parsing and processing by applications.
Security
- Data Sanitization: Sanitize any user-generated content to prevent injection attacks.
- Sensitive Information: Avoid including sensitive personal information in the
additional_infoor other fields unless necessary and secured.
Conclusion
By adhering to these best practices, you can ensure that your use of the STJ format is consistent, reliable, and compatible with a wide range of tools and applications.