Subtitles
Here is a list of pointers for storing subtitles in Matroska:
-
As a general rule of thumb for all codecs, information that is global to an entire stream SHOULD be stored in the CodecPrivate element.
-
As subtitles usually come with start and stop timestamps, or with a start timestamp and a duration, SimpleBlock is usually not used, as it doesn’t allow storing the BlockDuration.
-
Start and stop timestamps that are used in a subtitle’s original storage format SHOULD be removed when being placed in Matroska, as they could interfere if the file is edited afterwards. Instead, the Block’s timestamp and BlockDuration SHOULD be used to say when the subtitle is displayed.
-
Because a “subtitle” stream is actually just an overlay stream, anything with a transparency layer could be used, including video.
Image Subtitles
The first image format targeted for import into Matroska is the VobSub subtitle format. This subtitle type is generated by exporting the subtitles from a DVD [@?DVD-Video].
The requirement for muxing VobSub into Matroska is v7 subtitles (see the first line of the .IDX file). If the version is lower, you must remux them into v7 format using the SubResync utility from VobSub 2.23 (or MPC). Generally, any newly created subs will be in v7 format.
The .IFO file will not be used at all.
If there is more than one subtitle stream in the VobSub set, each stream will need to be separated into its own track for storage in Matroska. For example, if the VobSub file contains streams for both English and German subtitles, the resulting Matroska file SHOULD contain two tracks. That way the language information can be dropped and mapped to Matroska’s language tags.
The .IDX file is reformatted (see below) and placed in the CodecPrivate.
Each .BMP will be stored in its own Block. The timestamp will be stored in the Block timestamp and the duration will be stored in the Default Duration.
Here is an example .IDX file:
# VobSub index file, v7 (do not modify this line!)
#
# To repair desynchronization, you can insert gaps this way:
# (it usually happens after vob id changes)
#
# delay: [sign]hh:mm:ss:ms
#
# Where:
# [sign]: +, - (optional)
# hh: hours (0 <= hh)
# mm/ss: minutes/seconds (0 <= mm/ss <= 59)
# ms: milliseconds (0 <= ms <= 999)
#
# Note: You can't position a sub before the previous with a negative
# value.
#
# You can also modify timestamps or delete a few subs you don't
# like. Just make sure they stay in increasing order.
# Settings
# Original frame size
size: 720x480
# Origin, relative to the upper-left corner, can be overloaded by
# alignment
org: 0, 0
# Image scaling (hor,ver), origin is at the upper-left corner or at
# the alignment coord (x, y)
scale: 100%, 100%
# Alpha blending
alpha: 100%
# Smoothing for very blocky images (use OLD for no filtering)
smooth: OFF
# In millisecs
fadein/out: 50, 50
# Force subtitle placement relative to (org.x, org.y)
align: OFF at LEFT TOP
# For correcting non-progressive desync. (in millisecs or
# hh:mm:ss:ms)
# Note: Not effective in DirectVobSub, use "delay: ... " instead.
time offset: 0
# ON: displays only forced subtitles, OFF: shows everything
forced subs: OFF
# The original palette of the DVD
palette: 000000, 7e7e7e, fbff8b, cb86f1, 7f74b8, e23f06, 0a48ea, \
b3d65a, 6b92f1, 87f087, c02081, f8d0f4, e3c411, 382201, e8840b, \
fdfdfd
# Custom colors (transp idxs and the four colors)
custom colors: OFF, tridx: 0000, colors: 000000, 000000, 000000, \
000000
# Language index in use
langidx: 0
# English
id: en, index: 0
# Uncomment next line to activate alternative name in DirectVobSub /
# Windows Media Player 6.x
# alt: English
# Vob/Cell ID: 1, 1 (PTS: 0)
timestamp: 00:00:01:101, filepos: 000000000
timestamp: 00:00:08:708, filepos: 000001000
First, lines beginning with “#” are removed. These are comments to make text file editing easier, and as this is not a text file, they aren’t needed.
Next remove the “langidx” and “id” lines. These are used to differentiate the subtitle streams and define the language. As the streams will be stored separately anyway, there is no need to differentiate them here. Also, the language setting will be stored in the Matroska tags, so there is no need to store it here.
Finally, the “timestamp” will be used to set the Block’s timestamp. Once it is set there, there is no need for it to be stored here. Also, as it may interfere if the file is edited, it SHOULD NOT be stored here.
Once all of these items are removed, the data to store in the CodecPrivate SHOULD look like this:
size: 720x480
org: 0, 0
scale: 100%, 100%
alpha: 100%
smooth: OFF
fadein/out: 50, 50
align: OFF at LEFT TOP
time offset: 0
forced subs: OFF
palette: 000000, 7e7e7e, fbff8b, cb86f1, 7f74b8, e23f06, 0a48ea, \
b3d65a, 6b92f1, 87f087, c02081, f8d0f4, e3c411, 382201, e8840b, \
fdfdfd
custom colors: OFF, tridx: 0000, colors: 000000, 000000, 000000, \
000000
There SHOULD also be two Blocks containing one image each with the timestamps “00:00:01:101” and “00:00:08:708”.
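As a non-normative illustration, the reformatting steps above could be sketched in Python as follows. The function name convert_idx is hypothetical, the input is assumed to be the .IDX text of a single subtitle stream, and Block timestamps are returned in milliseconds purely for readability.

```python
import re

# Line prefixes that are dropped when building the CodecPrivate data.
DROPPED_PREFIXES = ("#", "langidx:", "id:", "timestamp:")

# "timestamp: hh:mm:ss:ms, filepos: ..." as written in the example file.
TIMESTAMP_RE = re.compile(r"timestamp:\s*(\d+):(\d{2}):(\d{2}):(\d{3})")


def convert_idx(idx_text):
    """Return (CodecPrivate text, list of Block timestamps in ms)."""
    private_lines = []
    block_times_ms = []
    for raw in idx_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = TIMESTAMP_RE.match(line)
        if match:
            # The timestamp moves to the Block; it is not kept in CodecPrivate.
            h, m, s, ms = (int(g) for g in match.groups())
            block_times_ms.append(((h * 60 + m) * 60 + s) * 1000 + ms)
            continue
        if line.startswith(DROPPED_PREFIXES):
            # Comments plus the "langidx" and "id" lines are removed.
            continue
        private_lines.append(line)
    return "\n".join(private_lines), block_times_ms
```

Applied to the example .IDX file, this yields the settings lines shown above and the two Block timestamps 1101 ms and 8708 ms.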
SRT Subtitles
SRT is perhaps the most basic of all subtitle formats.
It consists of four parts, all in text:
-
A number indicating which subtitle it is in the sequence.
-
The time that the subtitle appears on the screen, and then disappears.
-
The subtitle itself.
-
A blank line indicating the start of a new subtitle.
When placing SRT in Matroska, part 3 is converted to UTF-8 (S_TEXT/UTF8) and placed in the data portion of the Block. Part 2 is used to set the timestamp of the Block and the BlockDuration element. Nothing else is used.
Here is an example SRT file:
1
00:02:17,440 --> 00:02:20,375
Senator, we're making
our final approach into Coruscant.

2
00:02:20,476 --> 00:02:22,501
Very good, Lieutenant.
In this example, the text “Senator, we’re making our final approach into Coruscant.” would be converted into UTF-8 and placed in the Block. The timestamp of the Block would be set to “00:02:17,440”, and the BlockDuration element would be set to “00:00:02,935”.
The same is repeated for the next subtitle.
Because there are no general settings for SRT, the CodecPrivate is left blank.
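The mapping described above can be illustrated with a short Python sketch. The function name srt_entry_to_block is hypothetical, times are expressed in milliseconds for readability, and joining multi-line text with a line feed is an assumption of this sketch rather than something stated above.

```python
import re

# "hh:mm:ss,mmm --> hh:mm:ss,mmm" (part 2 of an SRT entry).
SRT_TIME_RE = re.compile(
    r"(\d+):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2}),(\d{3})"
)


def srt_entry_to_block(entry):
    """Return (timestamp_ms, duration_ms, utf8_payload) for one SRT entry."""
    lines = entry.strip().splitlines()
    # lines[0] is the sequence number (part 1); it is not stored.
    match = SRT_TIME_RE.match(lines[1].strip())
    if match is None:
        raise ValueError("not an SRT timing line: %r" % lines[1])
    h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
    start_ms = ((h1 * 60 + m1) * 60 + s1) * 1000 + ms1
    end_ms = ((h2 * 60 + m2) * 60 + s2) * 1000 + ms2
    # Part 3, the subtitle text, becomes the Block data as UTF-8.
    payload = "\n".join(lines[2:]).encode("utf-8")
    return start_ms, end_ms - start_ms, payload
```

For the first entry of the example, this gives a Block timestamp of 137440 ms and a BlockDuration of 2935 ms, matching “00:02:17,440” and “00:00:02,935”.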
SSA/ASS Subtitles
SSA stands for Sub Station Alpha. It’s the file format used by the popular subtitle editor Sub Station Alpha. It supports some advanced display features, such as positioning, karaoke, and style management.
For detailed information on SSA/ASS, see the SSA specs [@!SSA]. It includes an SSA specs description and the advanced features added by ASS format (standing for Advanced SSA). Because SSA and ASS are so similar, they are treated the same here.
Like SRT, this format is text based with a particular syntax.
A file consists of 4 or 5 parts, declared à la INI file (but it’s not an INI file!).
The first, “[Script Info]”, contains some information about the subtitle file, such as its title, who created it, the type of script, and a very important one: “PlayResY”. Be careful with this value, as everything in your script (font size, positioning) is scaled by it. Sub Station Alpha uses your desktop’s Y resolution to write this value, so if a friend with a large monitor and a high screen resolution gives you an edited script, you can mess everything up by saving the script in SSA on your low-resolution monitor.
The second, “[V4 Styles]” or “[V4+ Styles]”, is a list of style definitions. A style describes how a text will look on the screen. It defines font, font size, primary/…/outline colour, position, alignment, etc.
For example:
Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, \
TertiaryColour, BackColour, Bold, Italic, BorderStyle, Outline, \
Shadow, Alignment, MarginL, MarginR, MarginV, AlphaLevel, Encoding
Style: Wolf main,Wolf_Rain,56,15724527,15724527,15724527,4144959,0,\
0,1,1,2,2,5,5,30,0,0
The third, “[Events]”, is the list of text you want to display at the right timing. You can specify some attributes here, such as the style to use for this event (it MUST be defined in the list), the position of the text (Left, Right, Vertical Margin), and an effect. Name is mostly used by translators to know who said this sentence. Timing is in h:mm:ss.cc (centiseconds).
Format: Marked, Start, End, Style, Name, MarginL, MarginR, MarginV, \
Effect, Text
Dialogue: Marked=0,0:02:40.65,0:02:41.79,Wolf main,Cher,0000,0000,\
0000,,Et les enregistrements de ses ondes delta ?
Dialogue: Marked=0,0:02:42.42,0:02:44.15,Wolf main,autre,0000,0000,\
0000,,Toujours rien.
A “[Pictures]” or “[Fonts]” part can be found in some SSA files. They contain UUE-encoded pictures/fonts, but those features are only used by Sub Station Alpha; i.e., no filter (VobSub/Avery Lee’s Subtitler filter) uses them.
Now, how are they stored in Matroska?
-
All text is converted to UTF-8.
-
All the headers, “[Script Info]” and the “[V4 Styles]”/“[V4+ Styles]” list, are stored in CodecPrivate.
-
The Start and End fields are used to set the Block’s timestamp and the BlockDuration element.
-
The data stored in the Block is the Event, with its fields in this order: ReadOrder, Layer, Style, Name, MarginL, MarginR, MarginV, Effect, Text. (Layer comes from the ASS specs; it is empty for SSA.) The ReadOrder field is needed for the decoder to be able to reorder the streamed samples as they were placed originally in the file.
Here is an example of an SSA file.
[Script Info]
; This is a Sub Station Alpha v4 script.
Title: Wolf's rain 2
Original Script: Anime-spirit Ishin-francais
Original Translation: Coolman
Original Editing: Spikewolfwood
Original Timing: Lord_alucard
Original Script Checking: Spikewolfwood
ScriptType: v4.00
Collisions: Normal
PlayResY: 1024
PlayDepth: 0
Wav: 0, 128697,D:\Alex\Anime\- Fansub -\- TAFF -\WR_-_02_Wav.wav
Wav: 0, 120692,H:\team truc\WR_-_02.wav
Wav: 0, 116504,E:\sub\wolf's_rain\WOLF'S RAIN 02.wav
LastWav: 3
Timer: 100,0000
[V4 Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, \
TertiaryColour, BackColour, Bold, Italic, BorderStyle, Outline, \
Shadow, Alignment, MarginL, MarginR, MarginV, AlphaLevel, Encoding
Style: Default,Arial,20,65535,65535,65535,-2147483640,-1,0,1,3,0,2,\
30,30,30,0,0
Style: Titre_episode,Akbar,140,15724527,65535,65535,986895,-1,0,1,1,\
0,3,30,30,30,0,0
Style: Wolf main,Wolf_Rain,56,15724527,15724527,15724527,4144959,0,\
0,1,1,2,2,5,5,30,0,0
[Events]
Format: Marked, Start, End, Style, Name, MarginL, MarginR, MarginV, \
Effect, Text
Dialogue: Marked=0,0:02:40.65,0:02:41.79,Wolf main,Cher,0000,0000,\
0000,,Et les enregistrements de ses ondes delta ?
Dialogue: Marked=0,0:02:42.42,0:02:44.15,Wolf main,autre,0000,0000,\
0000,,Toujours rien.
Here is what would be placed into the CodecPrivate element.
[Script Info]
; This is a Sub Station Alpha v4 script.
Title: Wolf's rain 2
Original Script: Anime-spirit Ishin-francais
Original Translation: Coolman
Original Editing: Spikewolfwood
Original Timing: Lord_alucard
Original Script Checking: Spikewolfwood
ScriptType: v4.00
Collisions: Normal
PlayResY: 1024
PlayDepth: 0
Wav: 0, 128697,D:\Alex\Anime\- Fansub -\- TAFF -\WR_-_02_Wav.wav
Wav: 0, 120692,H:\team truc\WR_-_02.wav
Wav: 0, 116504,E:\sub\wolf's_rain\WOLF'S RAIN 02.wav
LastWav: 3
Timer: 100,0000
[V4 Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, \
TertiaryColour, BackColour, Bold, Italic, BorderStyle, Outline, \
Shadow, Alignment, MarginL, MarginR, MarginV, AlphaLevel, Encoding
Style: Default,Arial,20,65535,65535,65535,-2147483640,-1,0,1,3,0,2,\
30,30,30,0,0
Style: Titre_episode,Akbar,140,15724527,65535,65535,986895,-1,0,1,1,\
0,3,30,30,30,0,0
Style: Wolf main,Wolf_Rain,56,15724527,15724527,15724527,4144959,0,\
0,1,1,2,2,5,5,30,0,0
And here are the two Blocks that would be generated.
Block’s timestamp: 00:02:40.650
BlockDuration: 00:00:01.140
1,,Wolf main,Cher,0000,0000,0000,,Et les enregistrements de ses \
ondes delta ?
Block’s timestamp: 00:02:42.420
BlockDuration: 00:00:01.730
2,,Wolf main,autre,0000,0000,0000,,Toujours rien.
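The conversion of one “Dialogue” line into a Block can be sketched in Python as shown below. This is a minimal illustration that assumes the SSA v4 field order from the example (“Marked” first, no Layer field); the function names and the choice of milliseconds are hypothetical, and the ReadOrder value is supplied by the caller.

```python
def ssa_time_to_ms(value):
    """Convert an SSA 'h:mm:ss.cc' time (centiseconds) to milliseconds."""
    hours, minutes, seconds = value.strip().split(":")
    whole, centis = seconds.split(".")
    total_seconds = (int(hours) * 60 + int(minutes)) * 60 + int(whole)
    return total_seconds * 1000 + int(centis) * 10


def ssa_dialogue_to_block(line, read_order):
    """Return (timestamp_ms, duration_ms, utf8_payload) for a Dialogue line."""
    prefix, _, rest = line.partition(":")
    if prefix.strip() != "Dialogue":
        raise ValueError("not a Dialogue line")
    # Marked, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
    # Text comes last and may contain commas, so split at most 9 times.
    fields = rest.strip().split(",", 9)
    if len(fields) != 10:
        raise ValueError("unexpected number of fields")
    _marked, start, end = fields[0:3]
    style, name, margin_l, margin_r, margin_v, effect, text = fields[3:]
    layer = ""  # Layer only exists in ASS; it is left empty for SSA.
    payload = ",".join([str(read_order), layer, style, name,
                        margin_l, margin_r, margin_v, effect, text])
    start_ms = ssa_time_to_ms(start)
    return start_ms, ssa_time_to_ms(end) - start_ms, payload.encode("utf-8")
```

Called on the first Dialogue line of the example with a ReadOrder of 1, this produces a timestamp of 160650 ms, a duration of 1140 ms, and the payload shown in the first Block above.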
WebVTT
The “Web Video Text Tracks Format” (short: WebVTT) is developed by the World Wide Web Consortium (W3C). Its specifications are freely available at [@!WebVTT].
The guiding principles for the storage of WebVTT in Matroska are:
-
Consistency: store data in a similar way to other subtitle codecs
-
Simplicity: making decoding and remuxing as easy as possible for existing infrastructures
-
Completeness: keeping as much data as possible from the original WebVTT file
Track Parameters
The CodecID to use is S_TEXT/WEBVTT.
The CodecPrivate contains all global blocks before the first subtitle entry. This starts at the “WEBVTT” file identification marker but excludes the optional byte order mark.
Storage of non-global WebVTT blocks
Non-global WebVTT blocks (e.g., “NOTE”) before a WebVTT Cue Text are stored in Matroska’s BlockAddition element together with the Matroska Block containing the WebVTT Cue Text these blocks precede (see below for the actual format).
Storage of Cues in Matroska blocks
Each WebVTT Cue Text is stored directly in the Matroska Block.
A muxer MUST change all WebVTT Cue Timestamps present within the Cue Text to be relative to the Matroska Block’s timestamp.
The Cue’s start timestamp is used as the Matroska Block’s timestamp.
The difference between the Cue’s end timestamp and its start timestamp is used as the Matroska BlockDuration.
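The rewriting of Cue Timestamps inside the Cue Text could look like the following Python sketch. The function names are hypothetical, times are handled in milliseconds, and the sketch always writes the relative timestamp in the long hh:mm:ss.ttt form, which is a choice of this example.

```python
import re

# A timestamp tag inside cue text: <mm:ss.ttt> or <hh:mm:ss.ttt>.
CUE_TIMESTAMP_RE = re.compile(r"<(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d{3})>")


def format_timestamp(ms):
    """Format milliseconds as hh:mm:ss.ttt."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return "%02d:%02d:%02d.%03d" % (hours, minutes, seconds, millis)


def make_cue_timestamps_relative(cue_text, cue_start_ms):
    """Rewrite absolute timestamps in cue text relative to the cue start."""
    def rewrite(match):
        hours, minutes, seconds, millis = match.groups()
        absolute_ms = ((int(hours or 0) * 60 + int(minutes)) * 60
                       + int(seconds)) * 1000 + int(millis)
        relative_ms = absolute_ms - cue_start_ms
        if relative_ms < 0:
            raise ValueError("cue timestamp precedes the cue start")
        return "<" + format_timestamp(relative_ms) + ">"

    return CUE_TIMESTAMP_RE.sub(rewrite, cue_text)
```

For a Cue starting at 00:03:10.000 (190000 ms), an inline timestamp of <00:03:15.000> becomes <00:00:05.000>, while other tags such as <b> are left untouched.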
BlockAdditions: storing non-global WebVTT blocks, Cue Settings Lists and Cue identifiers
Each Matroska Block may be accompanied by one BlockAdditions element. Its format is as follows:
-
The first line contains the WebVTT Cue Text’s optional Cue Settings List followed by one line feed character (U+000A). The Cue Settings List may be empty, in which case the line consists of the line feed character only.
-
The second line contains the WebVTT Cue Text’s optional Cue Identifier followed by one line feed character (U+000A). The line may be empty, indicating that there was no Cue Identifier in the source file, in which case the line consists of the line feed character only.
-
The third and all following lines contain all WebVTT Comment Blocks that precede the current WebVTT Cue Block. These may be absent.
If there is no Matroska BlockAddition element stored together with the Matroska Block, then all three components (Cue Settings List, Cue Identifier, Cue Comments) MUST be assumed to be absent.
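The three-part layout described above can be sketched in Python as follows. The function names are hypothetical, the content is handled as text rather than as encoded bytes, and any preceding Comment Blocks are assumed to have already been joined into a single string by the caller.

```python
def build_block_addition(settings="", identifier="", comments=""):
    """Build the BlockAddition content, or None if nothing needs storing."""
    if not (settings or identifier or comments):
        # All three components absent: the BlockAddition can be omitted.
        return None
    # Line 1: Cue Settings List + LF; line 2: Cue Identifier + LF;
    # everything after that: the preceding Comment Blocks (may be empty).
    return settings + "\n" + identifier + "\n" + comments


def parse_block_addition(data):
    """Return (settings, identifier, comments) from BlockAddition content."""
    if data is None:
        # No BlockAddition: all three components are assumed absent.
        return "", "", ""
    parts = data.split("\n", 2)
    parts += [""] * (3 - len(parts))
    return tuple(parts)
```

For example, a Cue with only the identifier “hello” produces content that starts with an empty line, and a Cue with only a Cue Settings List produces content that ends with an empty line.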
Examples of transformation
Here is an example of how a WebVTT file is transformed.
Example WebVTT file
Let’s take the following example file:
WEBVTT with text after the signature

STYLE
::cue {
background-image: linear-gradient(to bottom, dimgray, lightgray);
color: papayawhip;
}
/* Style blocks cannot use blank lines nor "dash dash greater \
than" */

NOTE comment blocks can be used between style blocks.

STYLE
::cue(b) {
color: peachpuff;
}

REGION
id:bill
width:40%
lines:3
regionanchor:0%,100%
viewportanchor:10%,90%
scroll:up

NOTE
Notes always span a whole block and can cover multiple
lines. Like this one.
An empty line ends the block.

hello
00:00:00.000 --> 00:00:10.000
Example entry 1: Hello <b>world</b>.

NOTE style blocks cannot appear after the first cue.

00:00:25.000 --> 00:00:35.000
Example entry 2: Another entry.
This one has multiple lines.

00:01:03.000 --> 00:01:06.500 position:90% align:right size:35%
Example entry 3: That stuff to the right of the timestamps are cue \
settings.

00:03:10.000 --> 00:03:20.000
Example entry 4: Entries can even include timestamps.
For example:<00:03:15.000>This becomes visible five seconds
after the first part.
Example of CodecPrivate
The resulting CodecPrivate
element will look like this:
WEBVTT with text after the signature

STYLE
::cue {
background-image: linear-gradient(to bottom, dimgray, lightgray);
color: papayawhip;
}
/* Style blocks cannot use blank lines nor "dash dash greater \
than" */

NOTE comment blocks can be used between style blocks.

STYLE
::cue(b) {
color: peachpuff;
}

REGION
id:bill
width:40%
lines:3
regionanchor:0%,100%
viewportanchor:10%,90%
scroll:up

NOTE
Notes always span a whole block and can cover multiple
lines. Like this one.
An empty line ends the block.
Storage of Cue 1
Example Cue 1: timestamp 00:00:00.000, duration 00:00:10.000, Block’s content:
Example entry 1: Hello <b>world</b>.
BlockAddition’s content starts with one empty line as there’s no Cue Settings List:
hello
Storage of Cue 2
Example Cue 2: timestamp 00:00:25.000, duration 00:00:10.000, Block’s content:
Example entry 2: Another entry.
This one has multiple lines.
BlockAddition’s content starts with two empty lines as there’s neither a Cue Settings List nor a Cue Identifier:
NOTE style blocks cannot appear after the first cue.
Storage of Cue 3
Example Cue 3: timestamp 00:01:03.000, duration 00:00:03.500, Block’s content:
Example entry 3: That stuff to the right of the timestamps are cue \
settings.
BlockAddition’s content ends with an empty line as there’s no Cue Identifier and there were no WebVTT Comment blocks:
position:90% align:right size:35%
Storage of Cue 4
Example Cue 4: timestamp 00:03:10.000, duration 00:00:10.000, Block’s content:
Example entry 4: Entries can even include timestamps.
For example:<00:00:05.000>This becomes visible five seconds
after the first part.
This Block does not need a BlockAddition as the Cue did not contain an Identifier, nor a Settings List, and it wasn’t preceded by Comment blocks.
Storage of WebVTT in Matroska vs. WebM
Note: the storage of WebVTT in Matroska is not the same as the design document for storage of WebVTT in WebM [@?WebM-WebVTT]. There are several reasons for this, including but not limited to: the WebM document is old (from February 2012), was based on an earlier draft of WebVTT, and ignores several parts that were added to WebVTT later; WebM still does not support subtitles at all [@?WebMContainer]; and the proposal suggests splitting the information across multiple tracks, making life very difficult for demuxers and remuxers.
WebM uses the “D_WEBVTT/SUBTITLES”, “D_WEBVTT/CAPTIONS”, “D_WEBVTT/DESCRIPTIONS”, and “D_WEBVTT/METADATA” CodecIDs, with different tracks depending on the data type and without a CodecPrivate.
HDMV Presentation Graphics Subtitles
The specifications for the HDMV Presentation Graphics Subtitle format (short: HDMV PGS) can be found in the document “Blu-ray Disc Read-Only Format; Part 3 — Audio Visual Basic Specifications” in section 9.14 “HDMV graphics streams”.
Track Parameters
The CodecID to use is S_HDMV/PGS. A CodecPrivate element is not used.
Matroska Blocks
Each HDMV PGS Segment (short: Segment) will be stored in a Matroska Block. A Segment is the data structure described in section 9.14.2.1 “Segment coding structure and parameters” of the Blu-ray specifications.
Each Segment contains a presentation timestamp. This timestamp will be used as the timestamp for the Matroska Block.
A Segment is normally shown until a subsequent Segment is encountered. Therefore, the Matroska Block MAY have no Duration. In that case, a player MUST display a Segment within a Matroska Block until the next Segment is encountered.
A muxer MAY use a Duration, e.g., by calculating the distance between two subsequent Segments.
If a Matroska Block has a Duration, a player MUST display that Segment only for the duration of the BlockDuration.
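The display rule above can be illustrated with a small Python sketch. It follows the rule exactly as stated here and does not model the internal structure of PGS Segments; the function name and the representation of Blocks as (timestamp, duration) pairs are assumptions of this example.

```python
def pgs_display_intervals(blocks):
    """Compute (start, end) display times for Blocks sorted by timestamp.

    Each entry of `blocks` is (timestamp, duration), where duration is
    None when the Matroska Block carries no Duration. An end of None
    means the Segment stays up until a later Segment is encountered.
    """
    intervals = []
    for index, (timestamp, duration) in enumerate(blocks):
        if duration is not None:
            # A Duration is present: show the Segment for exactly that long.
            end = timestamp + duration
        elif index + 1 < len(blocks):
            # No Duration: show the Segment until the next one starts.
            end = blocks[index + 1][0]
        else:
            end = None
        intervals.append((timestamp, end))
    return intervals
```

For instance, three Blocks at 0, 1000, and 4000 (in any consistent time unit), where only the second has a Duration of 500, are displayed over 0 to 1000, 1000 to 1500, and from 4000 onward.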
HDMV Text Subtitles
The specifications for the HDMV Text Subtitle format (short: HDMV TextST) can be found in the document “Blu-ray Disc Read-Only Format; Part 3 — Audio Visual Basic Specifications” in section 9.15 “HDMV text subtitle streams”.
Track Parameters
The CodecID to use is S_HDMV/TEXTST.
A CodecPrivate element is required. It MUST contain the stream’s Dialog Style Segment as described in section 9.15.4.2 “Dialog Style Segment” of the Blu-ray specifications.
Matroska Blocks
Each HDMV Dialog Presentation Segment (short: Segment) will be stored in a Matroska Block. A Segment is the data structure described in section 9.15.4.3 “Dialog presentation segment” of the Blu-ray specifications.
Each Segment contains a start and an end presentation timestamp (short: start PTS & end PTS). The start PTS will be used as the timestamp for the Matroska Block. The Matroska Block MUST have a Duration, and that Duration is the difference between the end PTS and the start PTS.
A player MUST use the Matroska Block’s timestamp and BlockDuration instead of the Segment’s start and end PTS for determining when and how long to show the Segment.
Character set
When TextST subtitles are stored inside Matroska, the only allowed character set is UTF-8.
Each HDMV text subtitle stream in a Blu-ray can use one of a handful of character sets. This information is not stored in the MPEG2 Transport Stream itself but in the accompanying Clip Information file.
Therefore, a muxer MUST parse the accompanying Clip Information file. If the information indicates a character set other than UTF-8, it MUST re-encode all text Dialog Presentation Segments from the indicated character set to UTF-8 prior to storing them in Matroska.
Digital Video Broadcasting (DVB) subtitles
The specifications for the Digital Video Broadcasting subtitle bitstream format (short: DVB subtitles) can be found in the [@!ETSI.EN300-743] document. The storage of DVB subtitles in MPEG transport streams is specified in the [@!ETSI.EN300-468] document.
Track Parameters
The CodecID to use is S_DVBSUB.
The CodecPrivate element is five bytes long and has the following structure:
-
2 bytes: composition page ID (bit string, left bit first)
-
2 bytes: ancillary page ID (bit string, left bit first)
-
1 byte: subtitling type (bit string, left bit first)
The semantics of these bytes are the same as the ones described in section 6.2.41 “Subtitling descriptor” of [@!ETSI.EN300-468].
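As an illustration, the five-byte layout above could be written and read in Python as follows. The function names are hypothetical, and “bit string, left bit first” is interpreted here as most-significant-bit-first, i.e., big-endian byte order for the two-byte fields.

```python
import struct

# 2 bytes composition page ID, 2 bytes ancillary page ID,
# 1 byte subtitling type; big-endian, no padding.
DVB_PRIVATE_FORMAT = ">HHB"


def pack_dvb_codec_private(composition_page_id, ancillary_page_id,
                           subtitling_type):
    """Build the five-byte CodecPrivate for an S_DVBSUB track."""
    return struct.pack(DVB_PRIVATE_FORMAT, composition_page_id,
                       ancillary_page_id, subtitling_type)


def unpack_dvb_codec_private(data):
    """Return (composition_page_id, ancillary_page_id, subtitling_type)."""
    if len(data) != struct.calcsize(DVB_PRIVATE_FORMAT):
        raise ValueError("DVB CodecPrivate must be exactly five bytes")
    return struct.unpack(DVB_PRIVATE_FORMAT, data)
```

With the illustrative values 1, 2, and 0x10, the packed result is the byte sequence 00 01 00 02 10.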
Matroska Blocks
Each Matroska Block consists of one or more DVB Subtitle Segments as described in section 7.2 “Syntax and semantics of the subtitling segment” of [@!ETSI.EN300-743].
Each Matroska Block SHOULD have a Duration indicating how long the DVB Subtitle Segments in that Block SHOULD be displayed.
ARIB (ISDB) subtitles
The specifications for the ARIB B-24 subtitle bitstream format (short: ARIB subtitles) and its storage in MPEG transport streams can be found in the documents [@!ARIB.STD-B24], [@!ARIB.STD-B10], and [@!ARIB.TR-B14].
Track Parameters
The CodecID to use is S_ARIBSUB.
The CodecPrivate element is three bytes long and has the following structure:
-
1 byte: component tag (bit string, left bit first)
-
2 bytes: data component ID (bit string, left bit first)
The semantics of the component tag are the same as those described in [@!ARIB.STD-B10], part 2, Annex J. The semantics of the data component ID are the same as those described in [@!ARIB.TR-B14], fascicle 2, Vol. 3, Section 2, 4.2.8.1.
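A minimal Python sketch of this three-byte layout is given below. The function names are hypothetical, the two-byte field is assumed to be stored in big-endian order (following the “left bit first” wording), and the values used in the usage note are purely illustrative.

```python
import struct

# 1 byte component tag, 2 bytes data component ID; big-endian, no padding.
ARIB_PRIVATE_FORMAT = ">BH"


def pack_arib_codec_private(component_tag, data_component_id):
    """Build the three-byte CodecPrivate for an S_ARIBSUB track."""
    return struct.pack(ARIB_PRIVATE_FORMAT, component_tag, data_component_id)


def unpack_arib_codec_private(data):
    """Return (component_tag, data_component_id)."""
    if len(data) != struct.calcsize(ARIB_PRIVATE_FORMAT):
        raise ValueError("ARIB CodecPrivate must be exactly three bytes")
    return struct.unpack(ARIB_PRIVATE_FORMAT, data)
```

With an illustrative component tag of 0x30 and data component ID of 0x0008, the packed result is the byte sequence 30 00 08.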
Matroska Blocks
Each Matroska Block consists of a single synchronized PES data structure as described in chapter 5 “Independent PES transmission protocol” of [@!ARIB.STD-B24], volume 3, with a Synchronized_PES_data_byte block containing one or more ISDB Caption Data Groups as described in chapter 9 “Transmission of caption and superimpose” of [@!ARIB.STD-B24], volume 1, part 3. All of the Caption Statement Data Groups in a given Matroska Track MUST use the same language index.
A Data Group is normally shown until a subsequent Group provides instructions to clear it. Therefore, the Matroska Block SHOULD NOT have a Duration. A player SHOULD display a Data Group within a Matroska Block until its internal duration elapses, or until a subsequent Data Group removes it.