MPEG is an extensible data format for audio/video streams. MPEG defines a format for compressed video, compressed audio, and a way of packing these streams into a single file, e.g. for transport on a DVD-ROM disc. MPEG also allows so-called "private" extensions (streams whose format is not covered by the MPEG specification) to be included with the audio and video.
DVD (actually, DVD-Video) is a system specification for a disc containing media, intended for playback on set-top players or personal computers. DVD-Video builds on the DVD-ROM physical disc format, the MPEG formats for audio, video and multiplexing, and other formats, such as L-PCM, AC3 and DTS audio, as well as system-specific formats for navigation information, subtitles and so on.
This document explains the format of the media files on a DVD disc. These files have the file extension .VOB.
All media begins life as so-called "elementary streams". This is simply a way of denoting a file containing a single type of AV data. For example, video is one elementary stream. If a DVD disc contains 3 audio streams, each stream is itself an elementary stream. Each subtitle stream is also a separate elementary stream. Note that DVD discs containing multiple angles don't actually have multiple video streams in the typical MPEG sense - more on that later.
Elementary streams are sometimes captured together. For example, if you're authoring a DVD from a home-video filmed with a camcorder, your camcorder will have captured both the video and audio streams at the same time. This data was probably captured into a single .AVI file on your hard disc, which contains both audio and video data. However, when the DVD authoring tool creates the actual data to place on the DVD disc, it must extract the audio and video from the .AVI file separately, process them a little, then recombine them.
The authoring process for complex DVDs, such as Hollywood motion pictures, is often far more involved. The video stream will be edited together from a number of cameras, while the audio tracks will be prepared separately, from separately taped audio, dubbing, and further editing. Translations will be recorded at a later time (naturally, using the original video as a reference). Subtitles will be created to match the final video cut. All of these streams tend to reside in different files - these files are the elementary streams themselves. So, a DVD authoring tool must read a number of files and merge them together to form the data written to the DVD disc.
An access unit is a segment of an elementary stream that represents a small logical unit of data. Often, an access unit can be used directly without reference to other parts of the elementary stream. For example, a single frame of video may be an access unit, or a section of an AC3 file representing a 32ms period of time, when decoded.
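The 32ms figure for AC3 can be checked with a little arithmetic. The sketch below assumes two values that are not stated above: an AC3 frame carrying 1536 samples per channel, and a 48 kHz sample rate. The function name is purely illustrative.

```python
# Assumptions (not taken from the text above): an AC3 frame holds
# 1536 samples per channel, and the audio is sampled at 48 kHz.
AC3_SAMPLES_PER_FRAME = 1536
SAMPLE_RATE_HZ = 48_000


def access_unit_duration_ms(samples_per_unit: int, sample_rate_hz: int) -> float:
    """Playback time, in milliseconds, covered by one audio access unit."""
    return 1000.0 * samples_per_unit / sample_rate_hz


ac3_frame_ms = access_unit_duration_ms(AC3_SAMPLES_PER_FRAME, SAMPLE_RATE_HZ)
print(ac3_frame_ms)
```

Under those assumptions each AC3 access unit covers 32 milliseconds of audio.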
Access units are important primarily because they allow random access to any part of the AV stream - the user of a DVD player can skip to any chapter they feel like and begin playback, or instruct their DVD player to begin playback at a certain time into the movie.
The DVD-Video specification limits the size of access units to a certain upper bound. In turn, this allows set-top DVD players to calculate the minimum amount of RAM they must have to buffer DVD data from the disc whilst decoding it. This is important because embedded players usually contain the bare minimum amount of RAM possible, due to the cost-sensitive nature of consumer electronics.
Typically, an elementary stream contains just the raw data in an appropriate format. For example, a linear PCM (L-PCM) audio elementary stream would contain just the actual audio samples, with no headers. Any timing information (e.g. the exact time at which any given audio sample should be played, or presented, to the user) is implicit from the format and parameters of the stream - each sample is to be presented at a constant time offset from the previous sample.
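To make the idea of implicit timing concrete, here is a minimal sketch of how the presentation time of any sample in a headerless L-PCM stream could be derived from its position alone. The parameters used in the example (48 kHz, two channels, 16-bit interleaved samples) are assumptions chosen for illustration, and the function names are hypothetical.

```python
from fractions import Fraction


def sample_presentation_time(sample_index: int, sample_rate_hz: int) -> Fraction:
    """Seconds from the start of the stream at which a sample is presented.

    Each sample follows the previous one by exactly 1 / sample_rate_hz seconds,
    so the time depends only on the sample's index.
    """
    return Fraction(sample_index, sample_rate_hz)


def sample_byte_offset(sample_index: int, channels: int, bytes_per_sample: int) -> int:
    """Byte offset of a sample, assuming plain interleaved samples and no headers."""
    return sample_index * channels * bytes_per_sample


# Assumed parameters: 48 kHz, stereo, 16-bit (2 bytes per sample).
half_second = sample_presentation_time(24_000, 48_000)
offset_at_one_second = sample_byte_offset(48_000, channels=2, bytes_per_sample=2)
```

A Fraction is used rather than a float so that times stay exact no matter how far into the stream the sample lies.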
The packetization process is typically very simple. The elementary stream is broken up into a series of fixed-size chunks of bytes, usually without regard for the location of access unit boundaries. Each of these chunks is known as a "packet" and will fit within a single sector on the disc. An elementary stream that has been divided into packets is known as a "Packetized Elementary Stream", or PES for short.
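The chunking step described above can be sketched in a few lines. This is only an illustration: the payload size of 2000 bytes is an arbitrary assumption (the real figure depends on how much of the sector the headers occupy), and the function name is hypothetical. No headers are added here, only the splitting is shown.

```python
from typing import Iterator


def packetize(elementary_stream: bytes, payload_size: int) -> Iterator[bytes]:
    """Split an elementary stream into fixed-size payload chunks.

    Access unit boundaries are ignored entirely; only the final chunk
    may be shorter than payload_size.
    """
    if payload_size <= 0:
        raise ValueError("payload_size must be positive")
    for start in range(0, len(elementary_stream), payload_size):
        yield elementary_stream[start:start + payload_size]


# 5000 bytes of stand-in data, split with an assumed 2000-byte payload.
stream = bytes(range(256)) * 19 + bytes(136)
chunks = list(packetize(stream, 2000))
chunk_sizes = [len(chunk) for chunk in chunks]
```

Joining the chunks back together yields the original elementary stream, which is essentially what a demultiplexer does when playing the disc.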
Each packet has its own header (the "packet header"), so each sector ends up carrying fewer than 2048 bytes of elementary stream data. In some special cases, multiple packets can be packed into a single sector. This necessitates a second level of headers - the "pack" header, which is global to an entire sector irrespective of how many packets are contained within it. A pack is always exactly equal in size to a sector on the disc.
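The two levels of headers can be illustrated by walking through one sector and listing the packets inside it. The sketch below assumes the MPEG-2 program stream layout, none of which is spelled out above: a 14-byte pack header starting with the bytes 00 00 01 BA, whose last byte holds a 3-bit stuffing count, followed by packets that each start with 00 00 01, a one-byte stream id, and a two-byte big-endian length. The function name is hypothetical, and the fields inside each packet header are not decoded.

```python
import struct
from typing import List, Tuple

SECTOR_SIZE = 2048
PACK_START_CODE = b"\x00\x00\x01\xba"   # assumed MPEG-2 pack start code
PACKET_START_PREFIX = b"\x00\x00\x01"   # assumed packet start code prefix
PACK_HEADER_SIZE = 14                   # assumed size, excluding stuffing bytes


def list_packets(sector: bytes) -> List[Tuple[int, int]]:
    """Return (stream_id, length_field) for every packet in one pack."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("a pack must be exactly one sector long")
    if sector[:4] != PACK_START_CODE:
        raise ValueError("sector does not start with a pack header")
    # The low 3 bits of the last pack header byte give the stuffing length.
    offset = PACK_HEADER_SIZE + (sector[13] & 0x07)
    packets = []
    while offset + 6 <= SECTOR_SIZE:
        if sector[offset:offset + 3] != PACKET_START_PREFIX:
            raise ValueError(f"no packet start code at offset {offset}")
        stream_id = sector[offset + 3]
        (length,) = struct.unpack_from(">H", sector, offset + 4)
        if offset + 6 + length > SECTOR_SIZE:
            raise ValueError("packet runs past the end of the sector")
        packets.append((stream_id, length))
        offset += 6 + length            # skip to the next packet in the pack
    return packets
```

The length field counts every byte that follows it in the packet, so it includes the remainder of the packet header as well as the elementary stream data.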
Packetization can be complicated by non-seamless streams. Both audio and video typically consist of a single contiguous stream of data - we expect there to be no gaps in playback. Consequently, it's quite common for a single pack to contain the end of one access unit and the beginning of another. However, for subtitle streams, access units are not presented back to back - it's quite conceivable to have a five minute gap between two different subtitle images. For such types of elementary streams, it's often the case that a single pack will only contain data from a single access unit - any spare space will be made up by the addition of a padding packet to the pack. This also occurs at the end of a VOB, where the data in the elementary stream does not exactly fill a whole pack.
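As a rough sketch of how the spare space in a pack might be filled, the function below builds a padding packet sized to complete a 2048-byte sector. It assumes details not given above: the MPEG-2 padding stream id 0xBE, a 6-byte packet header (start code prefix, stream id, two-byte length), and 0xFF fill bytes. The function name is hypothetical, and gaps too small to hold a packet header are simply rejected rather than handled.

```python
import struct

SECTOR_SIZE = 2048
PADDING_STREAM_ID = 0xBE    # assumed MPEG-2 padding stream id
PACKET_HEADER_SIZE = 6      # start code prefix (3) + stream id (1) + length (2)


def make_padding_packet(used_bytes: int) -> bytes:
    """Build a padding packet that fills the rest of a sector.

    used_bytes is how much of the sector is already occupied by the
    pack header and any other packets.
    """
    remaining = SECTOR_SIZE - used_bytes
    if remaining == 0:
        return b""          # the pack is already full
    if remaining < PACKET_HEADER_SIZE:
        raise ValueError("gap is too small to hold a padding packet")
    fill = remaining - PACKET_HEADER_SIZE
    header = b"\x00\x00\x01" + bytes([PADDING_STREAM_ID]) + struct.pack(">H", fill)
    return header + b"\xff" * fill


# A pack holding 1020 bytes of headers and subtitle data needs 1028 more bytes.
padding = make_padding_packet(1020)
```

Appending the result to the existing contents always brings the pack up to exactly one sector.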
More to come (see HTML comments for outline)