Internal structure of an ID3 v2.3 MP3 audio file

The Roberts Family

MP3 File Internal Structure

These diagrams are my attempt to visualise the internal structure of an mp3 file using ID3 version 2.3 tags. They are based on what I have read on the Internet. I confess that I did not always fully understand what I was reading, so should you find that I have anything wrong, please contact me and I will happily correct it.

To fully understand what is going on you may need to do some homework. You will need to know about these topics. (Or you can just use the dll without worrying too much about what is happening inside.)

Little-endian and big-endian numbers
Synchsafe integers
Text encoding and BOMs (byte order marks)
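
As a rough illustration of these three topics, here is a minimal Python sketch. The function name `decode_synchsafe` and the sample bytes are made up for this example and do not come from any particular library. A synchsafe integer keeps the top bit of every byte clear, so each byte contributes only 7 bits.

```python
def decode_synchsafe(data: bytes) -> int:
    """Decode a synchsafe integer: 7 useful bits per byte, most significant byte first."""
    value = 0
    for byte in data:
        if byte & 0x80:
            raise ValueError("top bit must be clear in a synchsafe byte")
        value = (value << 7) | byte
    return value


# The same two bytes read in both byte orders.
raw = bytes([0x01, 0x02])
big = int.from_bytes(raw, "big")        # 0x0102
little = int.from_bytes(raw, "little")  # 0x0201

# Synchsafe: 0x00 0x00 0x02 0x01 means (2 << 7) | 1, not 0x0201.
size = decode_synchsafe(bytes([0x00, 0x00, 0x02, 0x01]))

# A BOM (here 0xFF 0xFE, little-endian) at the start of UTF-16 text.
# Python's "utf-16" codec reads the BOM and removes it from the result.
text = b"\xff\xfeH\x00i\x00".decode("utf-16")

print(big, little, size, text)
```
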

Overall Structure

This diagram shows the overall structure of an MP3 file using ID3 version 2.3 metadata. In this context 'Tag' refers to the block of the file containing all of the v2.3 metadata. Beware - this is potentially confusing as normally 'tag' refers to a single item of metadata, for example the artist's name. But not here.

Structure Of The Header Information Block

There is the main header, and, optionally, an extended header. (None of the files in my collection had an extended header.)

Diagram of header information block goes here.
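
Until the diagram is in place, the main header can be sketched in Python. This assumes the layout given in the ID3 v2.3 specification: the three bytes "ID3", two version bytes, one flags byte, and a four-byte synchsafe size, with bit 6 of the flags (0x40) signalling an extended header. The names `Id3Header` and `parse_id3_header` are hypothetical, chosen only for this example.

```python
from typing import NamedTuple


class Id3Header(NamedTuple):
    major: int                 # 3 for ID3 v2.3
    revision: int
    flags: int
    tag_size: int              # bytes that follow the 10-byte header
    has_extended_header: bool


def parse_id3_header(data: bytes) -> Id3Header:
    """Parse the 10-byte main header at the start of the file."""
    if len(data) < 10 or data[:3] != b"ID3":
        raise ValueError("no ID3v2 header found")
    major, revision, flags = data[3], data[4], data[5]
    size = 0
    for byte in data[6:10]:
        # Synchsafe size: only the low 7 bits of each byte count.
        size = (size << 7) | (byte & 0x7F)
    return Id3Header(major, revision, flags, size, bool(flags & 0x40))


# Hand-made sample header: v2.3, no flags set, tag size 257 bytes.
sample_header = parse_id3_header(b"ID3\x03\x00\x00\x00\x00\x02\x01")
print(sample_header)
```

In practice the first 10 bytes of the file would be passed in, and `tag_size` then says how many further bytes to read to get the whole tag.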

Structure Of The Tag

Remember that in an mp3 file the tag is a block of the file holding all the metadata - i.e. all of the things we commonly refer to as tags: artist, title, etc.

The tag is made up of frames, plus, optionally, padding. We are most interested in text frames as these hold the information about our music, one frame per item. There are other types of frame, most of which we can ignore.

Structure of a v2.3 ID3 MP3 Frame

A frame holds a single piece of information about the file.
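
The following Python sketch walks the frames in a tag. It assumes the v2.3 frame header layout from the specification: a four-character frame ID, a four-byte size and two flag bytes. Note that in v2.3 the frame size reads as a plain big-endian integer, not a synchsafe one (that changed in v2.4). The function name `iter_frames` is hypothetical, and the sketch further assumes a tag with no extended header and no unsynchronisation.

```python
import struct
from typing import Iterator, Tuple


def iter_frames(tag_body: bytes) -> Iterator[Tuple[str, int, bytes]]:
    """Yield (frame_id, flags, data) for each frame in a v2.3 tag body."""
    pos = 0
    while pos + 10 <= len(tag_body):
        # 4-byte ID, 4-byte big-endian size, 2 bytes of flags.
        frame_id, size, flags = struct.unpack(">4sIH", tag_body[pos:pos + 10])
        if frame_id[0] == 0:
            break  # a zero byte where an ID should start means padding
        data = tag_body[pos + 10:pos + 10 + size]
        if len(data) < size:
            break  # truncated frame; stop rather than guess
        yield frame_id.decode("latin-1"), flags, data
        pos += 10 + size


# Hand-made tag body: one TIT2 frame followed by 20 bytes of padding.
payload = b"\x00Purple Rain"
body = b"TIT2" + len(payload).to_bytes(4, "big") + b"\x00\x00" + payload + b"\x00" * 20
frames = list(iter_frames(body))
print(frames)
```
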

Structure of a Text Frame

These are the frames that hold the textual data describing the track: artist='Prince', track='Purple Rain' etc.
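
A text frame's data can be decoded along these lines. The sketch assumes the v2.3 layout of one encoding byte followed by the text, where 0x00 means ISO-8859-1 and 0x01 means UTF-16 with a BOM. The function name `decode_text_frame` is hypothetical. Some writers append a terminating NULL, so the sketch strips any trailing ones.

```python
def decode_text_frame(data: bytes) -> str:
    """Decode the data of a v2.3 text frame (for example TIT2 or TPE1)."""
    if not data:
        return ""
    encoding, payload = data[0], data[1:]
    if encoding == 0x00:
        text = payload.decode("latin-1")   # ISO-8859-1
    elif encoding == 0x01:
        # The codec reads the BOM to choose the byte order. If the BOM is
        # missing, Python falls back to the machine's native byte order.
        text = payload.decode("utf-16")
    else:
        raise ValueError(f"unexpected text encoding byte: {encoding:#04x}")
    return text.rstrip("\x00")


artist_latin1 = decode_text_frame(b"\x00Prince")
artist_utf16 = decode_text_frame(b"\x01\xff\xfeP\x00r\x00i\x00n\x00c\x00e\x00")
print(artist_latin1, artist_utf16)
```
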

Structure of a COMM Frame

We may also be interested in COMMENT frames. These may hold proprietary binary data, for example added by iTunes, or simple textual comments. I have chosen to process textual comments, whilst ignoring binary comments.

I could not find an online explanation of the internal format of these that I was fully able to understand and that corresponded to what I saw in the handful of test files that I examined. The diagram below is my best guess, but take it with a 'pinch of salt'.

Errata

In February 2021 I received several e-mails from 'Timmy' suggesting that I may have some of this wrong. Here are his comments on my COMM frame diagram:

Comments are most probably only meant for readable text. In this new context the distinction between "binary" and "text" gets misleading since all text is binary; just differently structured binary.
A "flag byte" is mentioned. The wording can be slightly misleading since it's not optional and there are actually several options. That first byte decides the character encoding of all data after the language code.
It says that if the first (encoding) byte is set to 0x01, the description field should only be 0xFF 0xFE. These are a Byte Order Mark (BOM) that actually are used to start off a string (in this case only UTF-16 ones in LE ("non-BE") mode), but the string may be left empty. The order of the BOM bytes depends on the desired byte order.
A "2 byte separator of 0x00 0x00" is mentioned. Its length is only 2 bytes if the used character encoding is UTF-16(/BE). Otherwise it's 1 byte.

So a more accurate description of the data field (as I see it) would be:

  ----------------------------------------
    Text Encoding, 1 byte
        0x00 for ISO-8859-1, or
        0x01 for UTF-16 with BOM, or
        0x02 for UTF-16BE without BOM, or
        0x03 for UTF-8.
    Language Code, 3 bytes
        E.g. "eng" (0x65, 0x6E, 0x67).
    Comment Description, n byte(s)
        Only if the text encoding is 0x01 UTF-16, the description must start with a BOM; 0xFE 0xFF or 0xFF 0xFE.
        The description may have a length of 0, but must end with a terminating NULL character formatted accordingly:
        - If the text encoding is 0x00 ISO-8859-1 or 0x03 UTF-8, the NULL character is 1 byte; 0x00.
        - If the text encoding is 0x01 UTF-16 or 0x02 UTF-16BE, the NULL character is 2 bytes; 0x00 0x00.
    Comment, n byte(s)
        Comment ends abruptly without any termination by NULL characters.
  ----------------------------------------

Thanks Timmy!
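
The layout Timmy describes can be sketched in Python as follows. The names `ENCODINGS`, `Comment` and `parse_comm` are hypothetical, chosen for this example. As far as the specifications read, encoding bytes 0x02 and 0x03 were only introduced in ID3 v2.4, so a strict v2.3 file should contain just 0x00 or 0x01. The sketch accepts all four to match the description above.

```python
from typing import NamedTuple

# Encoding byte -> (Python codec name, length in bytes of the NULL terminator)
ENCODINGS = {
    0x00: ("latin-1", 1),    # ISO-8859-1
    0x01: ("utf-16", 2),     # UTF-16 with BOM
    0x02: ("utf-16-be", 2),  # UTF-16BE without BOM
    0x03: ("utf-8", 1),      # UTF-8
}


class Comment(NamedTuple):
    language: str
    description: str
    text: str


def parse_comm(data: bytes) -> Comment:
    """Parse the data field of a COMM frame."""
    if len(data) < 4 or data[0] not in ENCODINGS:
        raise ValueError("not a recognisable COMM data field")
    codec, null_len = ENCODINGS[data[0]]
    language = data[1:4].decode("latin-1")
    rest = data[4:]
    # Search for the terminator in steps of null_len so that a 2-byte NULL
    # is only matched on a character boundary.
    pos = 0
    while pos + null_len <= len(rest):
        if rest[pos:pos + null_len] == b"\x00" * null_len:
            break
        pos += null_len
    else:
        raise ValueError("description terminator not found")
    description = rest[:pos].decode(codec)
    text = rest[pos + null_len:].decode(codec)
    return Comment(language, description, text)


# Empty description, ISO-8859-1 comment text.
plain = parse_comm(b"\x00eng" + b"\x00" + b"Great track")
# UTF-16: the description is a BOM only (an empty string), then 0x00 0x00,
# then the comment text with its own BOM.
wide = parse_comm(b"\x01eng" + b"\xff\xfe" + b"\x00\x00" + b"\xff\xfeH\x00i\x00")
print(plain, wide)
```
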

If anyone else knows more than me, and sees errors with these notes, please contact me and I will either annotate them appropriately or take them down.