Saturday, October 7, 2017

VP9 ELI5 Part 5

This is a challenging chapter. If anything is unclear, please leave a comment. Having Q/A under the post will benefit everyone.

Now that we've completed the uncompressed header, we need to create the compressed header. Once we create it, we can find its length (in bytes) and fill in the final field in the uncompressed header that we couldn't fill during the previous chapter.

You may be wondering how compression works. There are many types of compression and they fall into 2 categories: lossy and lossless. MP3 sound and MP4 video are lossy compressors (they lose some quality) while Zip files are lossless (they don't lose anything). VP9 overall is a lossy compressor, but it includes a lossless compression step. That's what the compressed header is compressed with.

So how do we compress this header data without losing anything? We're going to use a very clever system called a range encoder. In this case it's a binary arithmetic coder because we are writing 0's and 1's.

The range encoder depends on probabilities. Don't worry, this isn't going to be hard like Prob and Stats in college. The basic rule is that a probability is a number anywhere from 0 to 1. To find a probability, we use the following formula:

P will be between 0 and 1, inclusive.

0 means guaranteed never to happen, and 1 means guaranteed to happen. Anything in between reflects a chance that something will happen.

For example, let's find the probability of getting heads when flipping a coin. You have 1 (the number of "heads" sides on a coin) divided by 2 (the number of sides on a coin). Your probability will be 1/2 or 0.5 because either side is equally likely to occur.

For another example, let's say you're watching a game show and the host spins a wheel to see if a player won a prize. If there are 50 places on the spinner and only 1 has the prize, then you have a probability of 1/50 or 0.02.

But what event are we expecting when compressing VP9? In other words, what event do we want to find the chance of? It's quite simple. We want to know the chance that the next bit in the bitstream will be 0.

Once we know that, we multiply our probability value (P, between 0 and 1) by 256 and round to the nearest whole number to get our VP9 probability. So if we have a 1 out of 50 chance (shown above) of encountering a 0 in the bitstream then P would be equal to 0.02. We multiply 0.02 by 256 to get 5.12 and round to 5. 5 is the probability we'd use to compress.

Why do I need to know the chance of finding a 0?

It's because the compressor takes two inputs: bits, and the overall chance that a bit will be 0. The output is a string of bytes that we will put in a hex editor. There are YouTube videos you can watch (one is shown below) explaining exactly why, but for simplicity's sake just accept that if you know how many 0's to expect, you can write less (a LOT less) data. What kind of "a LOT less", you ask? Like going from 100 MiB to 15 MiB without losing anything.

(Advanced theory, not necessary to continue)


Of course, this only works if the decompressor knows the probability of finding a 0, so we tell the video player beforehand. That's what the compressed header is mainly about: telling the player what the chances are of finding a 0 in the bitstream.

The decompression process is pretty straightforward, but I couldn't figure out how to reverse it and compress. Fortunately, I don't have to understand exactly how to compress and neither do you because there is free source code that does it. The license is very permissive so you can even use it in your own programs.

Because we will be using someone else's algorithm for the range compressor, all we need to worry about is what data to write, and the probability that a bit will be 0.

And, to make things easier, a lot of the time we will just pretend like we're flipping a coin by setting the probability to 0.5. That means the VP9 probability would be 128. A lot of bits in VP9 are compressed using a probability of 128. There is no compression benefit when the chances of 0 and 1 are the same, but they still do it in some parts of VP9.

[Just a reference; not necessary in this chapter] The source code we need is located at https://github.com/webmproject/libvpx/blob/master/vpx_dsp/. We are interested in the files bitwriter.c and bitwriter.h. However, at this stage it's just for reference. I've adapted the C++ code to Java and will show you the finished output. Later you will need to either copy or adapt the code so you can compress image data in bulk.

Here is the Java code we'll start with.


As you can see, we need to insert some code in the middle. See the line that says "vpx_write_bit(br, 0);"? We need to add some more lines like that. When finished, this code will compress the bits we provide into a string of bytes that it will print in the console at the bottom. When we're done, we'll enter those bytes into a hex editor.

Now that we know how to compress, we need to look at what to compress.

This process will be similar to what we did for the uncompressed header. I'll go bit-by-bit and explain everything. Here is what we need to encode for the compressed header.

(Taken from the VP9 Spec PDF)

We actually don't have much to encode here. We only have 3 sections: read_tx_mode, read_coef_probs, and read_skip_prob. The vast majority of what remains is for non-intra frames, which doesn't apply to what we're encoding since this is an intra frame.

So this is all we actually have to write:



First we'll collect our bits in a list like in the previous chapter. Then I'll show you how to compress them.

The idea here is that VP9 has default probabilities for everything. To avoid the complexity of having to find our probabilities and write them here, we'll just write a header that says we'll only be using default probabilities.

read_tx_mode: (All probabilities = 128)
      tx_mode: 2 bits (11) for ALLOW_32X32
      tx_mode_select: 1 bit (0)

read_coef_probs: (All probabilities = 128)
      update_probs: 1 bit (0)
      Do not update any probabilities for 4x4 transform coefficients

      update_probs: 1 bit (0)
      Do not update any probabilities for 8x8 transform coefficients

      update_probs: 1 bit (0)
      Do not update any probabilities for 16x16 transform coefficients

      update_probs: 1 bit (0)
      Do not update any probabilities for 32x32 transform coefficients

read_skip_prob: (All probabilities = 252)
These bits are known to be 0 most of the time, so the designers chose 252/256 as the probability of finding a 0.

      Change the skip prob for context 0? 1 bit (0)
      We do not want to change any probabilities, so this is 0.

      Change the skip prob for context 1? 1 bit (0)
      We do not want to change any probabilities, so this is 0.

      Change the skip prob for context 2? 1 bit (0)
      We do not want to change any probabilities, so this is 0.

Writing the code

Let's put what we've learned into action and get our VP9 bitstream moving forward again.


We have a function called vpx_write_bit(). That writes a bit to the compressed bitstream with a probability of 128. But we don't just want to write single bits; we have a multi-bit number called tx_mode at the top. How do we write that? We use a function called vpx_write_literal(). In VP9, a literal is a multi-bit number that is written with equal probability of finding a 0 or 1. The individual bits are simply written from left (most significant) to right (least significant). For example, a literal of 100 (binary) would be written as a 1 and then two 0's, in that order.

You may be wondering why there's already a vpx_write_bit(br, 0); line. Why do we start by writing a 0? Because the player expects to see a 0 at the beginning of any compressed section. If it doesn't, it will consider the bitstream invalid. The first 0 is discarded when playing the video.

Now that we've established that, we need to start writing the useful data. Let's add a line to our code:

      vpx_write_literal(br, 3, 2); //ALLOW_32X32

The br is the bitwriter object, 3 is the decimal equivalent of 11 binary, and 2 means we want 2 bits.

Now we need to write some single bits.

vpx_write_bit(br, 0); //tx_mode_select
vpx_write_bit(br, 0); // read_coef_probs 4x4
vpx_write_bit(br, 0); // read_coef_probs 8x8
vpx_write_bit(br, 0); // read_coef_probs 16x16
vpx_write_bit(br, 0); // read_coef_probs 32x32
vpx_write(br, 0, 252); //read_skip_prob, context 0
vpx_write(br, 0, 252); //read_skip_prob, context 1
vpx_write(br, 0, 252); //read_skip_prob, context 2


Here's what the code should look like at the end:



Notice that we have vpx_stop_encode(br) at the end. Throughout the compression process, we are modifying a string of bytes and we need to cleanly end the compression process before we can output them.

Run the program and it will give the following output:


The single output byte, 96, is a decimal value. In hex it's 0x60. We need to ensure that a video player won't run out of bytes when it's reading the compressed header, so we'll add 1 padding byte, a value of 0, just to be safe. That would be 96 0 (decimal) or 0x60 0x00 (hex).

That means that the length of our compressed header is 2 bytes. Now we can go back and fill in the header_size_in_bytes field we skipped in the previous chapter.

Start your hex editor and open uncompressed_header, the file we saved in the last chapter.



We need to change the last 2 bytes from 00 00 to 00 02. Since the first byte was already 00, I only had to change the second one (shown in red).



Save the file. Your uncompressed header is now complete.

Compressed header in hex

This part is very easy. Create a new file in the hex editor and enter 60 00 in the hex area on the left (not the text area on the right). Make sure you're in hex mode (you will be by default; you'd know if you weren't)

You should have an empty hex file and have the text/selection cursor at the very beginning as shown.


Enter 60 00 in the hex area on the left.


Now save it as compressed_header.

I recommend moving your headers into a folder called VP9 Files.



You now have valid headers for a VP9 intra frame. The next step is to encode the image data.

Wednesday, October 4, 2017

VP9 ELI5 Part 4

Okay, so now that we've examined what RAW video data is composed of, we can get to work creating a valid VP9 video file.

For this step, we'll need to look at the VP9 bitstream specification PDF. Don't worry, I'll walk you through the whole thing. But first, read the section on video containers.

Video Containers

Once we have a valid VP9 frame we will need a container to hold our finished VP9 video. A container is simply a file format that contains information for a video player like Media Player Classic to find individual frames in a file that would otherwise have been RAW.

Good container examples are *.MP4 and *.WAV files. MP4 isn't actually a video format, it's a file structure that can contain several different video formats, such as H.264. In a similar way, WAV isn't an audio format, it's simply a file that contains information about how to play RAW audio, usually LPCM.

We will be using an *.IVF (Indeo Video File) video container to hold our finished VP9 stream. It's really simply and easy to build in a hex editor.

Putting together a VP9 video

A standard VP9 bitstream is not a whole video as you might have assumed, but instead an individual frame. Think of a container (in our case, *.IVF file) as a ZIP archive with individual images. When encoding VP9, instead of compressing the entire video into one bitstream, we encode one frame, write it to the container file, and repeat until the end.

So a VP9 bitstream is just one frame. Here is how it is structured:

Adapted from the VP9 Bitstream Specification PDF.

We start with the uncompressed header (easy to generate), then write the compressed header, and finally start writing tile (image) data.

Advanced info: For simplicity and coding efficiency, we will only be dealing with 1 tile in our VP9 frames.

Here's the structure of a VP9 frame in text form. Don't worry that this doesn't make sense yet. We'll go bit-by-bit (literally).


Let's open Notepad++ and get started hand-crafting our bitstream.

We'll skip the first line startBitPos = get_position() and move on to uncompressed_header().

The bitstream spec defines different data structures depending on certain video parameters, so we need to define our parameters beforehand. We know it's a 960x540 image, but what Profile will the video be? VP9 supports 4 profiles, 0-3. Profile 0 is the default that YouTube videos are saved as. It can only carry 8-bit YUV 4:2:0. Profile 2 (the third profile) supports 10-bit pixels with YUV 4:2:0 color, so we'll be using Profile 2.

Here are the parameters we need to write, in order, in our case.

Uncompressed Header


frame_marker: 2 bits (1 0)
Always the case

profile_low_bit: 1 bit (0)
profile_high_bit: 1 bit (1)

Since Profile is 2 and 2 in binary is 10, the high bit is 1 and the low bit is 0.

show_existing_frame: 1 bit (0)
We are not trying to show a previously-decoded frame

frame_type: 1 bit (0)
0 means key frame, or initial image

show_frame: 1 bit (1)
Yes, we want to show the frame

error_resilient_mode: 1 bit (1)
Enable error resilient mode

Since frame_type = 0 (keyframe),
      frame_sync_code: 24 bits (01001001 10000011 01000010)
      color_config:
            ten_or_twelve_bit: 1 bit (0)
            0 because this is 10-bit data.

            color_space: 3 bits (010)
            Color space equals 2 (binary 010) for BT.709

            color_range: 1 bit (0)
            This designates Studio Swing, the 16-235 and 64-940 range limit we discussed earlier.
            However, it is not enforced by the VP9 spec. It must be enforced by players.

      frame_size:
            frame_width_minus_1: 16 bits (0000001110111111)
            959 in binary form

            frame_height_minus_1: 16 bits (0000001000011011)
            539 in binary form

      render_size:
            render_and_frame_size_different: 1 bit (0)
            The render size will be the same as the frame size, 960x540

frame_context_idx: 2 bits (00)
Save our frame context into the 0 position
(not really important since we reset after each frame)

loop_filter_params:
      loop_filter_level: 6 bits (000011)
      Let's set this to 3 (11 binary)

      loop_filter_sharpness: 3 bits (011)
      Let's set this to 3 (11 binary)

      loop_filter_delta_enabled: 1 bit (0)
      Do not enable loop filter delta (being able to change the loop filter for
      different parts of the image)

quantization_params:
      base_q_idx: 8 bits (00000001)
      Base quantizer (divisor for shrinking large values later on)
      This provides the lossy part of VP9 compression. I've set it to 1 because
      this is more about understanding the format and creating a valid image than
      about compressing.

      delta_coded: 1 bit (0)
      No change to the quantizer required for the DC component of transforms of B/W image
      (transforms are covered later)

      delta_coded: 1 bit (0)
      No change to the quantizer required for the DC component of transforms of
      color add-on data

      delta_coded: 1 bit (0)
      No change to the quantizer required for the AC components of transforms of
      color add-on data

segmentation_params:
      segmentation_enabled: 1 bit (0)
      Segmentation is a really cool feature but for simplicity we won't be using it.

tile_info:
      increment_tile_cols_log2: 1 bit (0)
      Do not increment that variable. Keep it at 0 since 2 ^ 0 = 1 and we only want 1 tile column.

      tile_rows_log2: 1 bit (0)
      Do not increment that variable. Keep it at 0 since 2 ^ 0 = 1 and we only want 1 tile row.

header_size_in_bytes: 16 bits (Can't know it yet; must complete next chapter first)
We'll pretend it's 16 0's (0000000000000000) for now.


Putting it together


So if you enter all these bits in order into Notepad++, you should start to have something like this:

(Obviously I superimposed this blog entry over Notepad++; I did not somehow manage to type formatted text in there)

Putting spaces between the fields is optional. I did it for clarity.

Anyway, once you've entered all the bits into Notepad++ (including the 16 0's for the end data we don't know), then you will need to press Ctrl+H to get the Find/Replace dialog if you added spaces. Skip this paragraph if you didn't add spaces. Clear the search field and press the spacebar once. Clear the Replace With field and press Replace All. Now there shouldn't be any spaces in the file.

Now that the spaces are gone, look at the status bar at the bottom of the window. If you entered the bits properly and removed any spaces, it should say Length: 112.


112 is divisible by 8 so this will fit in 14 bytes (112 / 8) without a need for extra empty bits to make a whole byte.

Now you'll need to separate the string of bits into groups of 8. Just start at the beginning, use your left arrow key to advance 8 characters, press Enter, and repeat. I know this is tedious, but don't worry, we only do this once.

It should look like this when it's done.


Now you can either manually copy-paste this one-by-one into MS Calculator in Bin mode and convert to Dec (tedious), or you can copy-paste the whole list into the top box at https://www.textfixer.com/numbers/binary-to-decimal.php (recommended) and use that tool to get the whole list at once.


Now open a hex editor with decimal input support, such as HHD Hex Editor Neo. Create a new file, press Ctrl+1 (for Hex Editor Neo) to set it to decimal mode, click in the first blank entry in the large area on the left and start typing these numbers in order, pressing Enter after each one.


Save this file as "uncompressed_header". We will come back to it after we finish the next chapter.

Tuesday, October 3, 2017

VP9 ELI5 Part 3

For this part, I'll need to define how the input image data (what we're encoding) is stored.

Pixels and color storage methods

Up until now, I only said that image data is not RGB, but rather YUV. But that doesn't tell you anything about how it's stored on the computer. The image, as you know, is 960x540 pixels. That means 960 pixels wide by 540 pixels high. When it was taken by the camera and when it's shown on your screen, it is merely a 2-dimensional array of pixels. It was not subdivided because it's just a plain image, not yet encoded with VP9.

Let's zoom in so you can get an idea of the individual pixels in the B/W image.



If you use the eyedropper tool in MS Paint, you can pick up a color from an image. You can see just what that color's RGB value is by going into Edit Colors.



As you can see, white is 255, 255, 255. As we discussed earlier, pixel values in VP9 follow a similar system but with YUV. Shades of gray, which is what the B/W YUV image is made up of, can be made in RGB by setting all three numbers to the same value.

Most importantly, why do you think MS Paint is limited to 255? It's because colors on a computer are almost always 8 bits (bits means 0's and 1's) per channel. On a computer screen, channels are RGB, hence MS Paint's use of RGB instead of YUV.

If you have 8 bit fields capable of holding 0 or 1, you can think of it as a row of light switches. How many combinations of On and Off can you come up with by flipping 8 light switches? The answer is 2 raised to the 8th power, or 256. That's because you have 2 states (On/Off) and you have 8 places that can be set to either state. But if you have 10 light switches, that would be 2 to the 10th, or 1024.

You can't reach 256 with 8 bits because 256 is how many states you have. One of them is 0. It would be like having 60 minutes in an hour but ending each hour on a clock at 60 instead of 59.

In this tutorial we will be dealing with 10-bit pixels, capable of ranging in value from 0 to 1023 instead of the usual 8-bit values you may be familiar with that range from 0-255. This is because it's a proven fact that 10-bit video carries higher quality at a given bitrate than 8-bit video. I'm not joking when I say that a 10-bit 1080p video at 1 Mbps can actually look a lot better than an 8-bit 1080p video at 2 Mbps.

Understanding RAW image data

You may have heard of RAW images taken by expensive DSLR cameras, but that's not what we're talking about here. In video, RAW data is a bit different.

A common format for RAW video data is YUV420p. 8 bits per value is implied because a bit depth wasn't specified. You can export to this format using a tool called FFmpeg. I'll describe how YUV420p files are structured before we move on because although we won't be using YUV420p, we will be using YUV420p10LE which is quite similar but more complicated.

In a RAW YUV420p video file, each frame is stored one after the other with no separators. It is simply the Y (B/W image), U and V. Then it repeats for the next frame.

You're probably familiar with the fact that computer data is stored as bytes, and that bytes are 8 bits each. Therefore, a byte can represent any number from 0 to 255. In a readable text file, each letter or other character is simply a number (usually less than 128) that your text reader knows corresponds to a letter or character. But if you've ever opened an image or program in a text editor, you know that they contain weird random characters. That's because the bytes in the file are storing numbers that don't correspond to readable characters.

(A PNG file opened in Notepad)

In a YUV420p video, we first store the B/W image pixel-by-pixel in raster scan fashion, meaning left to right and top to bottom. Note that there is no subdivision here. The raster scan does the whole image line by line.


So you see, raster scanning begins at the absolute top-left pixel, scans to the right until the end, goes down one, jumps back to the left, and goes again.

[You don't need to know the following formulas unless you choose to follow along. If you're just reading then you'll be fine without them]

That will take up (width * height) bytes, since each pixel takes one byte. Then we need to store the U and V images. Well, since the width and height of those are 1/2 the dimensions of the B/W image, the data for U and V will each be ((width / 2) * (height / 2)) bytes. Since we must copy U and V like this, that will be 2 * ((width / 2) * (height / 2)) bytes.

So now we have (width * height) + (2 * ((width / 2) * (height / 2)) bytes of data for just one frame. With an image size of 960x540, that comes out to 777,600 bytes total.

So for each frame in our YUV420p RAW video, the first 518,400 bytes will be the B/W image, the next 129,600 bytes will be the U image, and the remaining 129,600 bytes will be the V image.

Understanding YUV420p10le RAW image data

YUV420p only handles values from 0 to 255 because it's an 8-bit format. Since we want to work with 10-bit image data, we need a way to store values from 0 to 1023. This is where YUV420p10le comes in.

Since files are stored as bytes and bytes are multiples of 8 bits, we can't easily store 10-bit values because we'd have a fraction of a byte left over after almost every 10-bit value we write. This is why YUV420p10le stores the 10-bit values inside a larger 16-bit value. Since 16-bit numbers can range from 0 to 65,535 it will easily store our 0-1023 values.

So now we have to store 2 bytes for each value (16 bits = 2 bytes). The file structure stays the same so our file size doubles to 1,555,200 bytes or 1.48 MiB.

By now you may have guessed that having a higher range of available values makes for a higher-quality image. In a normal PC image, for example, 8 bits per color channel times 3 channels (RGB) equals 24 bits per pixel overall, or 24.7 million different colors. But what if you only have 256 colors at your disposal? The image will look quite bad.

(The image in only 256 colors. Click to enlarge and see how bad it is)

Videos encoded with 8-bit YUV420p look great. In fact, it's the bit depth that all YouTube videos are encoded with, even the ones that claim to be HDR which is at least 10-bit. But 10-bit video is better because you can express a far greater range of light and dark. Converting from 8-bit to 10-bit increases the range by simply multiplying the values to bring them up to the 10-bit range, but there will not be any increase in actual quality because you can't get information that was never there to begin with.

If that sounds confusing, let's say you have a gradient of 8-bit pixels: {15, 16, 17, 18, 19}. Now if you wanted to convert them to 10-bit, you would multiply each entry by 4 to get {60, 64, 68, 72, 76}. But notice that you don't have any values in-between whole multiples of 4. If you had recorded the video with a 10-bit camera you could have any value in the 10-bit range (theoretically), but since you converted from 8-bit you will only have the 256 different 8-bit levels spread evenly within 1024 total levels.

When you convert an 8-bit video to 10-bit, this is all you're doing.

This means that in our case the only purpose of converting to 10-bit is so we can have 10-bit values for our VP9 encoder to work with.

Opening RAW videos in a hex editor

Let's see what RAW video data looks like in a hex editor. What you're looking at in these hex screenshots are the pixel values at the beginning of the file, the first few pixels in the top row of the B/W image. These are shades of gray expressed as numbers, but if that doesn't make sense then I'll try to explain.

Here is what the 8-bit image looks like in a hex editor:

On the left we have hex digits and on the right, bytes as they'd be seen in a text editor

And this is the image in 10-bit format:

In this 10-bit example, each pixel value takes up 2 bytes.

There is an annoying quirk you have to watch out for on multi-byte numbers. Most computers read and write them as Little Endian which means byte-reversed.

Let's take the example of 255 which is 0xFF in hex. Hex works by using 0-9 and A-F to make a base-16 number system. 2 hex digits make a byte.

If we have a hex byte of 0xFF (255 decimal) and add 1 to make 256, it would "overflow" and we would get 2 hex bytes: 0x01 0x00. On Windows Vista and up, you can try setting Windows calculator to Programmer and use the Hex mode to enter 0100, then click Dec and it will say 256.

So now that we've established that, let's say you wanted to record that number in a file. You know about place value in the decimal system (tens, hundreds, etc) so you'd expect to just write 0x0100 to your file, right? Well, it turns out that due to a computer quirk carried from the 1970's, the computer would prefer it if you did it in reverse as 0x00 0x01. This applies even to longer hex numbers like 0xDEADBEEF. Each pair of hex digits must be written in reverse order; do not write the single digits in reverse order. 0xDEADBEEF should become 0xEFBEADDE, not 0xFEEBDAED.

Since this weird reverse notation is called Little Endian, it makes sense that the normal order you'd expect to use is called Big Endian. Keep in mind that the bytes (pairs of hex digits) are what's reversed, never bits or single hex digits.

Little Endian is what the LE in YUV420pLE stands for.

So now let's set our hex editor to group by Words (Word = 16 bits).

By default, it's set to Little Endian so these hex bytes are reversed to Big Endian for proper viewing.

Let's press Ctrl+1 or go to View -> Display As -> Decimal. Now you can see that the values are all normal human-readable numbers and that they never exceed 940.


[Random trivia, not necessary to continue] Why don't the values exceed 940 when we were supposed to convert to a 0-1023 range? Because video of photographic subjects (ie not computer screen captures) customarily ranges not from 0-255 or 0-1023, but rather from 16-235 or from 64-940 (16-235 times 4).

Now you can see what I mean by shades of gray expressed as numbers.

VP9 ELI5 Part 2

Here's the photo split into its 3 YUV components:
The U and V are the color add-on data. Notice how they are 1/4 the size of the B/W image.

Most video formats, VP9 included, depend on both intra and inter frames. An intra frame is a normal picture while an inter frame is just the difference between the current frame and a previous one. Think of it as only saving what moved in the current frame instead of saving the whole picture again.

Obviously we need to start a video with an intra frame, also known as a key frame, so that we can build future inter frames based on it.

Unlike BMP's, we don't save VP9 images pixel-by-pixel in order. Instead, we divide the image on several levels. The first, highest level of division is into 64x64 superblocks.

64x64 is a very large subdivision size. In MP4 (H.264), the largest size available is 16x16. Notice the difference below:

16x16

64x64

The superblocks are processed in raster order. That means left to right, top to bottom. In other words, each horizontal line of superblocks is processed from left to right, in order from top to bottom.

Each superblock is then optionally subdivided further. Here are some possible subdivisions:



Notice what subdivisions are allowed. Shown from left to right, we can split a superblock:
  • Vertically to make 16x32,
  • Horizontally to make 32x16, or
  • Four ways to make 4 32x32 blocks

After that, we can split the 32x32 blocks the same way or we could leave them alone. However, we may not further split one of the non-square sizes. Any non-square split is final in VP9. Also note that 4x4 is technically as small as we can get but things get messy at that size so for this tutorial we will not go smaller than 8x8.

This is just the first level of subdivision. The next level is the transform level and is more interesting.

VP9 ELI5 Part 1

Welcome to my VP9 Explain Like I'm 5 series. I'm going to explain the VP9 codec so anyone can understand it and perhaps even create their own encoder.

What is VP9 and why should I care?

First things first, let's establish what VP9 is. It's a video codec created by Google and it's what almost all YouTube videos are saved as on the server. Previously, YouTube videos were all MP4 (specifically, H.264). There were two problems with that. The first is that it's patented and copyrighted, so you have to pay the MPEG group. Second, it takes a lot of data for any given quality level. Mobile consumers aren't the only ones who pay by the gigabyte for their data usage; businesses have to pay for that as well to serve people. Normally video has a tradeoff: you can have higher quality for more data usage, or less data usage with bad quality. VP9, amazingly enough, allows Google to pack higher quality into less data, achieving both benefits at once. Unfortunately, only Google and Netflix use VP9; everyone else still uses MP4.

Video definition

You probably already know it's a series of still pictures called frames. If you took pictures really fast with your still picture camera, you could play them quickly to see a crude video. Video is merely a set of still pictures played one after the other really fast.

What are these pictures composed of?

You may be thinking, "A picture is simply dots (pixels). Each one has an RGB (red, green, blue) color value. I know because I've done pixel art in MS Paint." Well, I've got to tell you, that's not how images work in a video.

To explain this, I'll need to give a brief history lesson. You see, video frames could easily use RGB colors like you might have assumed. However, the US TV system was defined near the beginning of World War II and only carried black-and-white video until the 1950's. So instead of red, green, and blue, the old TV system only carried brightness values (shades of gray). You can approximate that in MS Paint by using the same number for red, green, and blue.

In the 50's, they wanted to add color to TV without breaking compatibility with existing TV's. They weren't willing to go the 2009 route and make everyone buy a new TV. What they did was mix a small amount of color data onto the existing black-and-white signal. That way, old TV's would still decode it while color TV's would add the color info on top of the black-and-white picture.

The FCC had allocated 6 MHz bandwidth for TV channels. The engineers couldn't add too much extra color without the signal bleeding onto other channels, so they devised a system of saving only a low-res copy of the image in color format while saving the full-res image in B/W format. The low-quality color image was enlarged and super-imposed over the B/W image, making good color because the human eye sees more brightness than color.

Even though NTSC is nearly dead, this system lives on in computers as YUV 4:2:0. If you want to read more, check out the YUV Wikipedia page.

Let's get started

For this series, we'll be working with the following 960x540 color photo. In case you didn't know, that's 1/4 the size of 1920x1080.

Credit: Lt. Zachary West, 100th MPAD, Flickr, CC BY 2.0

This wasn't actually 960x540 but I cropped and resized so it would be a nice easy size.

If you want to follow along, you can download the image.