Saturday, October 7, 2017

VP9 ELI5 Part 5

This is a challenging chapter. If anything is unclear, please leave a comment. Having Q/A under the post will benefit everyone.

Now that we've completed the uncompressed header, we need to create the compressed header. Once we create it, we can find its length (in bytes) and fill in the final field in the uncompressed header that we couldn't fill during the previous chapter.

You may be wondering how compression works. There are many types of compression and they fall into 2 categories: lossy and lossless. MP3 sound and MP4 video are lossy compressors (they lose some quality) while Zip files are lossless (they don't lose anything). VP9 overall is a lossy compressor, but it includes a lossless compression step. That's what the compressed header is compressed with.

So how do we compress this header data without losing anything? We're going to use a very clever system called a range encoder. In this case it's a binary arithmetic coder because we are writing 0's and 1's.

The range encoder depends on probabilities. Don't worry, this isn't going to be hard like Prob and Stats in college. The basic rule is that a probability is a number anywhere from 0 to 1. To find a probability, we use the following formula:

P = (number of outcomes that count as the event) / (total number of possible outcomes)

P will be between 0 and 1, inclusive.

0 means guaranteed never to happen, and 1 means guaranteed to happen. Anything in between reflects a chance that something will happen.

For example, let's find the probability of getting heads when flipping a coin. You have 1 (the number of "heads" sides on a coin) divided by 2 (the number of sides on a coin). Your probability will be 1/2 or 0.5 because either side is equally likely to occur.

For another example, let's say you're watching a game show and the host spins a wheel to see if a player won a prize. If there are 50 places on the spinner and only 1 has the prize, then you have a probability of 1/50 or 0.02.

But what event are we expecting when compressing VP9? In other words, what event do we want to find the chance of? It's quite simple. We want to know the chance that the next bit in the bitstream will be 0.

Once we know that, we multiply our probability value (P, between 0 and 1) by 256 and round to the nearest whole number to get our VP9 probability. So if we have a 1 out of 50 chance (shown above) of encountering a 0 in the bitstream then P would be equal to 0.02. We multiply 0.02 by 256 to get 5.12 and round to 5. 5 is the probability we'd use to compress.
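Here is a minimal sketch of that scaling step in Java. The class and method names are made up for this example, and the clamp to 1..255 is an assumption added so the result always fits in one byte; the text only describes the multiply-and-round part.

```java
// Sketch only: Vp9Probability and toVp9Probability are hypothetical names.
class Vp9Probability {
    // Scale a 0..1 chance that the next bit is 0 to the 0..256 scale used in the text.
    static int toVp9Probability(double p) {
        long scaled = Math.round(p * 256.0);
        // Assumption: clamp to 1..255 so the value fits in a single byte.
        return (int) Math.max(1, Math.min(255, scaled));
    }

    public static void main(String[] args) {
        System.out.println(toVp9Probability(1.0 / 50)); // game show wheel: prints 5
        System.out.println(toVp9Probability(0.5));      // coin flip: prints 128
    }
}
```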

Why do I need to know the chance of finding a 0?

It's because the compressor takes two inputs: bits, and the overall chance that a bit will be 0. The output is a string of bytes that we will put in a hex editor. There are YouTube videos you can watch (one is shown below) explaining exactly why, but for simplicity's sake just accept that if you know how many 0's to expect, you can write less (a LOT less) data. What kind of "a LOT less", you ask? Like going from 100 MiB to 15 MiB without losing anything.

(Advanced theory, not necessary to continue)

Of course, this only works if the decompressor knows the probability of finding a 0, so we tell the video player beforehand. That's what the compressed header is mainly about: telling the player what the chances are of finding a 0 in the bitstream.

The decompression process is pretty straightforward, but I couldn't figure out how to reverse it and compress. Fortunately, I don't have to understand exactly how to compress and neither do you because there is free source code that does it. The license is very permissive so you can even use it in your own programs.

Because we will be using someone else's algorithm for the range compressor, all we need to worry about is what data to write, and the probability that a bit will be 0.

And, to make things easier, a lot of the time we will just pretend like we're flipping a coin by setting the probability to 0.5. That means the VP9 probability would be 128. A lot of bits in VP9 are compressed using a probability of 128. There is no compression benefit when the chances of 0 and 1 are the same, but they still do it in some parts of VP9.

[Just a reference; not necessary in this chapter] The source code we need is located at We are interested in the files bitwriter.c and bitwriter.h. However, at this stage it's just for reference. I've adapted the C code to Java and will show you the finished output. Later you will need to either copy or adapt the code so you can compress image data in bulk.

Here is the Java code we'll start with.

As you can see, we need to insert some code in the middle. See the line that says "vpx_write_bit(br, 0);"? We need to add some more lines like that. When finished, this code will compress the bits we provide into a string of bytes that it will print in the console at the bottom. When we're done, we'll enter those bytes into a hex editor.

Now that we know how to compress, we need to look at what to compress.

This process will be similar to what we did for the uncompressed header. I'll go bit-by-bit and explain everything. Here is what we need to encode for the compressed header.

(Taken from the VP9 Spec PDF)

We actually don't have much to encode here. We only have 3 sections: read_tx_mode, read_coef_probs, and read_skip_prob. The vast majority of what remains is for non-intra frames, which doesn't apply to what we're encoding since this is an intra frame.

So this is all we actually have to write:

First we'll collect our bits in a list like in the previous chapter. Then I'll show you how to compress them.

The idea here is that VP9 has default probabilities for everything. To avoid the complexity of having to find our probabilities and write them here, we'll just write a header that says we'll only be using default probabilities.

read_tx_mode: (All probabilities = 128)
      tx_mode: 2 bits (11) for ALLOW_32X32
      tx_mode_select: 1 bit (0)

read_coef_probs: (All probabilities = 128)
      update_probs: 1 bit (0)
      Do not update any probabilities for 4x4 transform coefficients

      update_probs: 1 bit (0)
      Do not update any probabilities for 8x8 transform coefficients

      update_probs: 1 bit (0)
      Do not update any probabilities for 16x16 transform coefficients

      update_probs: 1 bit (0)
      Do not update any probabilities for 32x32 transform coefficients

read_skip_prob: (All probabilities = 252)
These bits are known to be 0 most of the time, so the designers chose 252/256 as the probability of finding a 0.

      Change the skip prob for context 0? 1 bit (0)
      We do not want to change any probabilities, so this is 0.

      Change the skip prob for context 1? 1 bit (0)
      We do not want to change any probabilities, so this is 0.

      Change the skip prob for context 2? 1 bit (0)
      We do not want to change any probabilities, so this is 0.

Writing the code

Let's put what we've learned into action and get our VP9 bitstream moving forward again.

We have a function called vpx_write_bit(). That writes a bit to the compressed bitstream with a probability of 128. But we don't just want to write single bits; we have a multi-bit number called tx_mode at the top. How do we write that? We use a function called vpx_write_literal(). In VP9, a literal is a multi-bit number that is written with equal probability of finding a 0 or 1. The individual bits are simply written from left (most significant) to right (least significant). For example, a literal of 100 (binary) would be written as a 1 and then two 0's, in that order.
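To make the left-to-right order concrete, here is a small Java sketch that only splits a literal into its bits. LiteralBits and literalBits are hypothetical names for this illustration; nothing is compressed here.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: shows the order in which the bits of a literal would be written.
class LiteralBits {
    // Return the lowest `bits` bits of value, most significant bit first.
    static List<Integer> literalBits(int value, int bits) {
        List<Integer> out = new ArrayList<>();
        for (int i = bits - 1; i >= 0; i--) {
            out.add((value >> i) & 1);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(literalBits(0b100, 3)); // prints [1, 0, 0]
        System.out.println(literalBits(3, 2));     // tx_mode: prints [1, 1]
    }
}
```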

You may be wondering why there's already a vpx_write_bit(br, 0); line. Why do we start by writing a 0? Because the player expects to see a 0 at the beginning of any compressed section. If it doesn't, it will consider the bitstream invalid. The first 0 is discarded when playing the video.

Now that we've established that, we need to start writing the useful data. Let's add a line to our code:

      vpx_write_literal(br, 3, 2); //ALLOW_32X32

The br is the bitwriter object, 3 is the decimal equivalent of 11 binary, and 2 means we want 2 bits.

Now we need to write some single bits.

vpx_write_bit(br, 0); // tx_mode_select
vpx_write_bit(br, 0); // read_coef_probs, 4x4
vpx_write_bit(br, 0); // read_coef_probs, 8x8
vpx_write_bit(br, 0); // read_coef_probs, 16x16
vpx_write_bit(br, 0); // read_coef_probs, 32x32
vpx_write(br, 0, 252); // read_skip_prob, context 0
vpx_write(br, 0, 252); // read_skip_prob, context 1
vpx_write(br, 0, 252); // read_skip_prob, context 2

Here's what the code should look like at the end:

Notice that we have vpx_stop_encode(br) at the end. Throughout the compression process, we are modifying a string of bytes and we need to cleanly end the compression process before we can output them.
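For readers who want something to run, here is a minimal sketch of a bool (range) encoder in Java. The function names follow the ones used in this chapter, but the class name and all of the internals are a reconstruction for illustration, not the original program. Two choices here are assumptions: the encoder is flushed by writing 32 extra zero bits, and trailing zero bytes are then trimmed so only the meaningful output is printed.

```java
import java.util.Arrays;

// Sketch only: the class name and internals are assumptions for illustration.
class BoolWriterSketch {
    int low = 0;       // low end of the current coding interval
    int range = 255;   // interval width, renormalized to stay in 128..255
    int count = -24;   // shift counter; a byte is emitted when it reaches 0
    int pos = 0;       // number of bytes written so far
    byte[] buffer = new byte[64];

    // Write one bit; probability is the chance (out of 256) that the bit is 0.
    static void vpx_write(BoolWriterSketch br, int bit, int probability) {
        int split = 1 + (((br.range - 1) * probability) >> 8);
        int range = split;
        if (bit != 0) {
            br.low += split;
            range = br.range - split;
        }
        int shift = Integer.numberOfLeadingZeros(range) - 24; // scale range back up to >= 128
        range <<= shift;
        br.count += shift;
        if (br.count >= 0) {
            int offset = shift - br.count;
            if (((br.low << (offset - 1)) & 0x80000000) != 0) { // carry into earlier bytes
                int x = br.pos - 1;
                while (x >= 0 && (br.buffer[x] & 0xff) == 0xff) br.buffer[x--] = 0;
                br.buffer[x]++;
            }
            br.buffer[br.pos++] = (byte) (br.low >>> (24 - offset));
            br.low = (br.low << offset) & 0xffffff;
            shift = br.count;
            br.count -= 8;
        }
        br.low <<= shift;
        br.range = range;
    }

    static void vpx_write_bit(BoolWriterSketch br, int bit) {
        vpx_write(br, bit, 128);
    }

    // A literal is written bit by bit, most significant bit first.
    static void vpx_write_literal(BoolWriterSketch br, int value, int bits) {
        for (int i = bits - 1; i >= 0; i--) vpx_write_bit(br, (value >> i) & 1);
    }

    // Flush with 32 zero bits, then trim trailing zero bytes (a choice made in this sketch).
    static byte[] vpx_stop_encode(BoolWriterSketch br) {
        for (int i = 0; i < 32; i++) vpx_write_bit(br, 0);
        int end = br.pos;
        while (end > 1 && br.buffer[end - 1] == 0) end--;
        return Arrays.copyOf(br.buffer, end);
    }

    static byte[] encodeCompressedHeader() {
        BoolWriterSketch br = new BoolWriterSketch();
        vpx_write_bit(br, 0);        // marker bit expected at the start
        vpx_write_literal(br, 3, 2); // tx_mode = ALLOW_32X32
        vpx_write_bit(br, 0);        // tx_mode_select
        for (int i = 0; i < 4; i++) vpx_write_bit(br, 0);  // update_probs: 4x4, 8x8, 16x16, 32x32
        for (int i = 0; i < 3; i++) vpx_write(br, 0, 252); // skip prob, contexts 0 to 2
        return vpx_stop_encode(br);
    }

    public static void main(String[] args) {
        for (byte b : encodeCompressedHeader()) System.out.print((b & 0xff) + " ");
        System.out.println();
    }
}
```

Tracing this sketch by hand, the flush produces 96 followed by zero bytes, so after trimming it prints the single value 96.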

Run the program and it will give the following output:

The single output byte, 96, is a decimal value. In hex it's 0x60. We need to ensure that a video player won't run out of bytes when it's reading the compressed header, so we'll add 1 padding byte, a value of 0, just to be safe. That would be 96 0 (decimal) or 0x60 0x00 (hex).

That means that the length of our compressed header is 2 bytes. Now we can go back and fill in the header_size_in_bytes field we skipped in the previous chapter.

Start your hex editor and open uncompressed_header, the file we saved in the last chapter.

We need to change the last 2 bytes from 00 00 to 00 02. Since the first byte was already 00, I only had to change the second one (shown in red).
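The same patch can be sketched in Java. HeaderSizePatch and setHeaderSize are hypothetical names, and the 14-byte array is only a stand-in for the file contents. The size goes into the last two bytes with the most significant byte first, which is why 2 becomes 00 02.

```java
// Sketch only: patches the 16-bit size field at the end of a header held in memory.
class HeaderSizePatch {
    // Overwrite the final two bytes with size, most significant byte first.
    static void setHeaderSize(byte[] header, int size) {
        header[header.length - 2] = (byte) (size >>> 8);
        header[header.length - 1] = (byte) size;
    }

    public static void main(String[] args) {
        byte[] header = new byte[14]; // stand-in for the 14-byte uncompressed header
        setHeaderSize(header, 2);     // the compressed header is 2 bytes long
        System.out.printf("%02X %02X%n", header[12], header[13]); // prints 00 02
    }
}
```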

Save the file. Your uncompressed header is now complete.

Compressed header in hex

This part is very easy. Create a new file in the hex editor and make sure you're in hex mode (you will be by default; you'd know if you weren't).

You should have an empty hex file and have the text/selection cursor at the very beginning as shown.

Enter 60 00 in the hex area on the left.

Now save it as compressed_header.

I recommend moving your headers into a folder called VP9 Files.

You now have valid headers for a VP9 intra frame. The next step is to encode the image data.

Wednesday, October 4, 2017

VP9 ELI5 Part 4

Okay, so now that we've examined what RAW video data is composed of, we can get to work creating a valid VP9 video file.

For this step, we'll need to look at the VP9 bitstream specification PDF. Don't worry, I'll walk you through the whole thing. But first, read the section on video containers.

Video Containers

Once we have a valid VP9 frame we will need a container to hold our finished VP9 video. A container is simply a file format that contains information for a video player like Media Player Classic to find individual frames in a file that would otherwise have been RAW.

Good container examples are *.MP4 and *.WAV files. MP4 isn't actually a video format, it's a file structure that can contain several different video formats, such as H.264. In a similar way, WAV isn't an audio format, it's simply a file that contains information about how to play RAW audio, usually LPCM.

We will be using an *.IVF (Indeo Video File) video container to hold our finished VP9 stream. It's really simple and easy to build in a hex editor.

Putting together a VP9 video

A standard VP9 bitstream is not a whole video as you might have assumed, but instead an individual frame. Think of a container (in our case, *.IVF file) as a ZIP archive with individual images. When encoding VP9, instead of compressing the entire video into one bitstream, we encode one frame, write it to the container file, and repeat until the end.

So a VP9 bitstream is just one frame. Here is how it is structured:

Adapted from the VP9 Bitstream Specification PDF.

We start with the uncompressed header (easy to generate), then write the compressed header, and finally start writing tile (image) data.

Advanced info: For simplicity and coding efficiency, we will only be dealing with 1 tile in our VP9 frames.

Here's the structure of a VP9 frame in text form. Don't worry that this doesn't make sense yet. We'll go bit-by-bit (literally).

Let's open Notepad++ and get started hand-crafting our bitstream.

We'll skip the first line startBitPos = get_position() and move on to uncompressed_header().

The bitstream spec defines different data structures depending on certain video parameters, so we need to define our parameters beforehand. We know it's a 960x540 image, but what Profile will the video be? VP9 supports 4 profiles, 0-3. Profile 0 is the default that YouTube videos are saved as. It can only carry 8-bit YUV 4:2:0. Profile 2 (the third profile) supports 10-bit pixels with YUV 4:2:0 color, so we'll be using Profile 2.

Here are the parameters we need to write, in order, in our case.

Uncompressed Header

frame_marker: 2 bits (1 0)
Always the case

profile_low_bit: 1 bit (0)
profile_high_bit: 1 bit (1)

Since Profile is 2 and 2 in binary is 10, the high bit is 1 and the low bit is 0.

show_existing_frame: 1 bit (0)
We are not trying to show a previously-decoded frame

frame_type: 1 bit (0)
0 means key frame, or initial image

show_frame: 1 bit (1)
Yes, we want to show the frame

error_resilient_mode: 1 bit (1)
Enable error resilient mode

Since frame_type = 0 (keyframe),
      frame_sync_code: 24 bits (01001001 10000011 01000010)
            ten_or_twelve_bit: 1 bit (0)
            0 because this is 10-bit data.

            color_space: 3 bits (010)
            Color space equals 2 (binary 010) for BT.709

            color_range: 1 bit (0)
            This designates Studio Swing, the 16-235 and 64-940 range limit we discussed earlier.
            However, it is not enforced by the VP9 spec. It must be enforced by players.

            frame_width_minus_1: 16 bits (0000001110111111)
            959 in binary form

            frame_height_minus_1: 16 bits (0000001000011011)
            539 in binary form

            render_and_frame_size_different: 1 bit (0)
            The render size will be the same as the frame size, 960x540

frame_context_idx: 2 bits (00)
Save our frame context into the 0 position
(not really important since we reset after each frame)

      loop_filter_level: 6 bits (000011)
      Let's set this to 3 (11 binary)

      loop_filter_sharpness: 3 bits (011)
      Let's set this to 3 (11 binary)

      loop_filter_delta_enabled: 1 bit (0)
      Do not enable loop filter delta (being able to change the loop filter for
      different parts of the image)

      base_q_idx: 8 bits (00000001)
      Base quantizer (divisor for shrinking large values later on)
      This provides the lossy part of VP9 compression. I've set it to 1 because
      this is more about understanding the format and creating a valid image than
      about compressing.

      delta_coded: 1 bit (0)
      No change to the quantizer required for the DC component of transforms of B/W image
      (transforms are covered later)

      delta_coded: 1 bit (0)
      No change to the quantizer required for the DC component of transforms of
      color add-on data

      delta_coded: 1 bit (0)
      No change to the quantizer required for the AC components of transforms of
      color add-on data

      segmentation_enabled: 1 bit (0)
      Segmentation is a really cool feature but for simplicity we won't be using it.

      increment_tile_cols_log2: 1 bit (0)
      Do not increment that variable. Keep it at 0 since 2 ^ 0 = 1 and we only want 1 tile column.

      tile_rows_log2: 1 bit (0)
      Do not increment that variable. Keep it at 0 since 2 ^ 0 = 1 and we only want 1 tile row.

header_size_in_bytes: 16 bits (Can't know it yet; must complete next chapter first)
We'll pretend it's 16 0's (0000000000000000) for now.

Putting it together

So if you enter all these bits in order into Notepad++, you should start to have something like this:

(Obviously I superimposed this blog entry over Notepad++; I did not somehow manage to type formatted text in there)

Putting spaces between the fields is optional. I did it for clarity.

Anyway, once you've entered all the bits into Notepad++ (including the 16 0's for the end data we don't know), then you will need to press Ctrl+H to get the Find/Replace dialog if you added spaces. Skip this paragraph if you didn't add spaces. Clear the search field and press the spacebar once. Clear the Replace With field and press Replace All. Now there shouldn't be any spaces in the file.

Now that the spaces are gone, look at the status bar at the bottom of the window. If you entered the bits properly and removed any spaces, it should say Length: 112.

112 is divisible by 8 so this will fit in 14 bytes (112 / 8) without a need for extra empty bits to make a whole byte.

Now you'll need to separate the string of bits into groups of 8. Just start at the beginning, use your right arrow key to advance 8 characters, press Enter, and repeat. I know this is tedious, but don't worry, we only do this once.

It should look like this when it's done.

Now you can either manually copy-paste this one-by-one into MS Calculator in Bin mode and convert to Dec (tedious), or you can copy-paste the whole list into the top box at (recommended) and use that tool to get the whole list at once.
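If you would rather let a program do the grouping and conversion, here is a minimal Java sketch of the manual steps above. HeaderBitPacker and pack are hypothetical names; the bit string is the field list from this chapter, with the 16 placeholder zeros at the end.

```java
// Sketch only: turns the header bit string into decimal byte values.
class HeaderBitPacker {
    // Fields in the order listed in this chapter; spaces are only for readability.
    static final String HEADER_BITS =
          "10 0 1 0 0 1 1 "                     // frame_marker through error_resilient_mode
        + "010010011000001101000010 "           // frame_sync_code
        + "0 010 0 "                            // ten_or_twelve_bit, color_space, color_range
        + "0000001110111111 0000001000011011 "  // frame_width_minus_1, frame_height_minus_1
        + "0 00 "                               // render_and_frame_size_different, frame_context_idx
        + "000011 011 0 "                       // loop filter level, sharpness, delta_enabled
        + "00000001 0 0 0 "                     // base_q_idx and the three delta_coded bits
        + "0 0 0 "                              // segmentation_enabled, tile cols, tile rows
        + "0000000000000000";                   // header_size_in_bytes placeholder

    // Remove spaces, then convert each group of 8 bits into one byte.
    static byte[] pack(String bits) {
        String clean = bits.replace(" ", "");
        if (clean.length() % 8 != 0) {
            throw new IllegalArgumentException("bit count is not a multiple of 8");
        }
        byte[] out = new byte[clean.length() / 8];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(clean.substring(i * 8, i * 8 + 8), 2);
        }
        return out;
    }

    public static void main(String[] args) {
        for (byte b : pack(HEADER_BITS)) System.out.print((b & 0xff) + " ");
        System.out.println();
    }
}
```

Working the field list through by hand gives 14 bytes: 147 73 131 66 32 29 248 16 216 13 128 64 0 0.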

Now open a hex editor with decimal input support, such as HHD Hex Editor Neo. Create a new file, press Ctrl+1 (for Hex Editor Neo) to set it to decimal mode, click in the first blank entry in the large area on the left and start typing these numbers in order, pressing Enter after each one.

Save this file as "uncompressed_header". We will come back to it after we finish the next chapter.

Tuesday, October 3, 2017

VP9 ELI5 Part 3

For this part, I'll need to define how the input image data (what we're encoding) is stored.

Pixels and color storage methods

Up until now, I only said that image data is not RGB, but rather YUV. But that doesn't tell you anything about how it's stored on the computer. The image, as you know, is 960x540 pixels. That means 960 pixels wide by 540 pixels high. When it was taken by the camera and when it's shown on your screen, it is merely a 2-dimensional array of pixels. It was not subdivided because it's just a plain image, not yet encoded with VP9.

Let's zoom in so you can get an idea of the individual pixels in the B/W image.

If you use the eyedropper tool in MS Paint, you can pick up a color from an image. You can see just what that color's RGB value is by going into Edit Colors.

As you can see, white is 255, 255, 255. As we discussed earlier, pixel values in VP9 follow a similar system but with YUV. Shades of gray, which is what the B/W YUV image is made up of, can be made in RGB by setting all three numbers to the same value.

Most importantly, why do you think MS Paint is limited to 255? It's because colors on a computer are almost always 8 bits (bits means 0's and 1's) per channel. On a computer screen, channels are RGB, hence MS Paint's use of RGB instead of YUV.

If you have 8 bit fields capable of holding 0 or 1, you can think of it as a row of light switches. How many combinations of On and Off can you come up with by flipping 8 light switches? The answer is 2 raised to the 8th power, or 256. That's because you have 2 states (On/Off) and you have 8 places that can be set to either state. But if you have 10 light switches, that would be 2 to the 10th, or 1024.

The highest value you can store in 8 bits is 255, not 256, because 256 is how many states you have and one of them is 0. Counting up to 256 would be like having 60 minutes in an hour but ending each hour on a clock at 60 instead of 59.
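The light switch arithmetic can be checked with a short Java sketch. BitDepthRange, states, and maxValue are hypothetical names used only here.

```java
// Sketch only: number of states and highest value for a given bit depth.
class BitDepthRange {
    // 2 raised to the number of bits.
    static int states(int bits) {
        return 1 << bits;
    }

    // One less than the number of states, because counting starts at 0.
    static int maxValue(int bits) {
        return states(bits) - 1;
    }

    public static void main(String[] args) {
        System.out.println(states(8) + " states, max " + maxValue(8));   // 256 states, max 255
        System.out.println(states(10) + " states, max " + maxValue(10)); // 1024 states, max 1023
    }
}
```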

In this tutorial we will be dealing with 10-bit pixels, capable of ranging in value from 0 to 1023 instead of the usual 8-bit values you may be familiar with that range from 0-255. This is because it's a proven fact that 10-bit video carries higher quality at a given bitrate than 8-bit video. I'm not joking when I say that a 10-bit 1080p video at 1 Mbps can actually look a lot better than an 8-bit 1080p video at 2 Mbps.

Understanding RAW image data

You may have heard of RAW images taken by expensive DSLR cameras, but that's not what we're talking about here. In video, RAW data is a bit different.

A common format for RAW video data is YUV420p. 8 bits per value is implied because a bit depth wasn't specified. You can export to this format using a tool called FFmpeg. I'll describe how YUV420p files are structured before we move on because although we won't be using YUV420p, we will be using YUV420p10LE which is quite similar but more complicated.

In a RAW YUV420p video file, each frame is stored one after the other with no separators. It is simply the Y (B/W image), U and V. Then it repeats for the next frame.

You're probably familiar with the fact that computer data is stored as bytes, and that bytes are 8 bits each. Therefore, a byte can represent any number from 0 to 255. In a readable text file, each letter or other character is simply a number (usually less than 128) that your text reader knows corresponds to a letter or character. But if you've ever opened an image or program in a text editor, you know that they contain weird random characters. That's because the bytes in the file are storing numbers that don't correspond to readable characters.

(A PNG file opened in Notepad)

In a YUV420p video, we first store the B/W image pixel-by-pixel in raster scan fashion, meaning left to right and top to bottom. Note that there is no subdivision here. The raster scan does the whole image line by line.

So you see, raster scanning begins at the absolute top-left pixel, scans to the right until the end, goes down one, jumps back to the left, and goes again.
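In code, raster order simply means a pixel's position in the file is its row times the image width plus its column. Here is a minimal Java sketch; RasterScan and rasterIndex are hypothetical names, and it assumes one byte per pixel as in 8-bit YUV420p.

```java
// Sketch only: position of a pixel inside the B/W plane of one frame.
class RasterScan {
    // x is the column (0 = left), y is the row (0 = top).
    static int rasterIndex(int x, int y, int width) {
        return y * width + x;
    }

    public static void main(String[] args) {
        int width = 960;
        System.out.println(rasterIndex(0, 0, width));     // top-left pixel: 0
        System.out.println(rasterIndex(959, 0, width));   // end of the first row: 959
        System.out.println(rasterIndex(0, 1, width));     // start of the second row: 960
        System.out.println(rasterIndex(959, 539, width)); // bottom-right pixel: 518399
    }
}
```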

[You don't need to know the following formulas unless you choose to follow along. If you're just reading then you'll be fine without them]

That will take up (width * height) bytes, since each pixel takes one byte. Then we need to store the U and V images. Since the width and height of those are 1/2 the dimensions of the B/W image, the data for U and V will each be ((width / 2) * (height / 2)) bytes. Since we must store both U and V this way, together they take 2 * ((width / 2) * (height / 2)) bytes.

So now we have (width * height) + (2 * ((width / 2) * (height / 2))) bytes of data for just one frame. With an image size of 960x540, that comes out to 777,600 bytes total.

So for each frame in our YUV420p RAW video, the first 518,400 bytes will be the B/W image, the next 129,600 bytes will be the U image, and the remaining 129,600 bytes will be the V image.
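The same formulas as a minimal Java sketch. Yuv420pFrame and its method names are made up for this example, and it assumes the width and height are even.

```java
// Sketch only: byte counts for one 8-bit YUV420p frame.
class Yuv420pFrame {
    // The B/W (Y) image takes one byte per pixel.
    static int lumaBytes(int width, int height) {
        return width * height;
    }

    // Each of U and V is half the width and half the height.
    static int chromaBytes(int width, int height) {
        return (width / 2) * (height / 2);
    }

    // One frame is Y followed by U followed by V.
    static int frameBytes(int width, int height) {
        return lumaBytes(width, height) + 2 * chromaBytes(width, height);
    }

    public static void main(String[] args) {
        System.out.println(lumaBytes(960, 540));   // 518400
        System.out.println(chromaBytes(960, 540)); // 129600
        System.out.println(frameBytes(960, 540));  // 777600
    }
}
```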

Understanding YUV420p10le RAW image data

YUV420p only handles values from 0 to 255 because it's an 8-bit format. Since we want to work with 10-bit image data, we need a way to store values from 0 to 1023. This is where YUV420p10le comes in.

Since files are stored as bytes and bytes are multiples of 8 bits, we can't easily store 10-bit values because we'd have a fraction of a byte left over after almost every 10-bit value we write. This is why YUV420p10le stores the 10-bit values inside a larger 16-bit value. Since 16-bit numbers can range from 0 to 65,535 it will easily store our 0-1023 values.

So now we have to store 2 bytes for each value (16 bits = 2 bytes). The file structure stays the same so our file size doubles to 1,555,200 bytes or 1.48 MiB.

By now you may have guessed that having a higher range of available values makes for a higher-quality image. In a normal PC image, for example, 8 bits per color channel times 3 channels (RGB) equals 24 bits per pixel overall, or 16.7 million different colors. But what if you only have 256 colors at your disposal? The image will look quite bad.

(The image in only 256 colors. Click to enlarge and see how bad it is)

Videos encoded with 8-bit YUV420p look great. In fact, it's the bit depth that all YouTube videos are encoded with, even the ones that claim to be HDR which is at least 10-bit. But 10-bit video is better because you can express a far greater range of light and dark. Converting from 8-bit to 10-bit increases the range by simply multiplying the values to bring them up to the 10-bit range, but there will not be any increase in actual quality because you can't get information that was never there to begin with.

If that sounds confusing, let's say you have a gradient of 8-bit pixels: {15, 16, 17, 18, 19}. Now if you wanted to convert them to 10-bit, you would multiply each entry by 4 to get {60, 64, 68, 72, 76}. But notice that you don't have any values in-between whole multiples of 4. If you had recorded the video with a 10-bit camera you could have any value in the 10-bit range (theoretically), but since you converted from 8-bit you will only have the 256 different 8-bit levels spread evenly within 1024 total levels.

When you convert an 8-bit video to 10-bit, this is all you're doing.
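As a minimal Java sketch of that multiply-by-4 conversion (BitDepthConvert and to10Bit are hypothetical names):

```java
import java.util.Arrays;

// Sketch only: stretch 8-bit values onto the 10-bit scale by multiplying by 4.
class BitDepthConvert {
    static int[] to10Bit(int[] eightBit) {
        int[] out = new int[eightBit.length];
        for (int i = 0; i < eightBit.length; i++) {
            out[i] = eightBit[i] * 4; // same as shifting left by 2 bits
        }
        return out;
    }

    public static void main(String[] args) {
        int[] gradient = {15, 16, 17, 18, 19};
        System.out.println(Arrays.toString(to10Bit(gradient))); // [60, 64, 68, 72, 76]
    }
}
```

Every output is a whole multiple of 4, which is the gap in levels described above.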

This means that in our case the only purpose of converting to 10-bit is so we can have 10-bit values for our VP9 encoder to work with.

Opening RAW videos in a hex editor

Let's see what RAW video data looks like in a hex editor. What you're looking at in these hex screenshots are the pixel values at the beginning of the file, the first few pixels in the top row of the B/W image. These are shades of gray expressed as numbers, but if that doesn't make sense then I'll try to explain.

Here is what the 8-bit image looks like in a hex editor:

On the left we have hex digits and on the right, bytes as they'd be seen in a text editor

And this is the image in 10-bit format:

In this 10-bit example, each pixel value takes up 2 bytes.

There is an annoying quirk you have to watch out for on multi-byte numbers. Most computers read and write them as Little Endian which means byte-reversed.

Let's take the example of 255 which is 0xFF in hex. Hex works by using 0-9 and A-F to make a base-16 number system. 2 hex digits make a byte.

If we have a hex byte of 0xFF (255 decimal) and add 1 to make 256, it would "overflow" and we would get 2 hex bytes: 0x01 0x00. On Windows Vista and up, you can try setting Windows calculator to Programmer and use the Hex mode to enter 0100, then click Dec and it will say 256.

So now that we've established that, let's say you wanted to record that number in a file. You know about place value in the decimal system (tens, hundreds, etc) so you'd expect to just write 0x0100 to your file, right? Well, it turns out that due to a computer quirk carried from the 1970's, the computer would prefer it if you did it in reverse as 0x00 0x01. This applies even to longer hex numbers like 0xDEADBEEF. The bytes (pairs of hex digits) must be written in reverse order; do not reverse the single digits inside each pair. 0xDEADBEEF should become 0xEFBEADDE, not 0xFEEBDAED.

Since this weird reverse notation is called Little Endian, it makes sense that the normal order you'd expect to use is called Big Endian. Keep in mind that the bytes (pairs of hex digits) are what's reversed, never bits or single hex digits.

Little Endian is what the LE in YUV420p10LE stands for.
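Here is a minimal Java sketch of the byte reversal. LittleEndianDemo, toLittleEndian, and readSample16 are hypothetical names. The second method shows how a 16-bit sample would be rebuilt from the two bytes stored in the file.

```java
// Sketch only: writing and reading multi-byte numbers in Little Endian order.
class LittleEndianDemo {
    // Split a 32-bit number into 4 bytes, least significant byte first.
    static byte[] toLittleEndian(int value) {
        return new byte[] {
            (byte) value, (byte) (value >>> 8), (byte) (value >>> 16), (byte) (value >>> 24)
        };
    }

    // Rebuild a 16-bit sample from the two bytes in the order they appear in the file.
    static int readSample16(byte first, byte second) {
        return (first & 0xff) | ((second & 0xff) << 8);
    }

    public static void main(String[] args) {
        for (byte b : toLittleEndian(0xDEADBEEF)) System.out.printf("%02X ", b & 0xff);
        System.out.println();                                     // prints EF BE AD DE
        System.out.println(readSample16((byte) 0x00, (byte) 0x01)); // prints 256
        System.out.println(readSample16((byte) 0xAC, (byte) 0x03)); // prints 940
    }
}
```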

So now let's set our hex editor to group by Words (Word = 16 bits).

By default, it's set to Little Endian so these hex bytes are reversed to Big Endian for proper viewing.

Let's press Ctrl+1 or go to View -> Display As -> Decimal. Now you can see that the values are all normal human-readable numbers and that they never exceed 940.

[Random trivia, not necessary to continue] Why don't the values exceed 940 when we were supposed to convert to a 0-1023 range? Because video of photographic subjects (i.e., not computer screen captures) customarily ranges not from 0-255 or 0-1023, but rather from 16-235 or from 64-940 (16-235 times 4).

Now you can see what I mean by shades of gray expressed as numbers.

VP9 ELI5 Part 2

Here's the photo split into its 3 YUV components:
The U and V are the color add-on data. Notice how they are 1/4 the size of the B/W image.

Most video formats, VP9 included, depend on both intra and inter frames. An intra frame is a normal picture while an inter frame is just the difference between the current frame and a previous one. Think of it as only saving what moved in the current frame instead of saving the whole picture again.

Obviously we need to start a video with an intra frame, also known as a key frame, so that we can build future inter frames based on it.

Unlike BMP's, we don't save VP9 images pixel-by-pixel in order. Instead, we divide the image on several levels. The first, highest level of division is into 64x64 superblocks.

64x64 is a very large subdivision size. In MP4 (H.264), the largest size available is 16x16. Notice the difference below:



The superblocks are processed in raster order. That means left to right, top to bottom. In other words, each horizontal line of superblocks is processed from left to right, in order from top to bottom.
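As a rough Java sketch of that ordering for our 960x540 photo. SuperblockGrid and its methods are hypothetical names. It assumes a partial superblock at the right or bottom edge still counts as a superblock, so the counts are rounded up.

```java
// Sketch only: how many 64x64 superblocks cover the image, and their raster order.
class SuperblockGrid {
    static final int SB_SIZE = 64;

    // Number of superblocks needed to cover a dimension, rounded up (assumption).
    static int count(int pixels) {
        return (pixels + SB_SIZE - 1) / SB_SIZE;
    }

    // Position of a superblock in processing order: left to right, top to bottom.
    static int rasterIndex(int sbRow, int sbCol, int sbCols) {
        return sbRow * sbCols + sbCol;
    }

    public static void main(String[] args) {
        int cols = count(960);
        int rows = count(540);
        System.out.println(cols + " x " + rows + " = " + (cols * rows)); // 15 x 9 = 135
        System.out.println(rasterIndex(1, 0, cols)); // first superblock of the second row: 15
    }
}
```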

Each superblock is then optionally subdivided further. Here are some possible subdivisions:

Notice what subdivisions are allowed. Shown from left to right, we can split a superblock:
  • Vertically to make 2 32x64 blocks,
  • Horizontally to make 2 64x32 blocks, or
  • Four ways to make 4 32x32 blocks

After that, we can split the 32x32 blocks the same way or we could leave them alone. However, we may not further split one of the non-square sizes. Any non-square split is final in VP9. Also note that 4x4 is technically as small as we can get but things get messy at that size so for this tutorial we will not go smaller than 8x8.

This is just the first level of subdivision. The next level is the transform level and is more interesting.

VP9 ELI5 Part 1

Welcome to my VP9 Explain Like I'm 5 series. I'm going to explain the VP9 codec so anyone can understand it and perhaps even create their own encoder.

What is VP9 and why should I care?

First things first, let's establish what VP9 is. It's a video codec created by Google and it's what almost all YouTube videos are saved as on the server. Previously, YouTube videos were all MP4 (specifically, H.264). There were two problems with that. The first is that it's patented and copyrighted, so you have to pay the MPEG group. Second, it takes a lot of data for any given quality level. Mobile consumers aren't the only ones who pay by the gigabyte for their data usage; businesses have to pay for that as well to serve people. Normally video has a tradeoff: you can have higher quality for more data usage, or less data usage with bad quality. VP9, amazingly enough, allows Google to pack higher quality into less data, achieving both benefits at once. Unfortunately, only Google and Netflix use VP9; everyone else still uses MP4.

Video definition

You probably already know it's a series of still pictures called frames. If you took pictures really fast with your still picture camera, you could play them quickly to see a crude video. Video is merely a set of still pictures played one after the other really fast.

What are these pictures composed of?

You may be thinking, "A picture is simply dots (pixels). Each one has an RGB (red, green, blue) color value. I know because I've done pixel art in MS Paint." Well, I've got to tell you, that's not how images work in a video.

To explain this, I'll need to give a brief history lesson. You see, video frames could easily use RGB colors like you might have assumed. However, the US TV system was defined near the beginning of World War II and only carried black-and-white video until the 1950's. So instead of red, green, and blue, the old TV system only carried brightness values (shades of gray). You can approximate that in MS Paint by using the same number for red, green, and blue.

In the 50's, they wanted to add color to TV without breaking compatibility with existing TV's. They weren't willing to go the 2009 route and make everyone buy a new TV. What they did was mix a small amount of color data onto the existing black-and-white signal. That way, old TV's would still decode it while color TV's would add the color info on top of the black-and-white picture.

The FCC had allocated 6 MHz of bandwidth for each TV channel. The engineers couldn't add too much extra color without the signal bleeding onto other channels, so they devised a system of saving only a low-res copy of the image in color format while saving the full-res image in B/W format. The low-quality color image was enlarged and superimposed over the B/W image. This still makes good color because the human eye is more sensitive to brightness detail than to color detail.

Even though NTSC is nearly dead, this system lives on in computers as YUV 4:2:0. If you want to read more, check out the YUV Wikipedia page.
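To put numbers on the savings, here's a rough Python sketch that counts samples in a YUV 4:2:0 frame. In 4:2:0, each of the two color planes is half the width and half the height of the brightness plane. The function name and the round-up rule for odd sizes are my own assumptions for illustration, and 960x540 is just an example frame size.

```python
def yuv420_plane_sizes(width, height):
    """Return (luma_samples, chroma_samples_per_plane) for 4:2:0."""
    # Each chroma plane is halved in both directions.
    # Rounding up for odd dimensions is an assumption of this sketch.
    chroma_w = (width + 1) // 2
    chroma_h = (height + 1) // 2
    return width * height, chroma_w * chroma_h


luma, chroma = yuv420_plane_sizes(960, 540)
yuv_total = luma + 2 * chroma   # one Y plane plus the U and V planes
rgb_total = 3 * 960 * 540       # full-res R, G, and B for comparison

print(luma, chroma, yuv_total, rgb_total)
```

For this frame size, the 4:2:0 frame holds exactly half as many samples as the RGB frame.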

Let's get started

For this series, we'll be working with the following 960x540 color photo. In case you didn't know, that's 1/4 the pixel count of 1920x1080 (half the width and half the height).

Credit: Lt. Zachary West, 100th MPAD, Flickr, CC BY 2.0

The original wasn't actually 960x540, but I cropped and resized it so it would be a nice easy size.

If you want to follow along, you can download the image.

Monday, September 18, 2017

OpenCodecDev Discord channel

I just created a new Discord channel called OpenCodecDev, for the purpose of discussing open-source media codecs. It's public so anyone can join. We also have a voice channel.

Here's what it will look like in the Discord desktop app. I'm under Admin, and Foxx is a newcomer.

Notable features of Discord include:

  • Posting images, whether photos or pasting from PrintScreen.
  • Blog URL's become a block with an excerpt from the post
  • YouTube URL's are expanded into a player in the chat history
  • Unlimited backlog which is stored on the server. You don't miss out on messages that happened while you were logged out, unlike IRC, and no past message is ever lost.
  • The backlog is searchable.
This will be more convenient than the Google group, and it has a better GUI than IRC.

Monday, August 14, 2017

Root-Raised Cosine

Yesterday I finally figured out the root-raised cosine (RRC). I've been trying to understand it for about a year. It's essential for transmitting PSK signals, because merely mixing a square wave with a carrier wave makes a wave with sharp transitions that cause lots of spurious signals. The clean, narrow PSK signals you may have seen are all using the root-raised cosine.

Here is the resource that I found yesterday to explain it properly: Pulse Shaping with raised cosine filters.

I found it confusing at first for two reasons. The first is that I didn't know if Wikipedia's formula was for time or frequency domain, and the second is that I had no idea that the RRC is centered on each PSK symbol, meaning that a time of 0 is the center.

Here is the time-domain formula from the University of Stuttgart's Webdemo (linked above):
[REF] Stephan ten Brink, "Pulse Shaping," webdemo, Institute of Telecommunications, University of Stuttgart, Germany, Aug. 2017. [Online] Available:

Why to use it

If I told you to multiply a square wave with a cosine and sine wave to make a QPSK signal, you'd get a result similar to the top stereo track shown below.

Those are some sharp transitions. This is what I got when I first tried to make my own QPSK signals. It seems well and good, right? We have our digital wave mixed with I (cosine) and Q (sine) to make an IQ signal playable in an SDR program. Well, yes, but there's a slight problem...

[Vertical is frequency, horizontal is time]
This isn't what QPSK is supposed to look like. See all the spurious signals splattering everywhere? Satellites like Inmarsat have neat and narrow QPSK, so why does mine look so bad?

It turns out that we've simply placed a square wave (which is full of harmonics) into the RF spectrum by mixing with a carrier.
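For anyone who wants to reproduce the unfiltered signal, here is a minimal Python sketch of square-wave QPSK as IQ samples. The bit-to-level mapping (1 becomes +1, 0 becomes -1) and the function name are assumptions I picked for illustration; any consistent mapping shows the same sharp transitions.

```python
def unfiltered_qpsk(bits, samples_per_symbol=25):
    """Turn pairs of bits into square-wave IQ samples (no filtering)."""
    if len(bits) % 2:
        raise ValueError("QPSK needs an even number of bits")
    samples = []
    for k in range(0, len(bits), 2):
        # First bit of each pair drives I, second bit drives Q.
        i_level = 1.0 if bits[k] else -1.0
        q_level = 1.0 if bits[k + 1] else -1.0
        # Hold the same level for the whole symbol: that's the square wave.
        samples.extend([complex(i_level, q_level)] * samples_per_symbol)
    return samples


iq = unfiltered_qpsk([1, 0, 0, 1])
print(len(iq), iq[0], iq[25])
```

Every symbol boundary is an instant jump from one level to another, which is where the harmonics come from.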

Now, notice the bottom stereo track. It is the same QPSK signal, but smoothed out using a root-raised cosine filter.

Notice how narrow it becomes:

The signal is also good enough that Signals Analyzer can lock onto the 80 kbit/s bitrate:

I initially made the mistake of entering 40000 in the BR (bitrate) field because it's 40 kHz QPSK. That was wrong: with QPSK the bitrate is twice the symbol rate, so the correct value is 80000.

(Below) SA can also lock onto the bitrate of the unfiltered QPSK, which means that although it's undesirable for transmitting, it is nonetheless a valid signal (although I did have to zoom out the bottom-left constellation window a bit).

How to use it

The formula generates "taps", which means an array of values to be used on the signal you want to process. In our case, we multiply the taps by our signal.

Here are the variables:

t: time, in fractions of a second, since the center of the symbol.
T: length of half a symbol, in seconds (1 / (2*symbol rate)). (Why not 1/symbol rate? Pitfall explained below)
alpha: roll-off factor, ranges from 0 to 1 (1=wide, 0=brick wall filter)

To maintain the parameters of the signals shown earlier, let's assume we want a QPSK signal with a symbol rate of 40 kHz (80 kbit/sec) and we'll have it in an IQ file sampled at 1 MHz.

Our variables would be:
t: x/sample rate (in our case, x/1000000). x is the FOR loop variable.
T: 0.0000125
alpha: 0.1 (a very steep roll-off, close to the brick wall end of the range)

40 kHz is a convenient value since we want an odd number of taps. Since 1,000,000/40,000 = 25, it will take 25 samples to make one symbol and so we need 25 taps.
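The arithmetic behind those values can be checked with a few lines of Python. This is just a sketch; the variable names are mine.

```python
sample_rate = 1_000_000   # IQ file sample rate, in Hz
symbol_rate = 40_000      # QPSK symbols per second
bits_per_symbol = 2       # QPSK carries 2 bits in every symbol

bit_rate = symbol_rate * bits_per_symbol          # bits per second
samples_per_symbol = sample_rate // symbol_rate   # also the number of taps
T = 1 / (2 * symbol_rate)                         # half a symbol, in seconds

print(bit_rate, samples_per_symbol, T)
```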

The center tap will be at index 12 (base 0), or 13 if you prefer base 1. Time 0 belongs at that center tap, so the first FOR loop counts backwards from 12 to 1 to fill the left half, the center tap is set directly, and a second FOR loop fills the right half.

for (x = 12; x >= 1; x--) {
    // Left half: taps[0] through taps[11]
    taps[12 - x] = [The formula depicted above, substituting (x/1,000,000) for t]
}

// Center tap (t = 0)
taps[12] = 1

for (x = 1; x <= 12; x++) {
    // Right half: taps[13] through taps[24]
    taps[12 + x] = [The formula depicted above, substituting (x/1,000,000) for t]
}
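If you'd like something you can actually run, here is a Python sketch of the same two loops. Since the formula image isn't reproduced here, `rrc` uses the standard textbook RRC impulse response, scaled so that the center tap equals 1. That scaling, the function names, and the special handling of the point where the formula divides by zero are my assumptions; the webdemo's version may differ by a scale factor.

```python
import math


def rrc(t, T, alpha):
    """Textbook RRC impulse response, scaled so that rrc(0) == 1."""
    if t == 0.0:
        return 1.0
    peak = 1.0 - alpha + 4.0 * alpha / math.pi  # unscaled value at t = 0
    x = t / T
    if alpha > 0 and math.isclose(abs(x), 1.0 / (4.0 * alpha)):
        # The plain formula is 0/0 here, so use its known limit instead.
        angle = math.pi / (4.0 * alpha)
        value = (alpha / math.sqrt(2.0)) * (
            (1.0 + 2.0 / math.pi) * math.sin(angle)
            + (1.0 - 2.0 / math.pi) * math.cos(angle)
        )
        return value / peak
    num = (math.sin(math.pi * x * (1.0 - alpha))
           + 4.0 * alpha * x * math.cos(math.pi * x * (1.0 + alpha)))
    den = math.pi * x * (1.0 - (4.0 * alpha * x) ** 2)
    return (num / den) / peak


def make_taps(sample_rate=1_000_000, T=0.0000125, alpha=0.1, half=12):
    """Build 2*half + 1 taps, mirrored around the center tap."""
    taps = [0.0] * (2 * half + 1)
    taps[half] = 1.0                      # center tap, t = 0
    for x in range(1, half + 1):
        value = rrc(x / sample_rate, T, alpha)
        taps[half - x] = value            # left half
        taps[half + x] = value            # right half (mirror image)
    return taps


taps = make_taps()
print(len(taps), taps[12])
```

Because the response is symmetric around t = 0, the sketch computes each value once and stores it on both sides of the center.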

This code will give you 25 taps. Think of it as a matrix with just one column; you multiply each tap by the sample at the corresponding point in time on an unfiltered QPSK signal, element by element. Just make sure to align this so that the taps begin at the beginning of the QPSK symbol, otherwise it won't filter properly. Here's a crude ASCII drawing of what I mean:

|  Taps  |     |  QPSK   |
| Matrix |  *  |   IQ    |
|        |     | samples |

Note that both matrices are only ONE symbol long; the taps repeat at the start of each symbol.
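The multiplication described above can be sketched in Python like this. It assumes the IQ samples start exactly on a symbol boundary and that every symbol is exactly as long as the tap array; the function name is made up for illustration.

```python
def apply_taps_per_symbol(iq_samples, taps):
    """Multiply each sample by the tap at the same position in its symbol."""
    n = len(taps)
    # i % n restarts the taps at the beginning of every symbol.
    return [sample * taps[i % n] for i, sample in enumerate(iq_samples)]


# Tiny demo with 3 taps per symbol instead of 25, to keep it readable.
demo_taps = [0.5, 1.0, 0.5]
demo_iq = [1 + 1j, 1 + 1j, 1 + 1j, -1 + 1j, -1 + 1j, -1 + 1j]
shaped = apply_taps_per_symbol(demo_iq, demo_taps)
print(shaped)
```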


The pitfall I was referring to is that 0 is the center. This is what my first RRC taps looked like when I calculated starting from 0:

I mistakenly thought that was the whole filter but it's only the right half. Again, here is the right half of another RRC filter:

And finally, here is the output of this mistake. I applied the right half of an RRC filter to the QPSK starting at the beginning of each symbol, which kept the ends from matching properly. When it's done right, the ends meet perfectly. I eventually found out it needed to be mirrored and applied with 0 being the center of the symbol.

This is why we use 1/(2 * symbol rate). If you use 1/symbol rate, then the right half will span the entire symbol time when you only want it to span half. With 1/(2 * symbol rate), each half will cover half the symbol.

I hope this helped if you had no idea how to program the RRC. Use the comment section below if you have any questions or if I left something out.

Tuesday, August 8, 2017

UTSC v1 Packet Specification

After showing the spec to Foxx, wordsun, and Corrosive, only Corrosive had a suggestion and it was to allow embedding a list of ID's in the packets to facilitate pay-per-view. In other words, broadcasters could include a list of ID's of various decoder boxes so only specific paying viewers can see a channel. This is in contrast to my current method of handling encryption like Wi-Fi, using a single password for the channel. I told him his one-key-per-viewer idea most likely wasn't feasible since the packets need to be small.

Here is a link to the document: utsc_finished_release_r1.txt


The UTSC name and specification are Copyright 2017 Designing on a Juicy Cup. The specification may be freely implemented by anyone for any purpose as long as this copyright notice is displayed in the license. The UTSC name may be used in products implementing this standard as long as attribution is made to Designing on a Juicy Cup.

Monday, August 7, 2017

UTSC v1 Standard Finalized

Since 2016, I've been working on a way to transmit digital TV in the 900 MHz Part 15 band. The main focus is on reliability, because ATSC fails miserably in that department. The second focus is on unlicensed operation, because broadcasting is a near-monopoly.

The format officially consists of a 1 Mbit/s data stream containing VP9 video at about 900 kbit/s and Opus audio at 48 kbit/s. Opus is extremely resilient and can withstand high loss, similar to analog TV's sound. It also sounds amazing at that bitrate. Other services, such as audio or data, could be conveyed as well.

My proposed standard is called UTSC. The acronym means nothing, officially. It is designed to be expansible like WAV, meaning that new features can be added without breaking compatibility with the first receivers. My current research suggests that I can fit 32 channels in the band in any given area.

The standard can accommodate any video codec, resolution, and frame rate in theory, but VP9 960x540 @ 30.000 fps is suggested.

I finalized the standard today and I'm documenting it here as proof that I devised this first. If someone else claims to have been first, you can verify with the Wayback Machine that no site before this date carried this info.

The encoder and air interface are proprietary and will not be released yet. However, I'm planning to release the packet format for public review. I'm submitting it privately to Foxx, Corrosive, and wordsun for a pre-review.

Saturday, June 24, 2017

Velvet Ant vs Ziploc Bag

About a week ago I saw a weird bug walking away from a wood pile. It looked dangerous so I caught it in a ziploc bag. It turned out to be a velvet ant. Its jaws were so powerful that it stretched and nearly punctured the bag when I held it taut. Knowing nothing about velvet ants, I didn't realize that the jaws were the least of my worries. I did not know I had to watch out for a stinger, but thankfully I wasn't stung.

As you'd expect from the bright coloring, the sting is bad: an article described the pain as "life-changing, pray-for-death pain". Here is a YouTube video of someone being stung by one:

Needless to say, I was glad to have caught it in a bag. Eventually the bag was placed under a basket on a table and forgotten.

Then today as I entered the living room I saw a bug running across the table. I thought it was a roach and hit it hard and flung it down so I could get a clear path to kill it. But after getting it onto the floor, I realized with horror that this was not a roach, but the velvet ant! Quickly snatching up an envelope, I put it on top of the retreating wasp (that's what they really are) and delivered one quick blow which instantly killed it. It was running to the edge of the table and if I had entered the living room just 5 seconds sooner or later I would've missed it.

Apparently velvet ants can escape from ziploc bags. Here is a picture of the hole it made:

Monday, June 12, 2017

Faulty Marvel Walkie-Talkies

Recently I had the chance to test some children's walkie-talkies. These are generic blue walkie-talkies that can accept plastic front plates with Marvel characters. The label did not specify the frequency but a quick Google search for the FCC ID, 08KAK-2, revealed that they operate in the 49 MHz band. However, that's not where they actually operate...

I played a song on YouTube while holding down the talk button and this is what I got:

(The vertical bars are my LED monitor)

Apparently, this is an incredibly unstable oscillator that actually operates in the 6 meter ham band.

Because of the waterfall, it was trivial to figure out that this was FM. While the width appears to be around 24 kHz here, it can go up to 75 kHz when you blow into the mic.

I'm really surprised that the other walkie-talkie can pick up the signal, considering how the transmitter jumps around not only each time you push talk, but even as you're transmitting.

As a ham (Extra class, by the way), I know I would HATE seeing something like this in the 6 meter band. But since the toys work despite their instability, I would expect any narrow FM in this range to "bleed" into the toy's passband, so kids should hear any hams they're interfering with.

Thursday, March 30, 2017

Huge ice maker output

This post is about computer cooling, primarily for GPU's and hard drives. Before I begin, I wanted to suggest that you enter to win an EVGA GTX 1080 Ti. The contest closes 12 days from now. Full disclosure: I do benefit if you enter.

My computer setup happens to be in a room that the previous owners neglected to insulate, so we don't usually run AC there. That presented quite a problem last summer when my computer's aging fan, even with the dust blown out, couldn't keep up and the computer would keep shutting down for its own protection. Losing work randomly made me eventually devise a system of a well-insulated box containing the computer, a fan, and frozen water bottles. It worked pretty well and the box was about as cold as AC, but it still didn't put out nearly as much consistent cold air as was necessary. Using this over the summer, I observed several problems:
  • The bottles didn't have much surface area
  • Each bottle quickly went from frozen solid to a water bottle with an ice core. The water was insulating the ice core, and melting ice "steals" a lot more energy than cold water possibly could (because of the heat of fusion).
  • None of my freezers could freeze bottles as fast as the computer could melt them. I eventually figured out that if I could freeze enough bottles initially, I could swap them in and out and have enough consistently. I worked out the math to find out how many bottles I would need.
I also noticed that our old freezer, a Kenmore Frostless from 1990, froze bottles slightly faster than the brand-new Frigidaire we got in 2014. Considering the old freezer draws less electricity, I suspect the R-12 is responsible. Naturally, I shifted the bulk of my ice production to the Kenmore.

This past winter (2016-2017), I decided to build a better system before it became necessary, so I would have time to perfect it. Since the ice does most of the work and the water (even though it's near-freezing) is virtually useless, I would need my new system to discard the water. Since freezers' automatic ice makers are much faster than freezing bottles, they would be the source of ice. This also solves the surface area problem since a pile of ice cubes will have a lot more area than a cluster of bottles.

I put together a crude system consisting of a plastic container with a hole in the bottom. I fill it with ice and mount a desk fan on top. The fan takes up about half of the top's area and forces air through all the cubes and out the top again. This system worked well. It produced consistent and very cold air, almost like a small AC unit. There are now only two problems:
  • The freezer's ice maker must be able to keep up
  • A bucket must be kept underneath and emptied periodically
This is a lot better than where I was last summer. Lately it's been cold to mild in South Carolina so I haven't had to put it into "production" use yet. However, the old freezer's ice maker was still too slow. A quick Google search for making ice faster revealed something called "Quick Ice", which is a feature on fancy fridges that uses a fan to blow across the ice maker. At one website they write, "For models with the Quick Ice feature, ice production can be increased by nearly 48% to about 6.2 lbs per day." This did not sound impressive, but I knew it couldn't get worse so I decided to try the idea and see how much I could make. Using my strongest CPU fan and the steel wire used for the Inmarsat antenna, I made a mountable fan setup for the old Kenmore freezer. Of course, I first checked to see exactly where the plastic ceiling rail was, so I wouldn't end up making the wrong mount. Then I mounted it and wired it to a 12-volt external hard drive power supply. I emptied the ice maker's bucket and then ran the ice maker and fan for 24 hours before checking it. When I returned, I was very surprised to have 14.3 LB of ice! That happens to be exactly 3 full ice cream buckets (below).

Here's the freezer's ice bucket before being emptied:

And here are my freezer settings:

At 144 BTU/lb for melting ice (the heat of fusion), 14.3 lb comes out to 2059.2 BTU of cooling per day, or 85.8 BTU/hour. Nowhere near an AC unit, but still quite useful for small enclosures. Plus, since I'm not there to get the ice during the night, I can get more BTU's per hour using it only during the hottest parts of the day. If that's 8 hours a day, I would get 257.4 BTU/hour.
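If you want to plug in your own ice maker's output, the math is simple enough to sketch in Python. The function and constant names are mine, and it only counts the heat absorbed by melting, not by the cold meltwater.

```python
HEAT_OF_FUSION_BTU_PER_LB = 144.0  # heat absorbed when 1 lb of ice melts


def ice_cooling(pounds_per_day, hours_of_use=24.0):
    """Return (total BTU per day, average BTU per hour of use)."""
    total_btu = pounds_per_day * HEAT_OF_FUSION_BTU_PER_LB
    return total_btu, total_btu / hours_of_use


total, per_hour_all_day = ice_cooling(14.3)
_, per_hour_8h = ice_cooling(14.3, hours_of_use=8.0)
print(total, per_hour_all_day, per_hour_8h)
```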

This is probably not financially feasible in the long term, but it's still a neat project. I was thinking that if I ever set up my GPU for a remote compute service and was still using ice, my website could have a footer that says, "Our GPU setup is proudly cooled with CFC-12, an energy-efficient refrigerant."