# Decoding bitstreams for fun and profit

by lft

This article describes a technique for extracting bitfields from a long sequence of bytes stored in RAM.

As an example application, consider a scroller where the text is a string of 5-bit character codes. The entire text could then be stored as a bitstream, from which you read five bits at a time. But you might save some space if you represent, say, the eight most common characters as 0xxx, and all other characters as 1xxxxx. (This would also give you 40 different characters, rather than 32.) In that case, you'd first want to read a 1-bit field, to differentiate between the two cases. Then you'd read either a 3-bit field or a 5-bit field.
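
To make the layout concrete, here is a rough Python model of such a scheme (the names and the particular character mapping are my own, chosen just for illustration):

```python
# Sketch of the 0xxx / 1xxxxx scheme described above. Bits are kept in a
# plain Python list for clarity; a real encoder would pack them into bytes.

COMMON = "etaoin s"  # hypothetical choice of the eight most common characters

def encode(text):
    bits = []
    for ch in text:
        if ch in COMMON:
            i = COMMON.index(ch)
            bits.append(0)                        # flag: common character
            bits += [(i >> k) & 1 for k in (2, 1, 0)]
        else:
            code = ord(ch) & 31                   # toy 5-bit code (a-z)
            bits.append(1)                        # flag: other character
            bits += [(code >> k) & 1 for k in (4, 3, 2, 1, 0)]
    return bits

def decode(bits, count):
    pos = 0
    def get(n):
        nonlocal pos
        v = 0
        for _ in range(n):
            v = (v << 1) | bits[pos]
            pos += 1
        return v
    out = []
    for _ in range(count):
        if get(1) == 0:
            out.append(COMMON[get(3)])            # 0xxx: 3-bit index
        else:
            out.append(chr(ord('`') + get(5)))    # 1xxxxx: 5-bit code
    return "".join(out)
```

Common characters cost four bits instead of five, so text dominated by them comes out shorter than a flat 5-bit encoding.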

We will discuss how to do this efficiently and elegantly on the 6502. In particular, we will look at a technique that performs the two-stage procedure described above, and even navigates arbitrary decision trees, as part of its normal operation.

The schoolbook application for this kind of routine would be a Lempel-Ziv-Welch decruncher or a Huffman decoder. But anything is possible! For instance, you could use it to parse note events in a playroutine, instructions in a virtual machine, or entropy-encoded sound samples.

We will start with a simple design, and then add complexity step by step, also optimising it to the point where the complete decoder is quite devilish to follow.

## From bytes to bits

At the heart of the bitfield decoder is the shifter. This is essentially a mini-buffer of pending bits, represented as a single byte in the zero-page. As we shift out bits, the buffer occasionally becomes empty, at which time a new byte is loaded into it. The clever part is how we represent the shifter. This is an old established technique, but it can be rather baffling when you see it for the first time.

The idea is that the shifter contains (from left to right) 0–7 bits of pending data, followed by a single 1-bit that we'll refer to as the token, followed by zeros. So, the following shifter contains three bits of data (1, 0, 1): `%10110000`.

At program start, the shifter is initialised to \$80.

Here is a first attempt at a getbit routine:

```
getbit
asl     shifter
bne     done

jsr     getbyte
sta     shifter
sec
rol     shifter
done
; The bit is now in C.
rts
```

In order to read a bit from the shifter, we first perform an ASL. Normally, this puts one bit of data in the carry flag, while also preparing the shifter for the next bit. But if the Z flag was set by the ASL, the buffer was in fact empty, and we shifted out the token bit. In that case, we grab a new byte, store it in the shifter, and then ROL to get the first data bit from the new byte. The ROL will also shift in a new token bit.
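
The same logic can be modelled in a few lines of Python (a sketch for illustration, not part of the decoder itself):

```python
# Model of the token-bit shifter. 'shifter' holds pending data bits,
# then a single 1 (the token), then zeros; 0x80 therefore means "empty".

class BitReader:
    def __init__(self, data):
        self.data = data
        self.pos = 0
        self.shifter = 0x80

    def getbit(self):
        carry = self.shifter >> 7                # ASL: bit 7 into carry
        self.shifter = (self.shifter << 1) & 0xff
        if self.shifter != 0:                    # Z clear: got a data bit
            return carry
        byte = self.data[self.pos]               # token shifted out:
        self.pos += 1                            # reload from the stream
        carry = byte >> 7                        # SEC + ROL: first data bit
        self.shifter = ((byte << 1) | 1) & 0xff  # out, new token in at bit 0
        return carry
```

Reading eight bits reproduces the byte: `BitReader([0b10110010])` yields 1, 0, 1, 1, 0, 0, 1, 0.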

In practice, it would be slow to call a subroutine in order to fetch new bytes. After all, this will happen for every eighth bit, which is quite often. Instead we'll use some self-modifying code, and keep a pointer to the next byte inside an instruction operand, like this:

```
getbit
asl     shifter
bne     done
mod_source
ldx     buffer
inc     mod_source+1
bne     no_new_page

inc     mod_source+2
no_new_page
stx     shifter
rol     shifter
done
; The bit is now in C.
rts
```

We're using the X register because we're going to need A for something else soon.

Note that the SEC has now been removed, because carry is already set from the previous token bit. If you want to get philosophical about it, you might say that it's “the same” token bit that gets re-used over and over.

Next, we will rearrange the code to reduce the number of branch-taken penalty cycles. From now on, we must make sure to CLC before calling getbit.

```
getbyte
mod_source
ldx     buffer
inc     mod_source+1
bne     no_new_page

inc     mod_source+2
no_new_page
stx     shifter
getbit
rol     shifter
beq     getbyte

; The bit is now in C.
rts
```

## From bits to fields

So now we can read individual bits from the stream. Let's pack them together into bitfields!

We could of course call the getbit routine from a loop:

```
getfield
; Y contains the requested number of bits

lda     #0
clc
field_loop
jsr     getbit
rol
dey
bne     field_loop

; The bitfield is now in A.
rts
```

(This is why we had to preserve the A register during getbit/getbyte.)

But again, subroutine calls are costly, so we'll merge getfield and getbit into a single routine. However, getting a single bit is now slower, because we have to treat it as a field of size one.

```
getbit
ldy     #1
getfield
; Y contains the requested number of bits

lda     #0
clc
jmp     field_loop
getbyte
mod_source
ldx     buffer
inc     mod_source+1
bne     no_new_page

inc     mod_source+2
no_new_page
stx     shifter
field_loop
rol     shifter
beq     getbyte

rol
dey
bne     field_loop      ; C is clear

rts
```

Note that, because we clear A at the beginning, we don't have to CLC before looping back to `field_loop`.

But we can do better than this! Instead of representing the requested number of bits as an integer in the Y register, we can represent it as a single 1-bit in the accumulator. As we shift new data into the accumulator, the 1-bit gets closer and closer to the MSB, and when it finally falls off the edge, we terminate the loop:

```
getbit
lda     #%10000000
getfield
; Position of 1-bit in A represents requested number of bits

clc
jmp     field_loop
getbyte
mod_source
ldx     buffer
inc     mod_source+1
bne     no_new_page

inc     mod_source+2
no_new_page
stx     shifter
field_loop
rol     shifter
beq     getbyte

rol
bcc     field_loop

rts
```

This preserves Y and saves two cycles per bit (DEY).
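
In Python terms (a sketch, with an iterator standing in for the shifter machinery), the marker bit plays the role of the loop counter:

```python
# The field specifier is a byte containing a single 1; its position alone
# decides the width. We shift data in from the right until that 1 falls
# off bit 7, exactly like the ROL loop terminating on carry.

def getfield(bits, spec):
    acc = spec                    # e.g. 0b00100000 requests 3 bits
    done = False
    while not done:
        done = (acc >> 7) == 1    # marker about to fall off the MSB
        acc = ((acc << 1) | next(bits)) & 0xff
    return acc
```

For example, `getfield(iter([1, 0, 1]), 0b00100000)` returns `0b101`.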

## Two-stage fields

Given the above routine, we are now in a position to implement the scroller scenario described in the introduction. Here is some code to fetch a new character from the bitstream:

```
getchar
jsr     getbit
bne     large

; 3-bit character code
lda     #%00100000
jmp     getfield
large
; 5-bit character code
lda     #%00001000
jsr     getfield
clc
adc     #8
rts
```

Actually, we can shave off a byte and a pair of cycles by recognising that getfield always returns with carry set: We can safely omit the CLC and do ADC #7 instead.

In more complex scenarios, such as decrunchers, we often need to distinguish between more than two cases. Perhaps we read two bits in order to select between four differently-sized encodings:

| Value range | Coded as  | Value offset (what to add to x) |
|-------------|-----------|---------------------------------|
| 0–1         | 00x       | 0                               |
| 2–5         | 01xx      | 2                               |
| 6–21        | 10xxxx    | 6                               |
| 22–149      | 11xxxxxxx | 22                              |

Rather than spelling out these four cases as different paths through the code, we can use a table-based approach. This helps keep down the size of the decruncher, which is often very important. It will also enable some more optimisations further down the rabbit hole.

We will use one table for the field widths, and one table for the value offsets.

```
getvalue
lda     #%01000000      ; Get two bits.
jsr     getfield
tay
lda     fields,y
jsr     getfield
clc
adc     offsets,y

; 9-bit value returned in A and C.
rts

fields
.byt    %10000000       ; Get one more bit.
.byt    %01000000       ; Get two more bits.
.byt    %00010000       ; Get four more bits.
.byt    %00000010       ; Get seven more bits.

offsets
.byt    0
.byt    2
.byt    6
.byt    22
```

Note that in the example, the maximum value returned is 149. Therefore, rather than saying that the result is a 9-bit value, we could simply say that the routine returns with carry undefined, and with an 8-bit result in A. We could then eliminate the CLC, and compensate by subtracting one from each value in the offset table. The reason why we can't do this for 9-bit values, is that the first entry in the offset table would become \$ff, and this would cause values 0 and 1 to instead come out as 256 and 257.
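
As a cross-check, here is the table-driven getvalue mirrored in Python (a sketch; getfield is restated so the snippet stands alone):

```python
# Decode the four-case encoding from the table above.

FIELDS  = [0b10000000, 0b01000000, 0b00010000, 0b00000010]
OFFSETS = [0, 2, 6, 22]

def getfield(bits, spec):
    acc = spec
    done = False
    while not done:
        done = (acc >> 7) == 1
        acc = ((acc << 1) | next(bits)) & 0xff
    return acc

def getvalue(bits):
    sel = getfield(bits, 0b01000000)          # two selector bits
    return getfield(bits, FIELDS[sel]) + OFFSETS[sel]
```

Feeding it `11` followed by seven 1-bits produces 149, the maximum value of the encoding.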

## Decoding with arbitrary decision trees

Consider again our scroller example. Suppose we wish to encode a particularly common character (such as space) using a single bit. We might decide on the following encoding scheme:

| Value range | Coded as | Value offset (what to add to x) |
|-------------|----------|---------------------------------|
| 0           | 0        | 0                               |
| 1–8         | 10xxx    | 1                               |
| 9–40        | 11xxxxx  | 9                               |

To fetch a character now, we start by getting a single bit. Based on its value, we're either done or we fetch one more bit, which in turn determines whether we fetch a 3-bit or a 5-bit field.

This algorithm is essentially a tree of decisions, as illustrated by the following flowchart:

We will refer to the rhombus-shaped nodes as branch nodes and the rounded-rectangle nodes as return nodes.

Such decision trees are usually implemented explicitly, as code. But for large trees, the decoder becomes unwieldy. Next up, we'll see how we can represent decision trees more compactly using tables.

In each node of the flowchart above, we first fetch a bitfield (possibly of size zero), and then either:

1. Branch to a different node, or
2. Add a constant and return.

It is time to introduce another decoding trick! So far, the field specifiers (what we put in A prior to calling getfield) have consisted of a number of zeros followed by a single set bit. But the remaining bits have no purpose yet, and they will be available in A when getfield returns, shifted into a position immediately to the left of the fetched bits.

So, if we call getfield with A set to `001ttttt` (t is for tag), we'll get `tttttxxx` back, where x is the fetched bitfield.
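
This is easy to check against a Python sketch of getfield (restated here):

```python
def getfield(bits, spec):
    acc = spec
    done = False
    while not done:
        done = (acc >> 7) == 1
        acc = ((acc << 1) | next(bits)) & 0xff
    return acc

# Spec 001ttttt requests a 3-bit field; the five tag bits come back as
# the high bits of the result, immediately left of the fetched bits.
spec   = 0b00110110               # 001ttttt with tag bits t = 10110
result = getfield(iter([1, 0, 1]), spec)
# result is now 0b10110101: the tag 10110 followed by the fetched 101
```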

The most significant bit of the tag will also be in the sign bit of the status register. Some decrunchers, e.g. Doynamite, use this to determine whether the value returned is complete, or whether it's just the high-byte of a larger value. In the latter case, the low-byte can be grabbed very quickly straight from the byte stream. Essentially, one tag bit is used to differentiate between two cases.

However, in the present technique, we wish to encode a generic decision tree, and for this we'll have to use more tag bits.

(In the following, the word “branch” will refer to branches in the flowchart, not 6502 branch instructions!)

Suppose we put a number on each node in the flowchart. The current node number will be kept in the Y register. From this number, we can deduce (using a lookup table) how many bits to fetch, whether we should branch or return after fetching, and—in case of a branch node—what range of nodes we should branch to.

All of this information can be encoded as a single byte, and placed in the accumulator before calling getfield. As we have already seen, the number of leading zeros determines the field width. They are followed by a single set bit and a tag. We will use the most significant tag bit to keep track of what kind of node we're in. If this bit is clear, we're in a branch node, in which case the remaining tag bits will be used to encode the range of branch targets.

A separate lookup table, also indexed by the current node number in Y, will be used to hold the constants that are added in return nodes.

```
decode
ldy     #4      ; Start at node 4, the last node in the table.

decode_loop
; Y represents the current node, and is an index into the field and
; offset tables.

lda     fields,y

; In A, we now have:
;       a number of zero bits, controlling how many bits to fetch
;       a one bit
;       if we are in a return node:
;               a one bit (tag MSB)
;               fill up with zeros
;       if we are in a branch node:
;               a zero bit (tag MSB)
;               tag bits --> first target node (after shift)

; Special exception to the above:
; If we're going to fetch a zero-length field, A is zero.
; Handle that now.

beq     done

; Otherwise, fetch the field.

jsr     getfield

; In A, we now have:
;       a bit indicating whether we are in a branch or return node
;       more tag bits (all zero in case of a return node)
;       the field we just fetched

; Are we in a return node?

bmi     done

; No, this was a branch node. The branch target is in A.

; Note that the target has been constructed automatically by
; concatenating the tag with the fetched bits. So if the tag was 0011
; and we fetched 101, we're going to branch to node 0011101.

tay
jmp     decode_loop
done

clc
adc     offsets,y

rts
fields
.byt    %00000000       ; Node 0: Fetch no more bits.
.byt    %10000001       ; Node 1: Fetch 1 bit, then branch to node 2/3.
.byt    %00110000       ; Node 2: Fetch 3 bits, then return.
.byt    %00001100       ; Node 3: Fetch 5 bits, then return.
.byt    %10000000       ; Node 4: Fetch 1 bit, then branch to node 0/1.

offsets
.byt    0               ; Add constant to obtain range 0-0.
.byt    0               ; Unused (branch node)
.byt    $80+1           ; Add constant to obtain range 1-8.
.byt    $80+9           ; Add constant to obtain range 9-40.
.byt    0               ; Unused (branch node)
```

A subtlety is that when we return without fetching anything (node 0), the accumulator will be zero before adding the constant. Otherwise, the accumulator will be \$80, and we have to compensate accordingly in the offset table.
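
For reference, the complete table-driven tree walk translates into Python like this (a sketch; the $80 compensation is handled with a mod-256 addition):

```python
# Decision-tree decoder for the space / 3-bit / 5-bit character scheme.

FIELDS  = [0b00000000, 0b10000001, 0b00110000, 0b00001100, 0b10000000]
OFFSETS = [0x00, 0x00, 0x81, 0x89, 0x00]   # $80-compensated, as in the text

def getfield(bits, spec):
    acc = spec
    done = False
    while not done:
        done = (acc >> 7) == 1
        acc = ((acc << 1) | next(bits)) & 0xff
    return acc

def decode(bits):
    y = 4                          # start at node 4, the root
    while True:
        spec = FIELDS[y]
        if spec == 0:              # zero-size fetch: return with acc = 0
            return OFFSETS[y]
        acc = getfield(bits, spec)
        if acc & 0x80:             # tag MSB set: return node
            return (acc + OFFSETS[y]) & 0xff
        y = acc                    # branch node: acc is the target node
```

A single 0-bit decodes to 0 (the space); the bits 1, 0, 101 decode to 5 + 1 = 6; and 1, 1, 11111 decode to 31 + 9 = 40.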

The above code was organised for clarity. However, we can rearrange the loop to eliminate the JMP instruction. There's also no need to start by setting up a constant Y, as we could just as well load A directly. Since the first node is always a branch node, we won't be using Y after the fetch, so we can leave it uninitialised. Hence:

```
decode
lda     #%10000000      ; Fetch 1 bit, then branch to node 0/1.
decode_loop
jsr     getfield
bmi     done

tay
lda     fields,y
bne     decode_loop
done
clc
adc     offsets,y

rts

fields
.byt    %00000000       ; Node 0: Fetch no more bits.
.byt    %10000001       ; Node 1: Fetch 1 bit, then branch to node 2/3.
.byt    %00110000       ; Node 2: Fetch 3 bits, then return.
.byt    %00001100       ; Node 3: Fetch 5 bits, then return.

offsets
.byt    0               ; Add constant to obtain range 0-0.
.byt    0               ; Unused (branch node)
.byt    $80+1           ; Add constant to obtain range 1-8.
.byt    $80+9           ; Add constant to obtain range 9-40.
```

The CLC at `done` can be removed if we adjust the offset table: We subtract one from each table entry that corresponds to a return node where a non-zero-sized field was fetched.

## Putting it all together

Cramming an arbitrary decision tree into the field table is all very nifty, and it keeps down the size of the decoder considerably. But what about performance? Surely, putting a flowchart in a table can't be faster than simply coding it with explicit branch instructions?

But as a consequence of the table-driven design, there is now a great optimisation opportunity staring us in the face: We're down to a single call to the getfield routine, and that means we can inline it!

```
decode
lda     #%10000000      ; Fetch 1 bit, then branch to node 0/1.
clc
jmp     decode_loop
getbyte
mod_source
ldx     buffer
inc     mod_source+1
bne     no_new_page

inc     mod_source+2
no_new_page
stx     shifter
decode_loop
rol     shifter
beq     getbyte

rol
bcc     decode_loop

bmi     done

tay
lda     fields,y
clc
bne     decode_loop
done
; Carry will be set if we got here via the BMI, i.e. after fetching a
; non-zero-sized field. Compensate in the table.

adc     offsets,y
rts

fields
.byt    %00000000       ; Node 0: Fetch no more bits.
.byt    %10000001       ; Node 1: Fetch 1 bit, then branch to node 2/3.
.byt    %00110000       ; Node 2: Fetch 3 bits, then return.
.byt    %00001100       ; Node 3: Fetch 5 bits, then return.

offsets
.byt    0               ; Add constant to obtain range 0-0 (Carry clear).
.byt    0               ; Unused (branch node)
.byt    $7f+1           ; Add constant to obtain range 1-8 (Carry set).
.byt    $7f+9           ; Add constant to obtain range 9-40 (Carry set).
```

Indeed, with such a flexible routine, one might even be able to drive all decoding from a single call site, and thus to inline the call to the decoder itself. For a real-world example of this, please have a look at the decruncher in Spindle 2.1.

## A final touch

The code is already looking rather streamlined, but let's top it off with one more optimisation: We can get rid of two cycles for each step through the decision tree, by eliminating the CLC right before branching back to `decode_loop`.

The following trick is only possible if, for each node, the number in the field table is either zero (for a zero-size fetch) or strictly larger than the node number. Many decision trees have this property, because node numbers are small integers, while numbers in the field table tend to be large. If not, it may be possible to fix it by rearranging the node numbers.

The idea is to access the table a little differently: Instead of simply loading from it, we perform an ADC. Naturally, we then have to compensate in the table, by subtracting from each element the node number (which happens to be in A at the time of the addition) and 1 (for the carry flag, which is set).
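
The bookkeeping can be sanity-checked with a few lines of Python (a sketch of the arithmetic only):

```python
# The table stores (field - node - 1) mod 256. Adding the node number
# back with carry set recovers the original field byte; the carry-out is
# set only for the zero-fetch entries, where the sum wraps past 255.

def adc_table_entry(node, field):
    stored = (field - node - 1) & 0xff
    total = node + stored + 1            # ADC with carry set on entry
    return total & 0xff, total > 0xff    # (A, carry out)
```

For node 1 with field %10000001 this yields ($81, carry clear), so the ROL at `decode_loop` shifts in a zero, as promised; for node 0 with a zero field it yields (0, carry set), matching the "Carry is set" path at `done`.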

With that, we are ready for the final version of the decoder. It is listed below in the form of a subroutine, but, as mentioned earlier, it should be inlined for maximum performance.

```
decode
lda     #%10000000      ; Fetch 1 bit, then branch to node 0/1.
clc
jmp     decode_loop
getbyte
mod_source
ldx     buffer
inc     mod_source+1
bne     no_new_page

inc     mod_source+2
no_new_page
stx     shifter
decode_loop
rol     shifter
beq     getbyte

rol
bcc     decode_loop

bmi     done

tay
adc     fields,y
bne     decode_loop     ; Carry is clear when branching.
done
; Carry is set.

adc     offsets,y
rts

fields
.byt    %00000000-0-1   ; Node 0: Fetch no more bits.
.byt    %10000001-1-1   ; Node 1: Fetch 1 bit, then branch to node 2/3.
.byt    %00110000-2-1   ; Node 2: Fetch 3 bits, then return.
.byt    %00001100-3-1   ; Node 3: Fetch 5 bits, then return.

offsets
.byt    $ff             ; Add constant to obtain range 0-0.
.byt    0               ; Unused (branch node)
.byt    $7f+1           ; Add constant to obtain range 1-8.
.byt    $7f+9           ; Add constant to obtain range 9-40.
```

## Conclusion

We have seen how to extract bitfields from byte sequences stored in RAM, using a highly efficient technique that is capable of navigating arbitrary decision trees as part of the decoding process.