======= Advanced optimizing =======
In addition to the tutorials about speedcode and its generation i want to show some other possibilities to save some cycles and thus speed up your code. Therefore i try to give you some triggers on common situations. Feel free to add some more examples.
====== Branches and conditional code blocks ======
Branches take 2 cycles if not taken, and 3 cycles if taken. Having this in mind, we can quickly save one cycle by choosing our branch wisely.
...
;some code
bcs +
;return
rts
+
;continue
...
This can be better written as:
...
;some code
bcc +
;continue
...
+
;return
rts
So always see that the block that is executed many times is preferred and is not wasting time, while the more expensive things should be moved to the block that happens less often (or even happen outside the loop). Even better, when you can combine branches by putting the decision to the end of a code block and thus save a whole branch:
-
...
;some code
bcc +
;continue
...
jmp -
+
;return
rts
The above can happen sometimes, and might be hard to avoid in certain cases, but often a closer look reveals that it could also work like this:
-
...
;some code
;continue
bcs -
;return
rts
Also, if you need to load a register depending on some branch, you might be able to save some cycles. Imagine you have the following to load Y depending on the state of the carry:
cmp $1000
bcs +
ldy #$00
jmp ++
+
ldy #$01
++
This can be solved in less cycles and less memory:
ldy #$01
cmp $1000
bcs +
ldy #$00
+
or:
cmp $1000
lda #$00
rol
tay
or if Y shall either be $80 or $00:
cmp $1000
arr #$00
tay
For saving space (not cycles, in fact this is pretty expensive!), BIT is your friend as you can "jump over" a one or two byte command via a BIT instruction. Consider this code:
beq +
lda #$04
sta somewhere
rts
+ lda #$05
sta somewhere
rts
This can be reduced using a BIT instruction. The opcode for BIT is $2c:
beq +
lda #$04
.byte $2c
+ lda #$05
sta somewhere
rts
This way, the processor will see a bit $05a9 after the lda #$04. The lda #$05 is not executed as it is part of the BIT command. And the BIT command does not change any registers, it only sets some flags. This can be stacked endlessly:
lda #$04
.byte $2c
foo1 lda #$05
.byte $2c
foo2 lda #$06
.byte $2c
foo3 lda #$07
...
sta somewhere
rts
In much the same way, also one byte commands can be "ignored" using the opcode for BIT zeropage, $24.
====== Determining block number ======
When doing graphics usually the 8 pixel block restrictions apply, so it is a good thing to break down code to block size and execute blocks of code per block. For that reason you would need the number of full blocks that you handle. There's different approaches of how to get there. Actually all you have to perform is (x1 & $1f8)) - (x2 & $1f8)) / 8 (always assuming that x1 is > x2)
So what you can do is the approach (x1 & $1f8) / 8 - (x2 & $1f8) / 8. As both components are divided by 8, the & $1f8 can be dropped as the bits vanish anyway. But we are limited to 8 bit then:
ldx x1low
ldy x2low
lda div8,x
sec
sbc div8,y
However if x1 and x2 are 9 bit numbers we get into trouble. Imagine x1 is $100 and x2 is $f0? So if we'd just subtract the lowbytes we would end up in doing a $0 - $1e what would result in $e2 blocks what is defenitely wrong. Correct would be in fact to perform $20 - $1e. So here comes the approach to handle also 9 bit subtractions at no extra costs by just putting the shifting to the end:
lda x1low
ora #$07
sec
sbc x2low
lsr
lsr
lsr
As you see first of all the lower 3 bits of x1 are all set to avoid an underflow in the lower 3 bits and what would simulate the same as a (x1 - (x2 & $1f8)). However performing an and-operation on the second operand would force us to do the operation beforehand and store the result somewhere. So we first of all turn things around and avoid the underrun on the lowest 3 bits not by masking them out, but by maximizing those bits what gives the same result. When done so we would subtract x2 and finally shift 3 times to divide by 8.
Now going through that example with our previous 9 bit values we see that the following happens:
$00 | $07 = $07
$07 - $f0 = $17
$17 >> 3 = $02
So no matter what lower three bits of x2 would be set, we would be save from an an underrun and end in values from $10 up to $17. Finally when shifting now, all is fine. As the subtract wraps around at the right bit, we end up with a result of $02. However keep in mind that the distance between x1 and x2 must not be greater than $ff.
Alternatives would be: (-(x1 & $f8) + x2) or (-x1 | 7 + x2)
lda x1
and #$f8
eor #$ff
sec
adc x2
...
lda x1
eor #$ff
sec
ora #$07
adc x2
====== Indexing and counting ======
Indexing makes life easier, but implies increment/decrement of the index and checking against some endvalue. Expensive!
===== Absolute indirect addressing =====
The absolut inidrect addressing is, what we often want for picking values from a table or for e.g. the screen. But the instruction set of the 6502 only allows us to do the indirect adressing indexed, so usually the index is in our way and we end up in doing stuff like:
sty y+1
ldy #$00
lda (screen),y
y ldy #$00
;or
stx x+1
ldx #$00
lda (screen,x)
x ldx #$00
If we now imagine having static or predictable values for X we could forgo on saving the register, but incorporate it into the pointer. For the case of Y this is even more expensive, but in the case of X we can easily do:
ldx #$bf
lda (<(screen - $bf),x)
Though this way of indirect addressing costs 6 cycles it saves the overhead of storing and restoring X. There's no need to set X to zero, but you can just subtract X from the screen-value. To avoid an underrun we therefore take the lowbyte of the result by using '<'. As the 6502 has the $ff-wrap around bug, values will not fetched from above ($ff) or below ($00), so also no need to bother about getting out of bounds.
===== Comparisons/Faster loops =====
The compare at the end of a loop influences the carry flag, what can be rather annoying if you calculate in a loop (but can also come handy if you need to clear or set the carry in each round)! So sometimes we are really glad to get rid of it and thus save the compare (plus maybe even a CLC/SEC). Just imagine the following code:
ldy #$18
-
sta $1000,y
dey
cpy #$10
bne -
Wouldn't it be better like this?
lda #start
sec
sbc #end
;calc delta-y
tay
lda #end
;add offset beforehand
sta tgt+1
-
tgt sta $1000,y
dey
bne -
Now we can count down to zero, and there is no need for any comparison beforehand, by simply adding the end-value as a fixed offset to the target and counting down the delta of start and end. However be aware of the fact, that a penalty cycle applies if you use LDA and cross a pageboundary (y+#end >= $100). STA will always use 5 cycles.
===== Using zeropage =====
Storing to zeropage saves one cycle, compared to a store on a 16-bit address. Also it allows us to store X and Y-register directly what saves an additional TXA/TYA and thus A is not clobbered anymore. So think about it, sometimes it is wise to store some data in zeropage. Imagine the following loop:
ldy #$40
ldx #$00
-
txa
sta $1000,y
dex
dex
dey
bne -
But if there is enough space in zeropage, you'd better do:
ldy #$40
ldx #$00
-
stx $80,y
dex
dex
dey
bne -
Even more advantages arise when you place a whole piece of code in the zeropage mostly when selfmanipulating your code. Here's some example that makes this more clear:
;Example 1
my_y = * + 1
ldy #$00
iny
sty my_y
;..do things that clobber y
;Example 2
lda #$00
sta data
lda #$10
sta data+1
data = * + 1
lda $1000,y
...
lda (data),y
First, we can save registers to zeropage with 3 cycles, but can get the value back with only two cycles. So there's one cycles saved whenever we run into scenarios where we need to use a register for more than one purpose. The second part shows that when having code in the zeropage, constructs like the indirect indexed addressing can be omitted if the address is used only once, thus saving again one cycle. Further more one can easily reuse the set up address by referencing the label (data) later on with an indirect indexed opcode. Something you can't do easily when running code outside the zeropage.
===== Using the stack =====
PHA and PLA push/pull the accumulator to the stack and decrement/increment the stack-pointer for free. So if we need to sequentially store a bunch of values somewhere, this could be your option:
;save stack pointer
tsx
stx $02
;set to our target ($0180)
ldx #$80
txs
lda #$00
clc
-
pha
adc #$10
bcc -
;restore stackpointer
ldx $02
txs
Further advantage of this method is, that we have an additional register free, as it is not used for an index anymore. But be aware! You have to take into account, that you have to store values top-down, as the stack-pointer decreases on every push. The advantage is, that if an interrupt occurs in between, it will not trash your values on the stack, as it pushes its 3 bytes (PC + Status) below your current position. All you need to take care of is, that you don't under-run the stack in case of an interrupt (needs 3 bytes, if you do a JSR in the interrupt-handler, another 2 bytes are needed per level), or trash still valid content in the upper part of the stack.
For reading out your values from stack you can either use pla but much easier via e.g. lda $0100,x
===== Counting with steps greater than 1 =====
Later we will discover to do that also by SBX, but there's also another option to do that easily and being able to use LAX features for the index or even function that we walk along
count = $20
ldx #$00
ldy #$00
-
stx count,y
iny
txa
sbx #-3
cpx #$60
bne -
...
.index lax count
...
do stuff with X and A
...
inc .index + 1
As you see the inc .index + 1 will fetch the value from the next location in zeropage on the next turn Thus we have A and X increased by 3 on each round, all done in 9 cycles, and with the option of destroying x later on.
===== Counting bits =====
As we are on a 8 bit machine, counting from or to 8 occurs quite often. So why not counting bits?
;setup counter
lda #$80
sta $02
-
;do stuff that best use A, X and Y
lsr $02
bcc -
;restore counter
ror $02
This gets even cooler when you are able to use $02 as some bitmask (for e.g. when drawing lines). When using BMI/BPL or BVS/BVC (need then to test bits with BIT however) you might even count to 1, 2, 6 or 7.
===== Run length =====
In unrolled loops the current value of an virtual index can be determined by the code position, thus it is possible to separate the modiyfing part of a loop from the testing part. Let us use an example again to make this more obvious:
lda mask
sta (bmp),y
iny
txa
sbc dy
tax
bcs +
;update
+
lda mask
sta (bmp),y
iny
txa
sbc dy
tax
bcc +
;update
+
...
As you see, there's a nice unrolled loop but for every step we need to reload mask and save/restore our calculation results to the x-register.
If we would now extract the store part we could aggregate the stores:
lda mask
sta (bmp),y
iny
sta (bmp),y
And on the other hand also do the calculations without using the x-register, as A is all free now:
sbc dy
bcs +
;update
+
sbc dy
bcs +
;update
+
...
So when all merged together we would end up with the following code:
sbc dy
bcc one_times
sbc dy
bcc two_times
sbc dy
bcc three_times
...
one_times
lda mask
sta (bmp),y
jmp update
two_times
lda mask
sta (bmp),y
iny
sta (bmp),y
jmp update
three_times
lda mask
sta (bmp),y
iny
sta (bmp),y
iny
sta (bmp),y
jmp update
It is obvious that code size restricts this method a bit as the branches don't reach too far. But even then there's situations where it is worth to spend a long branch construction while still saving cycles.
====== Clobbering registers ======
Registers are very scarce on the 6502, as we have only the accumulator for arithmetic operations and two index-registers. Running out of registers is very expensive, as we need to save and then restore again registers:
...
sta $02
txa
sta $1000,y
lda $02
dey
...
The following gets much faster if you run out of registers:
;before loop, setup y offset
sty tgt+1
...
tgt stx $1000
dec tgt+1
...
This way we avoid clobbering A and we can even get Y free for other use. Although the 6 cycles of the DEC appear expensive, we save 3 + 2 + 3 + 2 cycles + 1 cycle on the unindexed STX now. 5 cycles faster, nice!
====== The carry ======
CLC and SEC can make our additions and subtractions twice as expensive, so always keep track of if the carry is set or cleared and if we can reuse it. Also there are some possibilities to set it for free as a nice side-effect from other instructions. Other instructions leave your carry unclobbered but still lead to the same result.
Watch out for BCS and BCC, if you use them within your code, you have the best evidence on if your carry is set.
**Example 1:**
bcs +
;clc
adc #$10
sec
+
;sec
sbc #$08
...
**Example 2:**
lda #$00
clc
adc #$10
sta $10
;clc carry is still clear, as above addition can never overflow
adc #$10
sta $11
...
Also, sometimes an EOR, ORA or AND will just do the same like an ADC or SBC, but without clobbering your carry, and without regarding its current state.
rol $02
lda $fb
eor #$80
sta $fb
bpl +
dec $fc
+
;carry is still okay here
When inside a loop, you can also use the compare at the end of the loop to set/clear your carry automatically:
sec
-
sbc #$10
dey
cpy #$00 ;sets carry again (upcoming bcs shows that clearly)
bcs -
if you need a cleared carry all the time, then do an incrementing loop and branch on clear.
Also you might want to have a look at this article: [[base:some_words_about_the_anc_opcode|Some words about the ANC opcode]]
If the carry has not the desired state, one can still circumvent a CLC/SEC beforehand by just taking the state of the carry into account:
;sec
adc #$07 ;actually we want to add 8, but carry is set, so 7 is enough
...
;clc
sbc #$07 ;actually we want to subtract 8, but carry is clear, so 7 is enough
Another way to circumvent a wrong carry state is to do the following (and if we are lucky the adc does not overflow and we also save the clc):
;sec
adc value1
clc
sbc value2
;gives value1+1-value2-1 = value1-value2
When we have to set/clear the carry often due to overflow/underflow of the value, depending on the range of your added/subtracted values, it is smart to shift the value beforehand:
clc
lda value1
ora #$80 ;add 128
sbc #$40
adc #$60
sbc #$40
adc #$20
;... still no under/overflow and last carry state can always be reused
eor #$80 ;subtract 128 (can maybe even use and #$7f or even anc to influence carry!)
Sometimes we can substitute a subtraction by a compare. Then we don't even clobber A, but also are not in need to set carry beforehand (but might get the carry set for free):
lda $1000,y
and #$f8
cmp xpos ;substitutes sec + sbc xpos
bcc + ;now we have even a reliable state for our carry in both cases
... some code ...
+
Additions don't need to happen obviously, but can occur implicitely by indexed opcodes. A small example will explain this better:
lda pos
clc
adc offset
tax
lda tab,x
...
;in other words:
ldx pos
offset = * + 1
lda tab,x
...
;zero page indirect y-indexed
offset = $b0
;prerequesite, set up highbyte of pointer
lda #>tab
sta offset+1
;now read from tab + ypos + offset
ldy ypos
lda (offset),y
As you see, pos and offset are also added implicitely by the indexing done by the opcode. This might in some situations perform faster and will avoid clobbering the carry.
Last but not least, you might give [[base:advanced_optimizing#sbx|SBX]] a try which does not care about the state of the carry on a subtraction.
====== Use of immediate values ======
Using immediate values instead of values from memory saves cycles as well. So when inside a loop, think of presetting values before entering the loop, instead of fetching them from mem again and again:
...
lda $02
and mask,x
sta $1000,y
...
Now lets save one cycle in the inner loop:
;before loop
lda $02
sta val+1
...
val lda #$00
and mask,x
sta $1000,y
...
If you now even manage to combine the static value with the mask, you could even save 2 additional cycles and simply do a lda mask,x.
Also you might load a register with an often used value, and reuse that over the whole loop. This comes handy if you want to apply a fixed mask to a huge load of values when using the SAX command.
====== Shifting ======
LSR and ASL is actually a good way to divide/multiply (see also [[signed 8bit divide by 2 (arithmetic shift right)|here]]) by two when handling unsigned numbers, the 2 cycles don't hurt, but might, when doing excessive shifts. When we want to relocate a bit within a byte we can do that from both sides. Imagine we want to get bit 7 down to bit 0:
lda #$80
lsr
lsr
lsr
lsr
lsr
lsr
lsr
Quite some shifts needed, but why not using the possibility of wrap-arounds?
lda #$80
asl
rol
Another nice trick to transform a single bit into a new value (good for adding offsets depending on the value of a single bit) offset is the following:
lda xposl ;load a value
asr #$01 ;move bit 1 to carry and clear A
bcc +
lda #$3f ;carry is set
+
adc #stuff ;things will work sane, as offset includes already the carry
As you can see we have now either loaded $00 or $40 (carry!) to A depending on the state of bit 0, that is ideal for e.g. when we want to load from a different bank depending on if a position is odd or even. As you see, the above example is even faster than this (as the shifting always takes 6 cycles, whereas the above example takes 5/6 cycles):
lda xposl
asr #$01
ror
lsr
adc #stuff ;things will work sane as carry is always clear (upper bits are masked out)
If you want to do the same as above but have to preserve the carry (maybe because you are just preparing to calculate the highbyte and have a carry from the previous lowbyte calculation) then you can use this variant:
lda xposl
and #$01
beq +
lda #$40
+
adc #stuff ;will now include the carry
The given examples show, that asr/arr is nice to combine shifting with masking. So here is another nice example (thanks to Peiselulli) to easily fetch 2 bits from a byte:
lda %10110110
lsr
asr #$03*2
This will mask out and shift down bits 2 and 3. Note that the mask is applied before shifting, therefor the mask is multiplied by two.
When you intend to copy a certain bit to the carry, you can do that within 2 cycles by comparing. However the bit must be the most significant bit being used:
ldx #$1f
cpx #$10 ;-> carry is set if bit 4 is set, else it is clear.
arr #$00 ;A = A & 0, ror, so bit 4 is now bit 7
The advantage is, that you can move bits also across registers and are not restricted to the accumulator only.
When shifting, we handle 9 bits, as the bit falling out at one edge of the byte will be the new carry, and the old carry will be shifted in. This will introduce a gap of one bit, when we wrap around bits:
lda #%11111111
clc
rol
rol
;-> A = %11111101
; ^
; gap :-(
To avoid this behavior there's several ways around it:
lda #%11111111
asl
adc #0
...
lda #%11111111
anc #$ff
rol
...
lda #%11111111
cmp #$80
rol
This way bit 7 is copied to carry first and then shifted in on the right end again.
If you deal with chars, you often need numbers divided by 8, this also includes numbers bigger than 8 bits, as the screen is 320 pixels wide. If you include clipping you might even span over a bigger range.
An easy way to shift 11 bits to a final 8 bit results without having to deal with two different bytes being shifted independently, is the following:
lda xhi ;00000hhh
asr #$0f ;000000hh h - might also be a lsr in case if no upper bits need to be clamped
ora xlo ;lllll0hh h
ror ;hlllll0h h
ror ;hhlllll0 h
ror ;hhhlllll 0
As the least significant 3 bits are lost during the shift anyway, we place the bits for the highbyte there and rotate them back in on the left side, so all we need to shift then is a single byte. To make the rotation work, the highbyte needs to be preshiftet by one before the lowbyte is merged in. The only prerequisite of this method is, that the lowbyte must have least significant three bits cleared.
====== Jumpcode ======
If you want to fetch a certain bitpair from a byte (for e.g. fetch the value of a multicolor pixel) you need to invest a variable amount of shifts. To get the shifting more dynamic we can use jumpcode, a trick that not only applies here, but also to various other loops:
lda xposl ;load some value
anc #$06 ;either jump 0, 2, 4 or 6 bytes far, clears carry to force upcoming branch
sta .jt1+1 ;setup jump
lda (zp),y ;load value to be shifted
.jt1 bcc *+2 ;jump into code with right offset
;x = 0
lsr
lsr
;x = 2
lsr
lsr
;x = 4
lsr
lsr
;x = 6
and #$03 ;finally mask out desired bits
Looks like a bunch of code, but if you imagine that the equivalent code would be the following, you see that the overhead setting up the jump is nearly the same, not to mention the saved loop-overhead of another 5 cycles per step if you use jumpcode:
lda xposl ;load some value
and #$06 ;mask out bits
eor #$06
tax
lda (zp),y ;load value to be shifted
cpx #$00
beq +
-
asl
asl
dex
bne -
+
and #$03
If you want to see other examples of jumpcode, have a look at the code presented in [[Filling the vectors]], where it is also used to set the start and endpoint of an unrolled loop.
Even more fun is doing jumpcode via an indirect jump. Imagine you have an unrolled loop of speedcode and want to find out the point where to enter your speedcode depening on an index. Usually the size of one loop is not a power of two and thus things get complicated, even worse when the speedcode segments we jump to have a variable size. Here comes the solution to seize the pain:
;first, set up an aligned table of pointers into our speedcode
!align 255,0
dest
!word speedcode_entry1, speedcode_entry2, speedcode_entry3 ...
enter
tya
asl ;shift by two, we use pointers
sta jump+1
jump jmp (dest)
;this way we simulate a jmp ($xxxx),y where the index may range from $00..$7f
If the targets to be reached are all within the same page, then of course a normal jump with manipulated lowbyte would do as well and save 2 cycles on teh jump.
More examples on this can be found here: [[dispatch_on_a_byte]]
====== Combining bits / Substitute logical operations ======
Mostly with 4x4 effects, but also in other cases you might wish to combine two numbers into a single byte, like e.g. high- and lownibble. You can usually do that like:
lda lownibbles,x
ora highnibbles,x
sta target
This will for e.g. merge a $c0 and $03 to $C3
This can however also been done the other way round by using an AND operation. Therefore just the unused bits have to be set to 1 instead of 0
So we would then result in for e.g. $cf and $f3 as high- and lownibble, if we now combine them by AND we also get $c3 as result. And then, you might think? Well, there's lots of illegal opcodes that include and operations, but just a few with ora/eor. Most of all, now we can make excessive use of the SAX command and store/manipulate low/highnibbles in A and X seperatedly. So the above code would now be:
lda lownibbles,y
ldx highnibble,y
sax target
So far we still use the same amount of cycles, but we are now able to reuse either X or A for the next combinations. In case of using ORA we would first need to mask out the unwanted nibble again to do so. This comes in handy for things like texture mappers, and has been used in the 50fps sphere mapper in coma light 13.
On other occasions you want to mask out for e.g. the lower 3 bits to act as a counter going from 0 to 7 or 7 to 0. Usually when you want the counter inverted this looks like:
lda x1
and #$07
eor #$07
tax
Though it can take more cycles, this method might be handy if you can reuse values and/or have the carry already set:
lda x1
eor #$ff ;tay + iny gives you for e.g. -val as an extra
and #$07
tax
;or
lda #$ff ;would also work with A = 0 and carry cleared, so we could save on a few opcodes if so
sbc x1 ;tay + iny gives you for e.g. -val as an extra
and #$07
tax
;or if a = $ff
;thanks to Kabuto to point that use of lax out in a comment on csdb
lax #7 ;X and A = 7
eor x1 ;A = x1 eor 7
sbx #$00 ;X = A and 7
;or if you go for the distance between x1 and x2
lda x1
ora #$07 ;(x1 | 7 - x2) / 8 gives you for e.g. the number of blocks (assumed that x2 < x1)
sbc x2
tax
If you want to clear certain bits and are using a add/subtract operation anyway, you actually can combine both if the bit to be cleared is set before in any case:
bmi +
;...
+
;bit 7 is set, clear it
and #$7f
sbc #$01
;can also be:
bmi +
;...
+
sbc #$81 ;-> clears Bit 7 and subtracts 1
In the same way this method can also be used to set bits (for e.g. with adc #$81) or to toggle bits.
When masking out bits, SAX or SBX is often a good choice.
lax value
and #%11110000
sta highnibble
After this we need to restore from X to mask the lower bits, better then another lda value, but still.
lda value
ldx #%11110000
sax highnibble
This looks already better, we have the original value still in A and can do another mask operation.
lax value
eor #%000011111
sax highnibble
This looks even better, we can reuse X here and also A still contains the original bits, but in an inverted manner. So this opens up more options of reusing the original value at more than one register which gives potential for further savings.
This was spotted in Krill's loader when doing lookups on the GCR tables, so thanks to Krill here :-)
====== Illegal opcodes ======
Now let me show you some nice situations where illegal opcodes can save you a few cycles by combining some mnemonics in a single command. Note that the examples i give are not the only situations where you can make use of illegal opcodes, but they might give you some hint on how they can be used. Also you might have noticed that i used some of the illegal opcodes in my previous examples, so here we go.
===== LAX =====
Loads A and X with the same value. Ideal if you manipulate the original value, but later on need the value again. Instead of loading it again you can either transfer it again from the other register, or combine A and X again with another illegal opcode.
lax $1000,y ;load A and X with value from $1000,y
eor #$80 ;manipulate A
sta ($fd),y ;store A
lda #$f8 ;load mask
sax jump+1 ;store A & X
Also one could do:
lax $1000,y ;load A and X with value from $1000,y
eor #$80 ;manipulate A
sta ($fd),y ;store A
txa ;fetch value again
eor #$40 ;manipulate
sta ($fb),y ;store
If you can afford clobbering A you can also load X with additional addressing modes (remember that ldx ($xx),y is not available) like:
lax ($fb),y
;... instead of
lda ($fb),y
tax
Even more fancy shit can be done by lax ($xx,x), here you can implement a lookuptable that needs the previous value of x as input. So basically you can do any x = f(x); term.
Actually you can use LAX also with an immediate value, but it behaves a bit unstable regarding the given immediate value. However when simply doing an LAX #$00 you are fine.
lda $xxxx,y is not available as 8 bit version, so an lda $xx,y is not possible. With lax $xx,y there is howeever a way to imitate a lda $xx,y at the cost of destroying x.
===== SAX/SHA =====
This opcode is ideal to setup a permanent mask and store values combined with that mask:
ldx #$aa ;setup mask
lda $1000,y ;load A
sax $80,y ;store A & $aa
Also there's a nice example of writing out a row of numbers faster than you might think:
;write values form 0 .. 7 to screen
ldx #$0e
lda #$01
sax $0400 ;write $01 & $0e == 0
sta $0401 ;write $01
lda #$03
sax $0402 ;write $03 & $0e == 2
sta $0403 ;write $03
lda #$05
sax $0404 ;write $05 & $0e == 4
sta $0405 ;write $05
lda #$07
sax $0406 ;write $07 & $0e == 6
sta $0407 ;write $07
as you see, this wastes just one byte more than an unrolled loop as in the upcoming example, but saves 2 cycles on every second byte written.
This trick also helps when you need to switch 8 sprite pointers in a line. Usually one could just set up 2 different pointers at two different screens and switch 8 sprite pointers via $d018. But this is not applicable if your effect renders stuff into the screen or if you are doing even double buffering with screens. Here you have to fall back to writing 8 new sprite pointers in less then 44 cycles (63-19), but then also cope with possible jitter that is added. Preloading registers will then only help if you have a stable enough irq position, for e.g. achieved by a double irq. Here this fast writing of 8 values helps.
The only thing to take care is, that #sprites is an even number (for odd numbers the sax and sta statements need to be swapped and y should be used for writing the last value). Now we are able to write 8 sprite pointers in 38 cycles.
ldx #$00
stx $0400
inx
stx $0401
inx
stx $0402
inx
stx $0403
inx
stx $0404
inx
stx $0405
inx
stx $0406
inx
stx $0407
An y-index version of //SAX// exists in the illegal opcode //SHA//. However it also adds the highbyte+1 of the used address as a mask to the value written. So in most cases you are restricted to certain destination addresses.
===== SHX/SHY =====
When storing to zeropage you can also store the y- and x-register with an index in a fast and comfortable way. But often you will need the zeropage for other things. Sadly the instruction set of the 6510 is not orthogonal and thus this features are not available for 16 bit addresses. You can however workaround that nuisance by using SHX or SHY, but have to cope with the H component in it, as the stored values are anded with the highbyte of the destination address + 1. So most of the time you might want to store to $fexx to not run into any problems. In case you have to apply an additional static mask, or if you just need certain bits of the stored values, you can of course choose a different address. If you start crossing a page with the index, the behaviour of this opcode changes radically. In those cases the Y-value becomes the highbyte of the address the values is stored at.
Want some example?
sin_p = $02
ztab = $fe00
lax (sin_p),y ;load position in ztab
shy ztab2,x ;store line num in ztab
iny ;next line
...
===== ASR =====
Whenever you need to shift and influence the carry afterwards, you can use ASR for that, and if you even need to apply an and-mask beforehand, you are extra lucky and can do 3 commands by that:
asr #$fe ;-> A & $fe = $fe -> lsr -> carry is cleared as bit 0 was not set before lsr
... same as ...
and #$fe
lsr
===== ARR =====
ARR ands the accumulator with an immediate value and then rotates the content right. The resulting carry is however not influenced by the LSB as expected from a normal rotate. The Carry and the state of the overflow-flag depend on the state of bit 6 and 7 before the rotate occurs, but after the and-operation has happened, and will be set like shown in the following table (thanks to doynax for correcting me):
^ Bit 7 ^ Bit 6 ^ Carry ^ Overflow ^
| 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 1 |
| 1 | 1 | 1 | 0 |
So ARR is quite similar to ASR and is perfect for rotating 16 bit stuff:
lda #>addr
lsr
sta $fc
arr #$00 ;A = A & $00 -> ror A
sta $fb
... is the same as ...
lda #>addr
lsr
sta $fc
lda #$00 ;set to #$01 if you want to leave with a set carry
ror
sta $fb
Note: When using ARR value #$00 or #$80 do the trick to influence the state of the carry after operation, but the later only if A has bit 7 set as well, so be careful here). However this uncommon behaviour enables another trick. Due to the fact that the carry resembles the state of bit 7 after ARR is executed, one can continuously shift in zeroes or ones into a byte:
lda #$80
sec
arr #$ff ; -> A = $c0 -> sec
arr #$ff ; -> A = $e0 -> sec
arr #$ff ; -> A = $f0 -> sec
...
lda #$7f
clc
arr #$ff ; -> A = $3f -> clc
arr #$ff ; -> A = $1f -> clc
arr #$ff ; -> A = $0f -> clc
===== SBX =====
Finally i found a good use for the SBX command. Imagine you have a byte that is divided into two nibbles (just what you often use in 4x4 effects), now you want to decrement each nibble, but when the lownibble underflows, this will decrement the highnibble as well, here the sbx command can help to find out about that special case:
lda $0400,y ;load value
ldx #$0f ;setup mask
sbx #$00 ;check if low nibble underflows -> X = A & $0f
bne + ;all fine, decrement both nibbles the cheap way, carry is set! \o/
sbc #$f0 ;do wrap around by hand
sec
+
sbc #$11 ;decrement both nibbles, carry is set already by sbx
... can be substituted by ...
lda #$0f ;set up mask beforehand, can be reused for each turn
sta $02
lda $0400,y
bit $02 ;apply mask without destroying A
bne +
clc
adc #$10
+
sec ;we need to set carry :-(
sbc #$11
A second case in which to use SBX is in combination with LAX, for example when doing:
lda $02
clc
adc #$08
tax
that can be easily sustituted by:
lax $02 ;A = X = M [$02]
sbx #$f8 ;X = (A & X) - -8
Even multiple subtractions can be made if A stays untouched, and it is also sufficient if A is $ff to disable the AND component of the opcode:
lda #$ff
ldx val
sbx #$07
... some code that does not clobber A,X
sbx #$08
...
That also means, that we can easily implement a counter in X:
txa
clc
adc #value
tax
;can now be:
txa ;A = X
sbx #-value ;X = A & X -> X = X and then X = X - - value -> X = X + value
;voila, new value is in X, all done in 4 cycles
;or even 2 cycles if A stays $ff as in the example above:
lda #$ff
sbx #-value
...
sbx #-value
...
So we saved 4 cycles here, as the state of the carry is of no interest for the subtract done by SBX, what is one of its big advantages. Thus we could also fake an ADD or SUB with that command. The and-operation is not needed here, but does not harm. If there's use for it, just let A or X be loaded with the right value for the and-mask.
Another trick that makes use of the SBX command is the negation of a 16 bit number (thanks to andym00!):
lax #$00 ;should be save, as #$00 is loaded
sbx #lo ;sets carry automatically for upcoming sbc
sbc #hi
One might also think of extending this trick to negate two 8 bit numbers (A, X) at a time.
Furthermore, the SBX command can also be used to apply a mask to an index easily:
ldx #$03 ;mask
lda val1 ;load value
sbx #$00 ;mask out lower 4 bits -> X
lsr ;A is untouched, so we can continue doing stuff with A
lsr
sta val1
lda colors,x ;fetch color from table
;instead of (takes 3 cycles more)
lda val1
and #$03
tax ;setup index
lsr val1 ;A is clobbered, so shift direct
lsr val1
lda colors,x
The described case makes it easy to decode 4 multicolor pixelpairs by always setting up an index from the lowest two bits and fetching the appropriate color from a previously set up table.
In certain cases sbx might even help out to eor X with a value:
ldx #$07
txa ;make A = X so that and-component of sbx does not destroy X
sbx #$ff ;-> X = X eor $ff
This would also work with other values, for e.g. with $80 to toggle the MSB of X.
===== DCP/ISC =====
Thanks go to LHS/Ancients Pledge Inc./Padua for the upcoming example and hints on this illegal opcode:
x1 !byte $7
x2 !byte $1a
;an effect
-
dec x2
lda x2
cmp x1
bne -
can be written as
;an effect
-
lda x1
dcp x2 ;decrements x2 and compares x2 to A
bne -
Another good use can be made if you want to do a inc/dec ($xx),y what is actually not available. So here isc/dcp ($xx),y will help you out, as it is also available for the indirect y adressing mode.
f.e.:
ldy #..
lda (zp),y
clc
adc #..
sta (zp),y
bcc +
iny
isc (zp),y
+
or
ldy #..
lda (zp),y
sec
sbc #..
sta (zp),y
bcs +
iny
dcp (zp),y
+
For decrementing a 16 bit pointer it is also of good use:
lda #$ff
dcp ptr
bne *+4
dec ptr+1
;carry is set always for free \o/
Under certain circumstances the command can be also misused to decrement a value and set/clear the carry for free while doing so.
sec
lda scr
sbc #$28
sta scr
lda #$bf
bcs +
dcp scr+1 ;sets carry for free
+
adc bmp
sta bmp
lda bmp+1
sbc #$01
sta bmp+1
;... or
lda dst
sbc #$08
sta dst
bcs *+4
dcp dst+1
;sets carry for free as long as dst+1 is <= $f8
So here the carry is set for free as long as the content of scr+1 is lower than $bf.
More examples, (also using ISC) can be found [[base:advanced_optimizing#incrementing_related_pointers|here]].
===== SRE/SLO =====
SRE shifts the content of a memory location to the right and eors the content with A, while SLO shifts to the left and does an OR instead of EOR.
So this is nice to combine the previous described 8 bit counter with for e.g. setting pixels:
lda #$80
sta pix
...
lda (zp),y
sre pix ;shift mask one to the right and eor mask with A
bcs advance_column ;did the counter under-run? so advance column
sta (zp),y
...
advance_column
ror pix ;reset counter
lda zp ;advance column
;clc ;is still clear
adc #$08
sta zp
bcc +
inc zp+1
+
lda (zp),y
ora #$80 ;set first pixel
sta (zp),y
===== DOP/TOP =====
For saving space (not cycles, in fact this is pretty expensive!), BIT is your friend as you can "jump over" a one or two byte command via a BIT instruction. Consider this code:
beq +
lda #$04
sta somewhere
rts
+ lda #$05
sta somewhere
rts
This can be reduced using a BIT instruction. The opcode for BIT is $2c:
beq +
lda #$04
.byte $2c
+ lda #$05
sta somewhere
rts
This way, the processor will see a bit $05a9 after the lda #$04. The lda #$05 is not executed as it is part of the BIT command. And the BIT command does not change any registers, it only sets some flags. This can be stacked endlessly:
lda #$04
.byte $2c
foo1 lda #$05
.byte $2c
foo2 lda #$06
.byte $2c
foo3 lda #$07
...
sta somewhere
rts
In much the same way, also one byte commands can be "ignored" using the opcode for BIT zeropage, $24.
Alternatively one can also use the illegal opcodes DOP and TOP to skip bytes. The advantage of those is, that flags stay untouched.
bmi .lz_short ;continue with y = $ff
.lz_far
eor #$ff
tay ;y = - a - 1
lda $beef,x
inx
bne .lz_join
jsr .lz_next_sector
top ;one could also branch/jump to skip the ldy #$ff
;but this way only one byte is needed
.lz_short
ldy #$ff
.lz_join
====== Penalty cycles ======
Those cycles can be a pain in the arse when doing cycle exact timing, but they can also steal us cycles on other occasions. So let us recall on when such an additional cycle is consumed:
* when our index + address crosses a page (read operations)
* branching over a page boundary
So to avoid wasting lots of cycles we need to align tables properly, usually to a page boundary to avoid an overflow on the index. If our code crosses a page we should avoid placing a loop at that edge, as else one penalty cycle is consumed on branching back to the beginning of the loop:
.C:0ffc A2 00 LDX #$00
.C:0ffe 9D 00 20 LDA $2080,X ;needs 1 cycle extra if X >= $80
.C:1001 E8 INX
.C:1002 D0 FA BNE $0FFE ;needs 4 cycles if branch is taken
Best is to add some warnings around important loops, so that you get a notice when you loop is badly aligned. For ACME for e.g. you could do that by comparing the highbytes of the loop's start- and end-address:
ldx #$00
loop1
sta $2000,x
inx
bne loop1
!if >* != >loop1 { !warn "loop1 crosses page!" }
Sometimes we are happy if we manage to have a branch always condition and thus save one byte of code. But be careful to not waste valueable cycles if this branch is going to cross a page boundary. In that case jmp is cheaper but wastes more bytes.
====== Forming terms ======
Sometimes forming terms helps in creating more efficient code. Imagine you have a term A - B where you want to reuse B afterwards, you might do:
lda $dead
sec
sbc $beef
sta $02
lda $beef
sta $03
;20 cycles
By forming the term to - B + A you will result in the same value, but now as the order is changed the reloading of B can be omitted. There's also the possibility to use LAX and TXA if you can allow for using X.
lda $beef
sta $03
sec
eor #$ff
adc $dead
sta $02
;18 cycles
So always try to form the term into something new and see if it performs better this way. So just remember the simple mathematic laws.
Now also think of that classical negation term:
lda num
eor #$ff
clc
adc #$01
sta neg
Depending on what you have in register A, you can express it in many different ways:
;a = $ff; carry set
eor num
adc #$00
sta neg
;a = $00; carry set;
sbc num
sta neg
;a = $ff; carry clear
adc num
eor #$ff
sta neg
;a = $00; carry clear;
adc #$01
sbc num
sta neg
;num in a, carry set
lda num
sbc #$01
eor #$ff
There are of course also other expressions possible, just ponder a while about the term. Also the carry flag after the negation can be influenced, depending on using sbc or adc for most cases ($00/$ff will cause an overflow).
How about forming terms with logical operations? We notice, that for e.g. (a + b) xor $ff is the same as (a xor $ff) - b:
lda num1
clc
adc num2
eor #$ff
;can also be written as
lda num1
eor #$ff
sec
sbc num2
====== Running out of registers ======
Under certain circumstances you can use the stack-pointer as a 4th register. That is, when the code segment using the txs/tsx is not called as a subroutine and returning at some point. Also no data and code must be located inside the stack, as the next IRQ will write three bytes at the stack-pointers position and thus trash data there. But as long as those prerequisites are matched, SP can be used as an additional register to stow away data.
lax (table),y
txs
ldx #$07
sbx #$00
ldy pos
...
...
tsx ;fetch value from table again
====== Limiting and masking ======
Sometimes it occurs, that we want to extract the low nibble of a value and limit it to a given range.
bpl .positive
cmp #$f0
bcs +
lda #$f0
+
and #$0f
As you can see, we limit the value to $f0 .. $ff first and then clamp of the highnibble to end up with values that range from $00..$0f
Observe, how this can be done cheaper, by just shifting the range and making use of the wrap around of 8 bits/carry:
bpl .positive
;clc
adc #$10
bcs +
lda #$00
+
We add $10 so the limit is then reached, depending on the carry. As we now wrapped the 8 bits by overflowing, the upper bits are already zero and we can forgo on the and #$0f component. The lownibble is not affected, as we focus on the lower 4 bits only.
====== Misc stuff ======
===== Postponing branches =====
Did you ever run into such a situation where you need to take a decision within your code but are in need of the registers for other purposes. Under certain circumstances you can perfectly work around that problem by compare in time and branching later on. This avoids reloading registers for the sake of comparing and thus wasting cycles:
-
;... some code
ldx counter
cpx #100 ;make decision here as long as X still contains counter -> sets/clears carry
ldy color1,x
lda color2,x
tax
lda color3
bcc - ;carry state still untouched, so we can still branch
===== Incrementing related pointers =====
Imagine you have a pointer into a bitmap and a corresponding screen. While the screenpointer increments/decrements by 1 the bitmappointer does so by 8. So usually one could do so by:
;increment
lda bmp
clc
adc #$08
sta bmp
bcc *+4
inc bmp+1
inc scr
bne *+4
inc scr+1
;decrement
lda bmp
sec
sbc #$08
sta bmp
bcs *+4
dec bmp+1
lda #$ff
dcp scr
bne *+4
dec scr+1
This needs bestcase 21 cycles.
But actually the check on the overflow of the lowbyte from the screenpointer is only necessary when the bitmappointer's lowbyte also overflows:
lda bmp
clc
adc #$08
sta bmp
inc scr
bcc ++ ;bitmap did not overrun, finished
bne + ;screen lowbyte did overrun?
inc scr+1
+
isc bmp+1 ;force carry clear for free (a = 0 -> a - bmp+1 will allways underflow)
++
Quite some brainfuck, but only needs 18 cycles bestcase. And here's the decrement case that saves another 2 cycles compared to the above example:
lda bmp ;could also use lax bmp, sbx #$08, stx bmp to save more cycles
sec
sbc #$08
sta bmp
bcs +
dcp bmp+1 ;forces carry set for free as long as bmp+1 is <= $f8 (the underflow result in A from above subtraction)
lda scr
bne +
dec scr+1
+
dec scr
===== Playing with the programcounter's wraparound =====
When you place code in the zeropage you might run out of space, but when placing code at the beginning of the zeropage and at the end of the ram above $ff80 you can easily branch back and forth and across the wrap-around. Means one can branch with a beq *-$70 back from zeropage to $ffxx and back to zeropage when branching into the other direction. Sometimes that is a good thing when you can avoid far branches done with a combination of a branch and jump.
So you can easily do things like the following iif your assembler supports it by labels, if not, you have to do the branches manually with a *+$xx or *-$xx:
fff3 90 18 BCC $000D ;directly go to zeropage, no need for a far jump, as this wraps around
...
002c 30 87 BMI $FFB5 ;same here
===== Reuse of intrinsic information =====
When your code handles conditions, the code branched to holds some intrinsic information, this can be used to restore values and registers or to save other overhead. The upcoming example unrolls a loop that sets bytes in mem from y to 7:
ldx ymul3,y
stx .bra+1
ldy #$07
.bra bpl *
sta (dst),y
dey
sta (dst),y
dey
sta (dst),y
dey
sta (dst),y
dey
sta (dst),y
dey
sta (dst),y
dey
sta (dst),y
dey
ymul3
!byte $00,$03,$06,$09,$0c,$0f,$12
As you see, the branch destination still contains the information about Y (Y * 3) that is then again restored by the right amount of dey. So this loop leaves Y unclobbered.
When [[dispatch_on_a_byte|dispatching on a byte]] the value being dispatched on can be safely assumed and for e.g. set in the corresponding code segment with ldy #dispatch_val.
===== Executable parameters =====
Sometimes we are happy and a parameter, be it in a register or in an other opcode happens to just be the opcode that we want to execute, either directly or in another case when branching. Thus the jmp $dd0c trick happens to work, but also other scenarios could be possible:
.C:0010 68 PLA
.C:0011 E9 00 SBC #$00
.C:0013 01 06 BCS $001B
.C:0015 69 00 ADC #$00
.C:0017 48 PHA
.C:0018 29 0F AND #$0F
.C:001a 85 48 STA $48
As A is modified in the one case a common point where both paths merge again with a PHA is impossible. But by storing A at a wise address ($48 = opcode PHA) we can successful merge both parts at $001c without further awkwardness like an additional branch/jump. Thanks a lot to lft for pointing me to this!
**HAPPY OPTIMIZING!**
Bitbreaker/Performers^Nuance