This is an old revision of the document!

Quicksort (for 16-bit Elements)

by Vladimir Lidovski aka litwr, 13 Aug 2016

It is well known that the best, the fastest sort routine is Quicksort. It is very odd that it is not implemented for 6502 during 42 years, since 1975 to 2016. The main problem is in the stack depending nature of Quicksort and the stack limit of 256 bytes of 6502 architecture. It is solvable.

The next Pascal code was translated to 6502 assembler.

(* it sorts array of longint Basearray with the index range from 1 to Size *)
(* it should be invoked by QuickSort(1, Size) *)
(* Basearray elements and x local variable type maybe any numeric, string, set, ... type *)
Procedure QuickSort(LBound, UBound: word);
  var
    i, j: word;
    x: longint;
  begin
    i := LBound;
    j := UBound;
    x := BaseArray[(i + j) div 2];
    repeat
      while BaseArray[i] > x do inc(i);
      while x > BaseArray[j] do dec(j);
      if i <= j then begin
        Exchange(i, j);
        inc(i);
        dec(j)
      end
    until i > j;
    if LBound < j then QuickSort(LBound, j);
    if i < UBound then QuickSort(i, UBound)
  end;

TMPx assembler is used but it is very easy to convert it to any other 6502 assembler syntax.

i2lo = zp0lo
i2hi = zp0hi
j2lo = zp2lo
j2hi = zp2hi
x_lo = m2lo
x_hi = m2hi

quicksort tsx
          cpx #16  ;stack limit
          bcs qsok

qs_csp    ldx #0
          dex
          dex
          txs
qsok      lda $103,x
          sta i2lo
          lda $104,x
          sta i2hi
          ldy $106,x
          sty j2hi
          lda $105,x
          sta j2lo
          clc        ;this code works only for the even align
          adc i2lo
          and #$fc
          sta zp1lo
          tya
          adc i2hi
          lsr
          sta zp1hi
          ror zp1lo

          ldy #0
          lda (zp1lo),y
          sta x_lo
          iny
          lda (zp1lo),y
          sta x_hi
qsloop1   ldy #0
          lda (i2lo),y
          cmp x_lo
          iny
          lda (i2lo),y
          sbc x_hi
          bcs qs_l1

          lda #2
          adc i2lo
          sta i2lo
          bcc qsloop1

          inc i2hi
          bne qsloop1 ;=jmp

qs_l1     ldy #0
          lda x_lo
          cmp (j2lo),y
          iny
          lda x_hi
          sbc (j2lo),y
          bcs qs_l3

          lda j2lo
          sbc #1
          sta j2lo
          bcs qs_l1

          dec j2hi
          bne qs_l1 ;=jmp

qs_l3     lda j2lo
          cmp i2lo
          lda j2hi
          sbc i2hi
          bcc qs_l8

qs_l6     lda (j2lo),y
          tax
          lda (i2lo),y
          sta (j2lo),y
          txa
          sta (i2lo),y
          dey
          bpl qs_l6

          ;clc
          lda #1        ;CY=1
          adc i2lo
          sta i2lo
          bcc *+4
          inc i2hi
          sec
          lda j2lo
          sbc #2
          sta j2lo
          bcs *+4
          dec j2hi
          ;lda j2lo
          cmp i2lo
          lda j2hi
          sbc i2hi
          ;bcc *+5
          ;jmp qsloop1
          bcs qsloop1

qs_l8     tsx
          lda $103,x
          cmp j2lo
          lda $104,x
          sbc j2hi
          bcs qs_l5

          lda i2hi
          pha
          lda i2lo
          pha
          lda j2hi
          pha
          lda j2lo
          pha
          lda $104,x
          pha
          lda $103,x
          pha
          jsr quicksort
          pla
          pla
          pla
          pla
          pla
          sta i2lo
          pla
          sta i2hi
          tsx
qs_l5     lda i2lo
          cmp $105,x
          lda i2hi
          sbc $106,x
          bcs qs_l7

          lda $106,x
          pha
          lda $105,x
          pha
          lda i2hi
          pha
          lda i2lo
          pha
          jsr quicksort
          pla
          pla
          pla
          pla
qs_l7     rts

zp0lo, zp0hi, zp1lo, zp1hi, zp2lo, zp2hi are zero page bytes. Low byte should precede high. m2lo and m2hi are bytes which maybe situated anywhere in RAM but it is better for speed to put them at zero page too. The invocation should be in the next form.

          lda #>array+size2-2
          pha
          lda #<array+size2-2
          pha
          lda #>array
          pha
          lda #<array
          pha
          tsx
          stx qs_csp+1
          jsr quicksort
          pla
          pla
          pla
          pla

array is the address of the array and size2 is its size in bytes. So size2 is twice bigger then the number of 2 bytes integer in the array for the sort. array maybe any even address above $200.

The main trick is the work with the stack. The routine reserves 16 stack bytes for the interrupts. It is possible to reduce this number to 0 if the interrupts are disabled. It maybe made dynamically. So if stack has only 16 bytes free then the interrupts should be disabled and if there is more than 16 free bytes than the interrupts should be enabled. This dynamic control will consume only about 10 more bytes in the code but its effect is small. Some systems use NMI interrupts and this makes such dynamic control more complex. The systems with NMI may require more than 16 free bytes in stack. IMHO 30 bytes will be enough for any system.

Several time measurements (in seconds) were made for the Quicksort and the Fredrik Ramsberg's Shell and Insertion sorts. Commodore +4 is used. Its CPU works at 1.14 MHz average frequency.

Sort Type	Random	Ordered	Reversed	Zeros
Insertion	21.4	0.14	39.75	0.16
Shell	1.75	0.8	1.12	0.82
Quick	0.75	0.4	0.43	0.9

Sort Type	Random	Ordered	Reversed	Zeros
Insertion	317.98	0.6	635.13	0.58
Shell	8.12	4.02	5.25	4
Quick	3.63	1.85	2.02	4.13

Sort Type	Random	Ordered	Reversed	Zeros
Insertion	2877.48	1.78	5714.08	1.75
Shell	30.18	13.82	18.05	13.81
Quick	11.67	6.78	7.28	12.93

Random, Ordered, Reversed, Zeros mean the type of array filling. Random filling just copies ROM content into array. Ordered filling uses numbers from 0 with step 1. Reversed filling uses numbers from $ffff with step -1. Zeros filling is just an array filled with the zeros only.

So Quicksort is more than two times faster than Shell Sort. Its code occupies together with its call wrapping 259 bytes. It also uses 6 bytes of zero page and 2 bytes of any RAM. The Shell sort requires about 250 bytes for the code and data and it doesn't use stack. It also uses up to 14 bytes at zero page. All these 14 bytes were used in the mentioned above measurements.