by Vladimir Lidovski aka litwr, 13 Aug 2016 (with help of BigEd)
It is well known that the best, the fastest sort routine is Quicksort. It is very odd that its implementations for 6502 for all of 42 years (from 1975 to 2016) have a bit blurred and unofficial status. The main problem is in the stack depending nature of Quicksort and the stack limit of 256 bytes of 6502 architecture. It is solvable.
The next Pascal code was translated to 6502 assembler.
(* it sorts array of longint Basearray with the index range from 1 to Size *) (* it should be invoked by QuickSort(1, Size) *) (* Basearray elements and x local variable type maybe any numeric, string, set, ... type *) Procedure QuickSort(LBound, UBound: word); var i, j: word; x: longint; begin i := LBound; j := UBound; x := BaseArray[(i + j) div 2]; repeat while BaseArray[i] > x do inc(i); while x > BaseArray[j] do dec(j); if i <= j then begin Exchange(i, j); inc(i); dec(j) end until i > j; if LBound < j then QuickSort(LBound, j); if i < UBound then QuickSort(i, UBound) end;
TMPx assembler is used but it is very easy to convert it to any other 6502 assembler syntax.
i2lo = zp0lo i2hi = zp0hi j2lo = zp2lo j2hi = zp2hi x_lo = m2lo x_hi = m2hi quicksort tsx cpx #16 ;stack limit bcs qsok qs_csp ldx #0 dex dex txs qsok lda $103,x sta i2lo lda $104,x sta i2hi ldy $106,x sty j2hi lda $105,x sta j2lo clc ;this code works only for the even align adc i2lo and #$fc sta zp1lo tya adc i2hi ror sta zp1hi ror zp1lo ldy #0 lda (zp1lo),y sta x_lo iny lda (zp1lo),y sta x_hi qsloop1 ldy #0 ;compare array[i] and x lda (i2lo),y cmp x_lo iny lda (i2lo),y sbc x_hi bcs qs_l1 lda #2 adc i2lo sta i2lo bcc qsloop1 inc i2hi bne qsloop1 ;=jmp qs_l1 ldy #0 ;compare array[j] and x lda x_lo cmp (j2lo),y iny lda x_hi sbc (j2lo),y bcs qs_l3 lda j2lo sbc #1 sta j2lo bcs qs_l1 dec j2hi bne qs_l1 ;=jmp qs_l3 lda j2lo ;compare i and j cmp i2lo lda j2hi sbc i2hi bcc qs_l8 qs_l6 lda (j2lo),y ;exchange elements with i and j indices tax lda (i2lo),y sta (j2lo),y txa sta (i2lo),y dey bpl qs_l6 ;clc lda #1 ;CY=1 adc i2lo sta i2lo bcc *+4 inc i2hi sec lda j2lo sbc #2 sta j2lo bcs *+4 dec j2hi ;lda j2lo cmp i2lo lda j2hi sbc i2hi ;bcc *+5 ;jmp qsloop1 bcs qsloop1 qs_l8 tsx lda $103,x cmp j2lo lda $104,x sbc j2hi bcs qs_l5 lda i2hi pha lda i2lo pha lda j2hi pha lda j2lo pha lda $104,x pha lda $103,x pha jsr quicksort pla pla pla pla pla sta i2lo pla sta i2hi tsx qs_l5 lda i2lo cmp $105,x lda i2hi sbc $106,x bcs qs_l7 lda $106,x pha lda $105,x pha lda i2hi pha lda i2lo pha jsr quicksort pla pla pla pla qs_l7 rts
zp0lo, zp0hi, zp1lo, zp1hi, zp2lo, zp2hi are zero page bytes. Low byte should precede high. m2lo and m2hi are bytes which maybe situated anywhere in RAM but it is better for speed to put them at zero page too. The invocation should be in the next form.
lda #>array+size2-2 pha lda #<array+size2-2 pha lda #>array pha lda #<array pha tsx stx qs_csp+1 jsr quicksort pla pla pla pla
array is the address of the array and size2 is its size in bytes. So size2 is twice bigger then the number of 2 bytes integer in the array for the sort. array maybe any even address above $200.
The main trick is the work with the stack. The routine reserves 16 stack bytes for interrupts. It is possible to reduce this number to 0 if all interrupts are disabled. It maybe made dynamically. So if the stack has only 16 bytes free then the interrupts should be disabled and if there is more than 16 free bytes than the interrupts should be enabled. This dynamic control will consume only about 10 more bytes in the code but its effect is small. Some systems use NMI interrupts and this makes such dynamic control more complex. The systems with NMI may require more than 16 free bytes in the stack. IMHO 30 bytes will be enough for any system.
Several time measurements (in seconds) were made for the Quicksort and the Fredrik Ramsberg's Shell and Insertion sorts. Commodore +4 is used. Its CPU works at 1.14 MHz average frequency.
|1024 Integers||Sort Type||Random||Ordered||Reversed||Zeros|
|4096 Integers||Sort Type||Random||Ordered||Reversed||Zeros|
|12288 Integers||Sort Type||Random||Ordered||Reversed||Zeros|
Random, Ordered, Reversed, Zeros mean the type of array filling. Random filling just copies ROM content into array. Ordered filling uses numbers from 0 with step 1. Reversed filling uses numbers from $ffff with step -1. Zeros filling is just an array filled with the zeros only.
So Quicksort is more than two times faster than Shell Sort. Its code occupies together with its call wrapping 259 bytes. It also uses 6 bytes of zero page and 2 bytes of any RAM. The Shell sort requires about 250 bytes for the code and data and it doesn't use stack. It also uses up to 14 bytes at zero page. All these 14 bytes were used in the mentioned above measurements.
This version of Quicksort requires almost all stack (about 240 bytes) to sort fast more than 32 KB of data. The required minimum is about 176 bytes but the stack with the only such minimum may slow down sorting dramatically. If the stack has less free bytes than this minimum then this may give the meditation, the endless loop.
It is possible to reduce the stack load by tail call elimination. It makes Quicksort slightly faster (only by 3-4%) but reduces the stack load more than 50%. So 128 free bytes in the stack will make the sort of 60 KB data without delays.
i2lo = zp0lo i2hi = zp0hi j2lo = zp2lo j2hi = zp2hi x_lo = m2lo x_hi = m2hi ublo = m3lo ubhi = m3hi lblo = m4lo lbhi = m4hi quicksort0 tsx cpx #16 ;stack limit bcs qsok qs_csp ldx #0 txs quicksort lda #>array+size2-2 sta ubhi lda #<array+size2-2 sta ublo lda #>array sta lbhi lda #<array sta lblo tsx stx qs_csp+1 qsok lda lblo sta i2lo lda lbhi sta i2hi ldy ubhi sty j2hi lda ublo sta j2lo clc ;this code works only for the even align adc i2lo and #$fc sta zp1lo tya adc i2hi ror sta zp1hi ror zp1lo ldy #0 lda (zp1lo),y sta x_lo iny lda (zp1lo),y sta x_hi qsloop1 ldy #0 ;compare array[i] and x lda (i2lo),y cmp x_lo iny lda (i2lo),y sbc x_hi bcs qs_l1 lda #2 adc i2lo sta i2lo bcc qsloop1 inc i2hi bne qsloop1 ;=jmp qs_l1 ldy #0 ;compare array[j] and x lda x_lo cmp (j2lo),y iny lda x_hi sbc (j2lo),y bcs qs_l3 lda j2lo sbc #1 sta j2lo bcs qs_l1 dec j2hi bne qs_l1 ;=jmp qs_l3 lda j2lo ;compare i and j cmp i2lo lda j2hi sbc i2hi bcc qs_l8 qs_l6 lda (j2lo),y ;exchange elements with i and j indices tax lda (i2lo),y sta (j2lo),y txa sta (i2lo),y dey bpl qs_l6 ;sec lda #1 ;CY=1 adc i2lo sta i2lo bcc *+4 inc i2hi sec lda j2lo sbc #2 sta j2lo bcs *+4 dec j2hi ;lda j2lo cmp i2lo lda j2hi sbc i2hi ;bcc *+5 ;jmp qsloop1 bcs qsloop1 qs_l8 lda lblo cmp j2lo lda lbhi sbc j2hi bcs qs_l5 lda i2hi pha lda i2lo pha lda ubhi pha lda ublo pha lda j2hi sta ubhi lda j2lo sta ublo jsr quicksort0 pla sta ublo pla sta ubhi pla sta i2lo pla sta i2hi qs_l5 lda i2lo cmp ublo lda i2hi sbc ubhi bcs qs_l7 lda i2hi sta lbhi lda i2lo sta lblo jmp qsok qs_l7 rts
The locations m3hi, m3lo, m4hi, m4lo maybe situated anywhere in RAM. The invocation just a call to quicksort routine. However it is required to put the proper constants after quicksort label. This makes the invocation code more complex in the general case.