# Shell Sort (for 16-bit Elements)

by Fredrik Ramsberg, 29 Dec 2004

Here's my implementation of Shell Sort, a sort algorithm with rather amazing properties. Its running time depends on the gap sequence used, and a complete mathematical analysis has eluded everyone who has tried so far, but there's lots of empirical data that suggests it behaves roughly like O(n log n) in practice. I wish I could say I invented this algorithm, but I didn't. Here is an equivalent JavaScript routine that mimics the Shell Sort presented here:

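    function ShellSort(arr, length) {
      var j, i, v, h;
      // Find the first h in the sequence 1, 4, 13, 40, ... that is >= length
      for (h = 1; h < length; h = 3 * h + 1);
      // Step back down through the sequence; the loop ends when h reaches 0
      while (h = (h - 1) / 3)
        // i < length (not <=), so the routine never writes past the last element
        for (i = h, j = i, v = arr[i]; i < length; arr[j + h] = v, i++, j = i, v = arr[i])
          while ((j -= h) >= 0 && arr[j] > v)
            arr[j + h] = arr[j];
    }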

This is source code for what is meant to be an efficient implementation of Shell Sort in 6502 assembler. This implementation can sort more than 32,000 16-bit values. The only reason it can't sort 32,767 values is that there still has to be room for the routine and a few bytes for temporary storage. There's presently no other sorting routine in the repository that can handle more than 256 values. The source code is in a format suitable for the excellent and free ACME cross-assembler by Marco Baye, but should be easy to convert for other assemblers.

While Shell Sort is very good for entirely unsorted arrays, it is also reasonably good for almost sorted arrays. However, if you happen to know that very few values are out of place OR that the values that are out of place are not very far from their right position, Insertion Sort is a better choice. Insertion Sort is also provided here, since Shell Sort is really just a clever extension of Insertion Sort.
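The relationship between the two sorts can be sketched in JavaScript. This is only an illustration, not a translation of the 6502 source: the function names `gappedInsertionPass`, `shellSortWithGaps` and `insertionSortOnly` are made up for this example, and the gap list is passed in by the caller.

```javascript
// One insertion-sort pass over elements that are h positions apart.
// With h = 1 this is plain Insertion Sort.
function gappedInsertionPass(arr, h) {
  for (let i = h; i < arr.length; i++) {
    const v = arr[i];
    let j = i - h;
    while (j >= 0 && arr[j] > v) {
      arr[j + h] = arr[j]; // shift the larger element up by one gap
      j -= h;
    }
    arr[j + h] = v;
  }
  return arr;
}

// Shell Sort: run passes with shrinking gaps, ending with h = 1.
function shellSortWithGaps(arr, gaps) {
  for (const h of gaps) {
    if (h < arr.length) gappedInsertionPass(arr, h);
  }
  return arr;
}

// Insertion Sort is the special case with the single gap 1.
function insertionSortOnly(arr) {
  return shellSortWithGaps(arr, [1]);
}

console.log(shellSortWithGaps([9, 3, 7, 1, 8, 2, 5], [4, 1]));
console.log(insertionSortOnly([9, 3, 7, 1, 8, 2, 5]));
```

The early passes with large gaps move far-away values most of the way home cheaply, so the final pass with gap 1 has little left to do.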

Here are some examples of sort times at 1 MHz (10,000 values):

Initial state of the array | Insertion Sort | Shell Sort |
---|---|---|
Array is entirely sorted from the start | 1.9s | 14.7s |
1 value is at the wrong end of the array | 2.9s | 15.7s |
10 values are at the wrong end of the array | 11.8s | 16.9s |
50 values are at the wrong end of the array | 51.3s | 17.6s |
Array is entirely unsorted | 2464.2s | 30.5s |

To call the routine, create a word-array at address nnnn in memory. The first word should contain the number of bytes to be sorted (= 2 * the number of elements), followed by the elements themselves. Next, sort the elements using Shell Sort like this:

    lda #<nnnn
    ldx #>nnnn
    jsr shell_sort

or to perform an Insertion Sort:

    lda #<nnnn
    ldx #>nnnn
    jsr insertion_sort

In the code snippets above, < means the low byte and > means the high byte. Some assemblers use nnnn & $FF for the low byte and nnnn >> 8 for the high byte.
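The memory layout the routine expects can be sketched in JavaScript as follows. The helper names `lowByte`, `highByte` and `buildSortArray` are made up for this example. The sketch assumes each word is stored low byte first, which matches how the source reads the length word (low byte at offset 0, high byte at offset 1).

```javascript
// Low and high byte of a 16-bit value, as < and > give in the assembler.
function lowByte(n) {
  return n & 0xff;
}
function highByte(n) {
  return (n >> 8) & 0xff;
}

// Build the bytes of a sortable array: a length word (counted in bytes),
// followed by the elements, each stored low byte first.
function buildSortArray(values) {
  const byteCount = 2 * values.length;
  const bytes = [lowByte(byteCount), highByte(byteCount)];
  for (const v of values) {
    bytes.push(lowByte(v), highByte(v));
  }
  return bytes;
}

// Three elements, so the length word holds 6.
console.log(buildSortArray([0x1234, 0x0005, 0xffff]));
```

The resulting bytes would be placed at address nnnn before loading the low byte of nnnn into A and the high byte into X.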

Source Code for the Shell Sort (with Insertion Sort):

    !to "shellsrt.o"        ; An assembler directive to set out-file
    !sl "shelllbl.a"        ; Tells the assembler to write all label
                            ; values to a file
    *=$1000                 ; Start address. Can safely be set to
                            ; anything from $0100 to $fe00

    j=$fb                   ; Uses two bytes. Has to be on zero-page
    j_plus_h=$fd            ; Uses two bytes. Has to be on zero-page
    arr_length = j_plus_h   ; Can safely use the same location as
                            ; j_plus_h, but doesn't have to be on ZP

    shell_sort
            ldy #h_high - h_low - 1
            bne sort_main   ; Always branch
    insertion_sort
            ldy #0
    sort_main
            sty h_start_index
            cld
            sta j
            sta in_address
            clc
            adc #2
            sta arr_start
            stx j + 1
            stx in_address + 1
            txa
            adc #0
            sta arr_start + 1
            ldy #0
            lda (j),y
            sta arr_length
            clc
            adc arr_start
            sta arr_end
            iny
            lda (j),y
            sta arr_length + 1
            adc arr_start + 1
            sta arr_end + 1

    ; for (h=1; h < length; h=3*h+1);
            ldx h_start_index ; Start with highest value of h
    chk_prev_h
            lda h_low,x
            cmp arr_length
            lda h_high,x
            sbc arr_length + 1
            bcc end_of_init ; If h < array_length, we've found the right h
            dex
            bpl chk_prev_h
            rts             ; array length is 0 or 1. No sorting needed.

    end_of_init
            inx
            stx h_index

    ; while (h=(h-1)/3)
    h_loop
            dec h_index
            bpl get_h
            rts             ; All done!

    get_h
            ldy h_index
            lda h_low,y
            sta h
            clc
            adc in_address  ; ( in_address is arr_start - 2)
            sta i
            lda h_high,y
            sta h + 1
            adc in_address + 1
            sta i + 1

    ; for (i=h, j=i, v=arr[i]; i<length; arr[j+h]=v, i++, j=i, v=arr[i])
    i_loop
            lda i
            clc
            adc #2
            sta i
            sta j
            lda i + 1
            adc #0
            sta i + 1
            sta j + 1
            ldx i
            cpx arr_end
            lda i + 1
            sbc arr_end + 1
            bcs h_loop
            ldy #0
            lda (j),y
            sta v
            clc
            adc #1
            sta v_plus_1
            iny
            lda (j),y
            sta v + 1
            adc #0
            bcs i_loop      ; v=$ffff, so no j-loop necessary
            sta v_plus_1 + 1
            dey             ; Set y=0

    ; while((j-=h) >= 0 && arr[j] > v)
    j_loop
            lda j
            sta j_plus_h
            sec
            sbc h
            sta j
            tax
            lda j + 1
            sta j_plus_h + 1
            sbc h + 1
            sta j + 1
    ; Check if we've reached the bottom of the array
            bcc exit_j_loop
            cpx arr_start
            sbc arr_start + 1
            bcc exit_j_loop
    ; Do the actual comparison: arr[j] > v
            lda (j),y
            tax
            iny             ; Set y=1
            lda (j),y
            cpx v_plus_1
            sbc v_plus_1 + 1
            bcc exit_j_loop
    ; arr[j+h]=arr[j];
            lda (j),y
            sta (j_plus_h),y
            dey             ; Set y=0
            txa
            sta (j_plus_h),y
            bcs j_loop      ; Always branch

    ; for (i=h, j=i, v=arr[i]; i<length; arr[j+h]=v, i++, j=i, v=arr[i]) *** arr[j+h]=v part
    exit_j_loop
            lda v
            ldy #0
            sta (j_plus_h),y
            iny
            lda v + 1
            sta (j_plus_h),y
            jmp i_loop

    ; This describes the sequence h(0)=1; h(n)=k*h(n-1)+1 for k=3 (1,4,13,40...)
    ; All word-values are multiplied by 2, since we are sorting 2-byte values
    h_low           !byte <2, <8, <26, <80, <242, <728, <2186, <6560, <19682
    h_high          !byte >2, >8, >26, >80, >242, >728, >2186, >6560, >19682
    h_start_index   !byte 0
    h_index         !byte 0
    h               !word 0
    in_address      !word 0
    arr_start       !word 0
    arr_end         !word 0
    i               !word 0
    v               !word 0
    v_plus_1        !word 0
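The values in the h_low and h_high tables at the end of the source can be reproduced with a few lines of JavaScript. This is only a sketch for checking or extending the tables. The function name `gapTable` and its `count` parameter are made up for this example.

```javascript
// Generate the first `count` terms of h(0)=1, h(n)=3*h(n-1)+1,
// doubled to byte offsets because each element occupies two bytes,
// and split into low-byte and high-byte tables.
function gapTable(count) {
  const offsets = [];
  const low = [];
  const high = [];
  let h = 1;
  for (let n = 0; n < count; n++) {
    const byteOffset = 2 * h;
    offsets.push(byteOffset);
    low.push(byteOffset & 0xff);
    high.push(byteOffset >> 8);
    h = 3 * h + 1;
  }
  return { offsets, low, high };
}

// The source lists nine entries.
const table = gapTable(9);
console.log(table.offsets);
console.log(table.low, table.high);
```

Since shell_sort loads its starting index from the table size (`h_high - h_low - 1`), the starting point follows automatically if the tables are lengthened or shortened, as long as h_high directly follows h_low.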

To increase speed and reduce code size, you can optionally place one or more of these 2-byte fields on zero-page (the suggested values work on a Commodore 64):

    v_plus_1 = $5
    h = $7
    arr_start = $A

Some simple tests using an array of 10,000 completely unsorted values showed a 5.6% shorter execution time if all three fields were placed on ZP, with v_plus_1 being a little more important than the others.
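v_plus_1 is read on every pass through the innermost comparison, which is a plausible reason for it mattering most. As far as I can tell from the source, the test arr[j] > v is carried out as arr[j] >= v + 1, because a 16-bit CPX/SBC compare leaves the carry flag set exactly when the first operand is greater than or equal to the second. The JavaScript sketch below mimics that. The function names are made up for this example, and it is a model of the idea rather than an emulation of the 6502.

```javascript
// Model of the carry flag after a 16-bit unsigned compare done as
// CPX on the low bytes followed by SBC on the high bytes.
// Returns true (carry set) when a >= b.
function carryAfterCompare16(a, b) {
  const lowBorrow = (a & 0xff) < (b & 0xff) ? 1 : 0; // CPX clears carry on borrow
  return (a >> 8) - (b >> 8) - lowBorrow >= 0;       // SBC on the high bytes
}

// arr[j] > v, tested as arr[j] >= v + 1.
function isGreater(a, v) {
  if (v === 0xffff) return false; // v + 1 would overflow; nothing can be greater
  return carryAfterCompare16(a, v + 1);
}

console.log(isGreater(6, 5), isGreater(5, 5), isGreater(0x0100, 0x00ff));
```

The v = $ffff case corresponds to the `bcs i_loop` branch in the source, which skips the j-loop entirely.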

To go even further, placing these 2-byte fields on zero-page will provide a small improvement:

    v
    i
    arr_end

(This paragraph was added by litwr.) It is possible to speed up this sort by 15-25%. The only change required is to the *h_high* and *h_low* tables. For example,

    h_low  .byte <2, <8, <20, <46, <114, <264, <602, <1402, <3500, <9518, <25846
    h_high .byte >2, >8, >20, >46, >114, >264, >602, >1402, >3500, >9518, >25846

will do the trick. In ACME syntax, as used in the source above, the directive is written !byte.
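The alternative table stores byte offsets, so halving each entry gives the gaps in elements. The JavaScript sketch below does that and runs a Shell Sort with those gaps. It is only meant for experimenting with gap tables on a desktop machine and says nothing about 6502 cycle counts. The names `altByteGaps`, `altGaps` and `shellSortAltGaps` are made up for this example.

```javascript
// Byte offsets from the alternative table; each is 2 * the element gap.
const altByteGaps = [2, 8, 20, 46, 114, 264, 602, 1402, 3500, 9518, 25846];
const altGaps = altByteGaps.map((b) => b / 2);

// Shell Sort over plain numbers using the alternative gaps.
function shellSortAltGaps(arr) {
  // Work from the largest gap down, skipping gaps that do not fit the array.
  for (let g = altGaps.length - 1; g >= 0; g--) {
    const h = altGaps[g];
    if (h >= arr.length) continue;
    for (let i = h; i < arr.length; i++) {
      const v = arr[i];
      let j = i - h;
      while (j >= 0 && arr[j] > v) {
        arr[j + h] = arr[j];
        j -= h;
      }
      arr[j + h] = v;
    }
  }
  return arr;
}

console.log(altGaps);
console.log(shellSortAltGaps([500, 3, 65535, 0, 42, 7, 1024]));
```

Any table can be tried this way as long as its first entry corresponds to a gap of 1, since the final pass has to be a plain Insertion Sort.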