When I say I partake in the occasional bit of golf, I don't mean the sport involving balls and clubs. I've never tried that, but I have gone in for code golf in the past; the object of code golf is similar in that you're aiming for a low number. For code golf, the aim is to achieve the required functionality or implement the algorithm with the smallest code.
Explorations of the smallest possible thing in a particular area aren't unprecedented here: the first article on this site is The Smallest Nintendo DS ROM, from 2006. So when I heard about the Binary Golf Grand Prix I was left with no choice but to put in an attempt. Unfortunately, I learned of the existence of BGGP #4 around two hours before it closed for submissions, so I ended up handing in a scrap of PHP. But let's look at what could've been...
The Task at Hand
This year's problem is specified as follows:
Create the smallest self-replicating file.
A valid submission will:
- Produce exactly 1 copy of itself
- Name the copy "4"
- Not execute the copied file
- Print, return, or display the number 4
The most natural thought that arises, unbidden, is how one would go about this on the Commodore 64. Commodore BASIC has the built-in SAVE
command, so our submission might be as simple as:
C64 BASIC program that saves itself to disk
10 SAVE "4", 8 20 PRINT 4
As the task here is to generate the smallest file, we should look at how this program is stored in memory (and thus, on disk). Commodore BASIC stores programs almost entirely as written, and the code execution process includes a parsing step which tokenises the program before it's executed. So the above program looks like this if we use VICE's memory monitor:
The above BASIC program, in memory
(C:$e5cf) m 0800 >C:0800 00 0e 08 0a 0094 20 22 34 22 2c 20 38 0016 08 ...... "4", 8... >C:0810 14 0099 20 34 0000 00 00 00 ff ff ff ff 00 00 ... 4.....????..
BASIC program storage and tokenisation
We can see the program starts at address $0801
hexadecimal: this is the case for every BASIC program on the C64, whether typed in or loaded. Each line starts with a pointer to the next line, allowing the BASIC interpreter to rapidly work through the program if it needs to determine the program length or find a particular line to change its content.
After the pointer to the next line comes the line number: two bytes, in little-endian format (small byte first), which is standard for the 6502-derived processor inside the C64. So we see the first line is numbered $000a
, 10. And after this comes the line content itself: 94 20 22 34 22 2c 20 38 00
.
As a program is typed into the BASIC interpreter, it's tokenised: any keywords in the line get replaced by token values before being stored in memory. We can see in this line that SAVE
has been replaced by command token $94
, such that if this value comes at the start of a command the interpreter knows this is a SAVE
command without needing to read and parse the individual letters of the word "SAVE".
The rest of the command is not tokenised, and responsibility for parsing this arguments string is passed to whichever routine in the BASIC interpreter is handling the command. In this case, the rest of the command is ASCII text: <space>"4",<space>8
; note that the spaces are preserved as they were entered in the program. The arguments are terminated by a zero byte.
The second line is similarly structured: a $99
token representing the PRINT
command, and an arguments string of 20 34
, representing <space>4
.
Golfing in C64 BASIC
Now we've seen how the BASIC program is stored internally, we have direction for a round of golf. The first step is to remove the spaces:
Golf, first hole: Removing spaces
10SAVE"4",8 20PRINT4
One might be tempted to remove the quotes around the filename, relying on whatever type coercion may exist in the BASIC interpreter to convert 4
into a string. Unfortunately, this isn't a thing on the C64:
Golf, bogey on the second hole
10SAVE4,8 20PRINT4 RUN ?TYPE MISMATCH ERROR IN 10
Commodore BASIC does, however, support multiple commands on one line, with the colon separator. Thus, we can combine the two commands of our program and remove the four bytes associated with the second line pointer and line number:
Golf, third hole: Combining lines
10SAVE"4",8:PRINT4
This program is stored in memory as below:
Fully golfed out BASIC program
(C: $e5cf) m 0800 >C:0800 00 0f 08 0a 0094 22 34 22 2c 383a99 34 0000 ......"4",8:.4..
We see that the two commands have been stored as one line, so there's one next-line pointer (which refers to the byte after the end of the program); it also becomes plain that a zero byte isn't the only thing that can end an arguments string in C64 BASIC. The colon, byte $3a
, can also act as an end-of-arguments delimiter, denoting the end of a command and the start of another.
Our final result in Commodore BASIC is 14 bytes. Is it possible to produce a smaller program by switching to machine language, and communicating with the disk drive through Commodore DOS directly?
Saving files in Commodore DOS
One advantage of using such a venerable computer as the C64 for this task is that the documentation is extensive and copious. There are eleven separate commentaries on the SAVE
routine available in the C64 Kernal API reference, including detailed explanation of the internal workings; for our purposes, we'll use the example routine provided by Commodore's own Programmer's Reference, which gives a sequence of operations for using SAVE
:
- Use the SETLFS routine and the SETNAM routine (unless a SAVE with no file name is desired on "a save to the tape recorder"),
- Load two consecutive locations on page 0 with a pointer to the start of your save (in standard 6502 low byte first, high byte next format).
- Load the accumulator with the single byte page zero offset to the pointer.
- Load the X and Y registers with the low byte and high byte re- spectively of the location of the end of the save.
- Call this routine.
In this case, we'll be saving to disk (device 8), so we'll want to call both SETLFS
and SETNAM
before running through the save. There are two things we'll need to note for the save itself:
- Load location
-
As mentioned above, when a BASIC program is loaded from disk, it always loads to the same location (the start of BASIC memory,
$0801
hexadecimal); a machine-language program doesn't have this restriction, and can be loaded into memory anywhere. To facilitate this, the first two bytes of the program as stored on disk are actually the address to which it should be loaded. Accordingly, when saving the program this must be accounted for by setting up the load address in memory beforehand (in little-endian format, as the Programmer's Reference mentions above).We also have the extra stipulation that this program needs to be self-replicating: if the file saved by this code is reloaded, it will need to behave the same way. To that end, the load address for the program will need to be written by the program itself, to a location just before the program, and the concatenated block saved together.
- Program size
BASIC keeps track of how long the entered program is, so it can determine how much memory needs to be saved to disk; our machine-language program will have to set up the amount of data to save without this help being available.
At this point, it's already becoming obvious that our machine-language attempt will be longer than 14 bytes to perform the same task as the BASIC program above, but to quote from a certain classic game show:
I've started, so I'll finish...
C64 machine language program that saves itself to disk
FILENAME = $002A; An unused byte in zero pagePROGPTR = $009B; An unused block of two bytesPROGSTART = $C0C0; Program runs from hereMEMSTART = $C0BE; SAVE starts from hereprocessor 6502 org PROGSTART start: lda #1; Logical file 1ldx #8; Disk drive 0 (device 8)ldy #255; No secondary commandjsr $FFBA; SETLFSlda #'4; Our filename: "4"sta FILENAME; Store the namelda #1; It's one byte longldx #<FILENAME; And it's stored in zero pageldy #>FILENAME; at this addressjsr $FFBD; SETNAMlda #$c0; Our load location ($C0C0)sta MEMSTART; Written to the two bytessta MEMSTART + 1; before the program in memorylda #<MEMSTART; Set up our indirect pointersta PROGPTR; from the start of the filelda #>MEMSTART; to our two-byte pointer storesta PROGPTR + 1; in zero pagelda #<PROGPTR; Pointer to the start of fileldx #<end; Actual location of the end of fileldy #>end; (including the two-byte header)jsr $FFD8; SAVElda #4; We need to print or return 4rts; So let's return 4end:
Conclusion and caveats
The above machine-language program clocks in at 50 bytes, almost four times as large as the equivalent BASIC routine. There's some scope for reducing that number, but we don't have a hope of reaching the 14 bytes that a higher-level representation affords us. It should also be noted that we're not attempting to print the number 4 as per the task description, as this would require another call out to the Kernal and/or recycling of BASIC routines to do the same.
There is a level below this, of course, which we haven't reached today: direct communication with the disk drive to push the program to disk, bypassing Commodore DOS (at least at the computer end). The Kernal API is an abstraction on top of this, meaning we don't have to worry about the disk drive's serial bus and other vagaries of the implementation.
This little game of code golf has allowed us a look into why higher-level languages exist at all: not only do they allow us to perform tasks such as "save a file to disk" with less code, they also abstract away complications such as setting the filename before writing a file. Perhaps BASIC shouldn't be maligned quite so much.
Thanks go to the Online 6502 Disassembler by Norbert Landsteiner, and the dASM assembler whose name always makes me think it's a disassembler.