The following is the verion of my essay on source vs. object code that was submitted to the Court in the New York DVD trial on July 25, 2000. This is still only a rough draft. It needs more work, but we ran out of time. A number of people contributed information and insightful comments that improved this draft or helped to fine tune my testimony. I'd like to thank Hal Abelson, Andrew Appel, Mark Fuhs, and Scott Goehring for their assistance. -- Dave Touretzky ================================================================ Source vs. Object Code: A False Dichotomy David S. Touretzky Computer Science Department Carnegie Mellon University *** DRAFT VERSION OF JULY 12, 2000 *** The notion of source and object code as disjoint classes of computer code is a false dichotomy common among non-programmers. The general public's understanding is that computer programs are written as "source code" that is human readable and not immediately executable by the machine. Source code is supposed to contain meaningful variable names and helpful comments that are intended only to be read by humans. A piece of software called a "compiler" must convert source code to object code before the program described in the source code can be executed. Object code cannot be read by people; it is a sequence of bytes that encode specific machine instructions that will be executed by the microprocessor when it runs (executes) the program. The above statements are not exactly false; they are in fact the usual way of explaining to laymen how a computer works. However, any attempt to draw legal distinctions between "source" code and "object" code, or between code that is executable and code that is not, will immediately encounter difficulties, because these dichotomies are only a convenient fiction. Computer scientists don't view things this way at all. Here are some of the problems: 1. "Source" and "object" are not well-defined classes of code. They are actually relative terms. Given a device for transforming programs from one form to another, source code is what goes into the device, and object code (or "target" code) is what comes out. The target code of one device is frequently the source code of another. For example, both early C++ compilers and the Kyoto Common Lisp compiler produced C code as their output. C++ (or Common Lisp) was the source language; C was the target language. This output was then run through a C compiler to produce symbolic assembler code as output. This then had to be run through yet another proram, the "assembler", to produce binary machine code that could be directly executed by a processor. In summary, programs typically go through a series of transformations from higher level to lower level languages. The assembler language code that a C compiler produces as "object" code is source code for the assembler. Today, the GNU C compiler (known as gcc) will deliver assembler language code instead of binary if the user requests it by specifying the -S switch. 2. Even binary machine code is perfectly readable by humans. It was, after all, designed by humans. It may be tedious to read, but this can be helped somewhat by using a program called a disassembler to translate the raw binary instructions back into symbolic form. For example, the Pentium instruction "add 7 to register AL" is written 0000010000000111 in the machine's binary language; it is written in symbolic form as "ADD AL,7". (Reference: Intel Architecture Developer's Manual, 1997 edition, volume 2, page 3-17.) Converting between these two forms is trivial. The instant litigation seems to be a result of a person having read and understood a piece of machine code. A document signed by 1) a Norwegian minor, Jon Johansen, 2) an "anonymous German cracker" from the group MoRE (Masters of Reverse Engineering), who is designated the group's "top coder/disassembler", and 3) an individual referred to only as "[dEZZY/DoD]", suggests that the original CSS "crack" was performed by disassembling a software DVD player and rewriting the descrambling algorithm in C. The "anonymous C source code" in my Gallery of CSS Descramblers (http://www.cs.cmu.edu/~dst/DeCSS/Gallery) may be the fruit of that effort. 3. Human-readable programs do not have to be compiled in order to be executed. Compilation merely transforms the program into a form that can be executed more efficiently. But a piece of software called an "interpreter" can execute source code directly, without compiling it. Programs normally run more slowly when they are being executed by an interpreter than when they are first compiled into machine instructions that the processor can execute directly. But they do run. And interpreters have some advantages: they are easier to write than compilers, and they are better suited for debugging purposes. An interpreter for the C programming language, called EIC, is available for free on the web at http://www.anarchos.com/eic/. There are other C interpreters as well. Some languages, such as MATLAB, are normally executed in interpreted mode. Today, most laser printers have a computer inside running a PostScript interpreter. When a document prepared in a text editor such as Microsoft Word is to be printed, it is first converted to a Postscript program. The Postscript interpreter inside the printer then executes the program to construct the page image that is sent to the printing engine. Postscript programs are easily human readable; they do not even require a disassembler. Yet they are executable. 4. Binary "machine code" isn't necessarily directly executable by the computer's microprocessor. Languages such as Java and PERL compile to what is called 'byte code", or "virtual machine code". This is binary code for an idealized, imaginary processor that does not correspond to any particular commercial processor architecture, such as the the Pentium, SPARC, or PowerPC architectures. The virtual machine code must then be executed by a piece of software called a "byte code interpreter" that simulates the imaginary processor. The advantage of this approach is that it allows one to quickly produce Java or Perl implementations for new computer architectures, because the same compiler can be used. Only a new byte code interpreter is required. Byte code interpreters are easier to write than compilers. Another approach taken by some implementations is to compile a piece of Java byte code into native machine instructions (e.g., Pentium instructions, SPARC instructions, or PowerPC instructions) when that piece begins to execute. In this scheme, the Java byte code becomes the "source code" and the native mode instructions are the target code. The new Crusoe processor from Transmeta treats Pentium code as virtual machine code. Although it appears to execute Pentium code, internally the chip translates the code into another machine language that is considerably more complex, in order to take better advantage of pipelining and other optimization techniques. Hence, what the layman would call object code is actually the source code for the Crusoe's transformation process. 5. "Binary executable" code sometimes isn't. Consider the DECSS.EXE file that is the subject of the instant litigation. This file contains a program expressed as machine instructions directly executable by a Pentium processor. It also includes "system calls", specific to the Microsoft Windows operating system, to perform input/output operations. If this file were downloaded to a Sun SPARC workstation running Unix, or a PowerPC running the Macintosh operating system, it would not be directly executable. But all is not lost. The SPARC or PowerPC owner could construct a Pentium emulator program to simulate the operation of the Pentium processor. Such programs are commonplace; in fact they are a necessary step in the development of any new processor architecture. He or she would also need to build a small portion of a Windows emulator in order to handle the Windows system calls. At that point, DECSS.EXE could be executed by the emulator program. An alternative approach would be to build a translator program to convert Pentium instructions to SPARC or PowerPC native instructions. Again, this is a well-known technique in computer science and not technically difficult. DECSS.EXE would then be the input, or source code, for the emulator or translator program. 6. Executable binary code has much of the same expressive content as code written in a symbolic language, be it assembly language or a higher level language such as C or Lisp. Appendix A shows a small C program for computing 5 factorial (the product of the integers from 1 to 5). Appendix B shows the assembler language output produced by gcc version 2.95.3 on a Sun UltraSparc 170 running the Solaris operating system. Appendix C is a dump of the binary output file produced by the assembler. (The values are actually displayed in hexadecimal, which is a more compact form of binary.) Appendix D is the result of disassembling the binary file, going back to symbolic assembler language code. Note that line number 20 in Appendix D contains an integer compare instruction, "cmp %o0, 5". The equivalent hexadecimal form is also shown: it is 80a22005. Appendix C shows this same sequence in the binary file, mid-way through the line labeled 0000220. The same instruction, "cmp %o0, 5", also appears in Appendix B, two lines after the label .LL3. And the equivalent in the C code of Appendix A is the sequence "i<6". The fact that the constant is 6 in the C code and 5 in the assembly language code is the sort of thing one can learn only by looking at the assembly language code, or the binary file that results from it. (The reason for the difference, 5 vs. 6, has to do with the particular strategy used by the gcc compiler to implement the FOR loop.) The lessons that should be drawn from the above are: 1) All computer code is human readable. Some forms are simply more convenient to read than others. 2. All computer code is expressive. Many of the ideas expressed in C code are also expressed in the assembly language code that results from compiling that C code, and again in the binary machine language that is the output of the assembler. Some content may be lost, e.g., source code comments are typically not preserved in object code, although variable names may be. But some ideas that are only implicit in the source code may be made more apparent in the object code, such as how a particular sequence of actions should be best expressed in terms of processor operations in order to obtain maximum performance from the machine. 3) All computer code is executable. In some instances it may be advantageous to transform the code into another form first, but transformation is by no means mandatory. An interpreter can be employed instead. Interpreters are in common use in computer systems. 4) "Source" and "object" are relative terms, not absolute categories. 5) The file DECSS.EXE is a particular expression of an algorithm for converting video files from one format to another. It expresses the same algorithm as the C code from which it was compiled. DECSS.EXE is coded in a binary language that is more tedious to read than the C code, but more efficiently executable by a Pentium processor. These are differences in degree only. If C code is protected speech because of its expressive content --- and one can argue that a computer program is nothing but expressive content -- then code written in a binary machine language that expresses the same algorithm should not be regarded any differently. ================================================================ APPENDIX A: C source file fact.c #includevoid main(int argc, char *argv[]) { int i, result; result = 1; for (i = 1; i<6; i++) { result = result * i; } printf ("Result is: %d.\n",result); } ================================================================ APPENDIX B: assembler language file fact.s (produced by: gcc -S fact.c) .file "fact.c" gcc2_compiled.: .global .umul .section ".rodata" .align 8 .LLC0: .asciz "Result is: %d.\n" .section ".text" .align 4 .global main .type main,#function .proc 020 main: !#PROLOGUE# 0 save %sp, -120, %sp !#PROLOGUE# 1 st %i0, [%fp+68] st %i1, [%fp+72] mov 1, %o0 st %o0, [%fp-24] mov 1, %o0 st %o0, [%fp-20] .LL3: ld [%fp-20], %o0 cmp %o0, 5 ble .LL6 nop b .LL4 nop .LL6: ld [%fp-24], %o0 ld [%fp-20], %o1 call .umul, 0 nop st %o0, [%fp-24] .LL5: ld [%fp-20], %o0 add %o0, 1, %o1 st %o1, [%fp-20] b .LL3 nop .LL4: sethi %hi(.LLC0), %o1 or %o1, %lo(.LLC0), %o0 ld [%fp-24], %o1 call printf, 0 nop .LL2: ret restore .LLfe1: .size main,.LLfe1-main .ident "GCC: (GNU) 2.95.2 19991024 (release)" ================================================================ APPENDIX C: binary file fact.o (produced by: gcc -c fact.c; dumped by: od -x fact.o) 0000000 7f45 4c46 0102 0100 0000 0000 0000 0000 0000020 0001 0002 0000 0001 0000 0000 0000 0000 0000040 0000 0234 0000 0000 0034 0000 0000 0028 0000060 0008 0001 002e 7368 7374 7274 6162 002e 0000100 7465 7874 002e 726f 6461 7461 002e 7379 0000120 6d74 6162 002e 7374 7274 6162 002e 7265 0000140 6c61 2e74 6578 7400 2e63 6f6d 6d65 6e74 0000160 0000 0000 9de3 bf88 f027 a044 f227 a048 0000200 9010 2001 d027 bfe8 9010 2001 d027 bfec 0000220 d007 bfec 80a2 2005 0480 0004 0100 0000 0000240 1080 000c 0100 0000 d007 bfe8 d207 bfec 0000260 4000 0000 0100 0000 d027 bfe8 d007 bfec 0000300 9202 2001 d227 bfec 10bf fff2 0100 0000 0000320 1300 0000 9012 6000 d207 bfe8 4000 0000 0000340 0100 0000 81c7 e008 81e8 0000 0000 0000 0000360 5265 7375 6c74 2069 733a 2020 2564 2e0a 0000400 0000 0000 0000 0001 0000 0000 0000 0000 0000420 0400 fff1 0000 0001 0000 0000 0000 0000 0000440 0400 fff1 0000 0000 0000 0000 0000 0000 0000460 0300 0003 0000 0008 0000 0000 0000 0000 0000500 0000 0002 0000 0000 0000 0000 0000 0000 0000520 0300 0002 0000 0017 0000 0000 0000 0000 0000540 1000 0000 0000 001d 0000 0000 0000 0000 0000560 1000 0000 0000 0024 0000 0000 0000 0078 0000600 1200 0002 0066 6163 742e 6300 6763 6332 0000620 5f63 6f6d 7069 6c65 642e 002e 756d 756c 0000640 0070 7269 6e74 6600 6d61 696e 0000 0000 0000660 0000 003c 0000 0507 0000 0000 0000 005c 0000700 0000 0209 0000 0000 0000 0060 0000 020c 0000720 0000 0000 0000 0068 0000 0607 0000 0000 0000740 0061 733a 2057 6f72 6b53 686f 7020 436f 0000760 6d70 696c 6572 7320 342e 3220 6465 7620 0001000 3133 204d 6179 2031 3939 360a 0047 4343 0001020 3a20 2847 4e55 2920 322e 3935 2e32 2031 0001040 3939 3931 3032 3420 2872 656c 6561 7365 0001060 2900 0000 0000 0000 0000 0000 0000 0000 0001100 0000 0000 0000 0000 0000 0000 0000 0000 0001120 0000 0000 0000 0000 0000 0000 0000 0001 0001140 0000 0003 0000 0000 0000 0000 0000 0034 0001160 0000 003d 0000 0000 0000 0000 0000 0001 0001200 0000 0000 0000 000b 0000 0001 0000 0006 0001220 0000 0000 0000 0074 0000 0078 0000 0000 0001240 0000 0000 0000 0004 0000 0000 0000 0011 0001260 0000 0001 0000 0002 0000 0000 0000 00f0 0001300 0000 0011 0000 0000 0000 0000 0000 0008 0001320 0000 0000 0000 0019 0000 0002 0000 0002 0001340 0000 0000 0000 0104 0000 0080 0000 0005 0001360 0000 0005 0000 0004 0000 0010 0000 0021 0001400 0000 0003 0000 0002 0000 0000 0000 0184 0001420 0000 0029 0000 0000 0000 0000 0000 0001 0001440 0000 0000 0000 0029 0000 0004 0000 0002 0001460 0000 0000 0000 01b0 0000 0030 0000 0004 0001500 0000 0002 0000 0004 0000 000c 0000 0034 0001520 0000 0001 0000 0000 0000 0000 0000 01e0 0001540 0000 0052 0000 0000 0000 0000 0000 0001 0001560 0000 0000 0001564 ================================================================ APPENDIX D: disassembly of fact.o (produced by: dis fact.o) **** DISASSEMBLER **** section .text main() 0: 9d e3 bf 88 save %sp, -120, %sp 4: f0 27 a0 44 st %i0, [%fp + 68] 8: f2 27 a0 48 st %i1, [%fp + 72] c: 90 10 20 01 mov 1, %o0 10: d0 27 bf e8 st %o0, [%fp - 24] 14: 90 10 20 01 mov 1, %o0 18: d0 27 bf ec st %o0, [%fp - 20] 1c: d0 07 bf ec ld [%fp - 20], %o0 20: 80 a2 20 05 cmp %o0, 5 24: 04 80 00 04 ble 0x34 28: 01 00 00 00 nop 2c: 10 80 00 0c ba 0x5c 30: 01 00 00 00 nop 34: d0 07 bf e8 ld [%fp - 24], %o0 38: d2 07 bf ec ld [%fp - 20], %o1 3c: 40 00 00 00 call 0x3c 40: 01 00 00 00 nop 44: d0 27 bf e8 st %o0, [%fp - 24] 48: d0 07 bf ec ld [%fp - 20], %o0 4c: 92 02 20 01 add %o0, 1, %o1 50: d2 27 bf ec st %o1, [%fp - 20] 54: 10 bf ff f2 ba 0x1c 58: 01 00 00 00 nop 5c: 13 00 00 00 sethi %hi(gcc2_compiled.), %o1 60: 90 12 60 00 or %o1, gcc2_compiled., %o0 ! gcc2_compiled. 64: d2 07 bf e8 ld [%fp - 24], %o1 68: 40 00 00 00 call 0x68 6c: 01 00 00 00 nop 70: 81 c7 e0 08 ret 74: 81 e8 00 00 restore