Short answer:
The fundamental reading list for a lisp interpreter is SICP. I would not at all call it overkill, if you feel you are overqualified for the first parts of the book jump to chapter 4 and start interpreting away (although I feel this would be a loss since chapters 1-3 really are that good!).
Add LISP in Small Pieces (LISP from now on), chapters 1-3. Especially chapter 3 if you need to implement any non-trivial control forms.
See this post by Jens Axel Søgaard on a minimal self-hosting Scheme: http://www.scheme.dk/blog/2006/12/self-evaluating-evaluator.html .
A slightly longer answer:
It is hard to give advice without knowing what you require from your interpreter.
- does it really really need to be an interpreter, or do you actually need to be able to execute lisp code?
- does it need to be fast?
- does it need standards compliance? Common Lisp? R5RS? R6RS? Any SFRIs you need?
If you need anything more fancy than a simple syntax tree walker I would strongly recommend embedding a fast scheme subsystem. Gambit scheme comes to mind: http://dynamo.iro.umontreal.ca/~gambit/wiki/index.php/Main_Page .
If that is not an option chapter 5 in SICP and chapters 5-- in LISP target compilation for faster execution.
For faster interpretation I would take a look at the most recent JavaScript interpreters/compilers. There seem to be a lot of thought going into fast JavaScript execution, and you can probably learn from them. V8 cites two important papers: http://code.google.com/apis/v8/design.html and squirrelfish cites a couple: http://webkit.org/blog/189/announcing-squirrelfish/ .
There is also the canonical scheme papers: http://library.readscheme.org/page1.html for the RABBIT compiler.
If I engage in a bit of premature speculation, memory management might be the tough nut to crack. Nils M Holm has published a book "Scheme 9 from empty space" http://www.t3x.org/s9fes/ which includes a simple stop-the-world mark and sweep garbage collector. Source included.
John Rose (of newer JVM fame) has written a paper on integrating Scheme to C: http://library.readscheme.org/servlets/cite.ss?pattern=AcmDL-Ros-92 .
Matt's explanation is perfectly fine -- and he takes a shot at a comparison to C and Java, which I won't do -- but for some reason I really enjoy discussing this very topic once in a while, so -- here's my shot at an answer.
On points (3) and (4):
Points (3) and (4) on your list seem the most interesting and still relevant now.
To understand them, it is useful to have a clear picture of what happens with Lisp code -- in the form of a stream of characters typed in by the programmer -- on its way to being executed. Let's use a concrete example:
;; a library import for completeness,
;; we won't concern ourselves with it
(require '[clojure.contrib.string :as str])
;; this is the interesting bit:
(println (str/replace-re #"\d+" "FOO" "a123b4c56"))
This snippet of Clojure code prints out aFOObFOOcFOO
. Note that Clojure arguably does not fully satisfy the fourth point on your list, since read-time is not really open to user code; I will discuss what it would mean for this to be otherwise, though.
So, suppose we've got this code in a file somewhere and we ask Clojure to execute it. Also, let's assume (for the sake of simplicity) that we've made it past the library import. The interesting bit starts at (println
and ends at the )
far to the right. This is lexed / parsed as one would expect, but already an important point arises: the result is not some special compiler-specific AST representation -- it's just a regular Clojure / Lisp data structure, namely a nested list containing a bunch of symbols, strings and -- in this case -- a single compiled regex pattern object corresponding to the #"\d+"
literal (more on this below). Some Lisps add their own little twists to this process, but Paul Graham was mostly referring to Common Lisp. On the points relevant to your question, Clojure is similar to CL.
The whole language at compile time:
After this point, all the compiler deals with (this would also be true for a Lisp interpreter; Clojure code happens always to be compiled) is Lisp data structures which Lisp programmers are used to manipulating. At this point a wonderful possibility becomes apparent: why not allow Lisp programmers to write Lisp functions which manipulate Lisp data representing Lisp programmes and output transformed data representing transformed programmes, to be used in place of the originals? In other words -- why not allow Lisp programmers to register their functions as compiler plugins of sorts, called macros in Lisp? And indeed any decent Lisp system has this capacity.
So, macros are regular Lisp functions operating on the programme's representation at compile time, before the final compilation phase when actual object code is emitted. Since there are no limits on the kinds of code macros are allowed to run (in particular, the code which they run is often itself written with liberal use of the macro facility), one can say that "the whole language is available at compile time".
The whole language at read time:
Let's go back to that #"\d+"
regex literal. As mentioned above, this gets transformed to an actual compiled pattern object at read time, before the compiler hears the first mention of new code being prepared for compilation. How does this happen?
Well, the way Clojure is currently implemented, the picture is somewhat different than what Paul Graham had in mind, although anything is possible with a clever hack. In Common Lisp, the story would be slightly cleaner conceptually. The basics are however similar: the Lisp Reader is a state machine which, in addition to performing state transitions and eventually declaring whether it has reached an "accepting state", spits out Lisp data structures the characters represent. Thus the characters 123
become the number 123
etc. The important point comes now: this state machine can be modified by user code. (As noted earlier, that's entirely true in CL's case; for Clojure, a hack (discouraged & not used in practice) is required. But I digress, it's PG's article I'm supposed to be elaborating on, so...)
So, if you're a Common Lisp programmer and you happen to like the idea of Clojure-style vector literals, you can just plug into the reader a function to react appropriately to some character sequence -- [
or #[
possibly -- and treat it as the start of a vector literal ending at the matching ]
. Such a function is called a reader macro and just like a regular macro, it can execute any sort of Lisp code, including code which has itself been written with funky notation enabled by previously registered reader macros. So there's the whole language at read time for you.
Wrapping it up:
Actually, what has been demonstrated thus far is that one can run regular Lisp functions at read time or compile time; the one step one needs to take from here to understanding how reading and compiling are themselves possible at read, compile or run time is to realise that reading and compiling are themselves performed by Lisp functions. You can just call read
or eval
at any time to read in Lisp data from character streams or compile & execute Lisp code, respectively. That's the whole language right there, all the time.
Note how the fact that Lisp satisfies point (3) from your list is essential to the way in which it manages to satisfy point (4) -- the particular flavour of macros provided by Lisp heavily relies on code being represented by regular Lisp data, which is something enabled by (3). Incidentally, only the "tree-ish" aspect of the code is really crucial here -- you could conceivably have a Lisp written using XML.
Best Answer
'Homoiconic' is kind of a vague construct. 'code is data' is a bit clearer.
Anyway, the first sentence on Wikipedia for Homoiconic is not that bad. It says that the language has to have a source representation using its data structures. If we forget 'strings' as source representation (that's trivial and not that helpful to have a useful concept 'homoiconic'), then Lisp has lists, symbols, numbers, strings etc. which are used to represent the source code. The interface of the EVAL function determines what kind of source representation the language is working on. In this case, Lisp, it is not strings. EVAL expects the usual variety of data structures and the evaluation rules of Lisp determine that a string evaluates to itself (and thus will not be interpreted as a program expression, but just string data). A number also evaluates to itself. A list (sin 3.0) is a list of a symbol and a number. The evaluation rules say that this list with a symbol denoting a function as the first object will be evaluated as a function application. There are a few evaluation rules like this for data, special operators, macro applications and function applications. That's it.
To make it clear: in Lisp the function EVAL is defined over Lisp data structures. It expects a data structure, evaluates it according to its evaluation rules and returns a result - again using its data structures.
This matches the definition of homoiconic: source code has a native representation using the data types of Lisp.
Now, the interesting part is this: it does not matter how EVAL is implemented. All that matters is that it accepts the source code using the Lisp data structures, that it executes the code and that it returns a result.
So it is perfectly legal that EVAL uses a compiler.
That's how several Lisp system work, some don't even have an Interpreter.
So, 'Homoiconic' says that the SOURCE code has a data representation. It does NOT say that at runtime this source code has to be interpreted or that the execution is based on this source code.
If the code is compiled, neither the compiler nor an interpreter is needed at runtime. Those would only be needed if the program wants to eval or compile code at runtime - something that is often not needed.
Lisp also provides a primitive function READ, which translates an external representation (S-Expressions) of data into an internal representation of data (Lisp data). Thus it also can be used to translate an external representation of source code into an internal representation of source code. Lisp does not use a special parser for source code - since code is data, there is only READ.