EMSIP source parser
I split this post in two. I was originally going to talk about EMSIP as a whole, which I do here. But right now I want to talk about one of those little parsers, the source code parser, which is designed to handle the main flow of source code.
Objectives
- Simplicity
- Follow EMSIP's framework and spirit
-
Where there exists a compelling "write it without thinking"
expression for something, allow it.
- Exception: infix syntax for algebra, which would conflict with (2)
-
For instance, allow:
- quoting of expressions
- distinguishing numbers from symbols just because of how they look.
- prefixing numbers with minus sign, eg -6.
- I emphasize that this rationale is limited. If you would have to think even for a moment about how to write an expression, it doesn't fall under this clause. This is only for expressions that people tend to automatically write before thinking. It's not meant for tending-towards-Perl shortcuts.
The objects to express
As I mentioned, essentially all the complexity of source code would be pushed out to little parsers:
- comments
- strings
- radix numbers such as hex numbers
- complex numbers
- rationals
- bytevectors
- vectors
- character escapes
- (etc)
And obviously they're each pretty simple.
Relieved of the burden of representing every strange object, I think we can get by with just a few simple things:
-
decimal numbers, essentially as traditional.
- Introduced by a digit, or -, + followed by a digit.
- Continued by a digits, letters, and some punctuation.
- We'd support just decimal integers and inexact decimal floats. Not much more - maybe exact floats. No attempt to represent complex numbers, rationals, numbers of various precision, etc. Those are to be supported by little parsers.
-
symbols, essentially as traditional.
- Introduced by a letter. Also by -, +, $, or -> followed by a letter. (We'll extend this group lower down)
- Continued by a digits, letters, and some punctuation.
-
Punctuation groups
-
Introduced by quote, even in non-whitespace states
For regularity, I'll say that - or + followed by punctuation can start a punctuation group.
- Continued by other punctuation. Not continued by quote; quote starts another punctuation group.
- Purpose: "funnier" manipulations of source, such as quoting the next sexp or dotting a list.
-
Introduced by quote, even in non-whitespace states
- Whitespace
So the first character after whitespace, a bracket, or punctuation controls the type, with a few exceptions which must consider their next character.
In table form:
type | Begun by | Continued by |
---|---|---|
number | usually by digit | almost anything |
symbol | usually by letter | almost anything |
punctuation | just quote | just the leftovers |
whitespace | whitespace | whitespace |
The classes of character
That gives us these classes of character:
- (not brackets)
- brackets are accounted for by EMSIP main.
- letters
- a-z, A-Z
- digits
- 0-9
- whitespace
- whitespace
- strong punctuation
- just single-quote '
- weak constituents
-
- Includes definitely +, -, >, <, $, and probably all the other punctuation.
- (illegal characters)
- characters that just shouldn't appear.
The parse states
No we have just these major parse states:
- uncommitted
-
Entered by weak constituents.
This state can be terminated by anything except weak constituents.
Becomes part of something else depending on how it's terminated.
- By digit: becomes part of a number
- By letter: becomes part of a symbol
- By strong punctuation: becomes part of a punctuation group
- By whitespace or bracket: makes a symbol of just itself.
- whitespace
-
Entered by whitespace.
This state can be interrupted by anything except whitespace.
Makes a whitespace object whose tail transformer obliterates it.
- punctuation group
-
Entered by strong punctuation.
Makes a punctuation group.
This state can be terminated by anything except weak constituents.
- committed
-
This is two similar states, number and symbol.
Entered by digit or letter
This state can be interrupted only by whitespace, strong punctuation, or closing bracket.
Makes a number or symbol
In table form:
State | Makes | Entered by | Interruptable by |
---|---|---|---|
Uncommitted | (Depends) | Weak constituents | All but weak cons |
Whitespace | whitespace | whitespace | All but whitespace |
Punctuation | Punctuation group | Strong punctuation | All but weak cons |
Committed | number or symbol | digit or letter | Whitespace, strong cons |
suitable SREs
These SREs1 assume the classes of character as primitives. They assume longest-match parsing so `uncommitted' will be part of a larger construct when possible.
- uncommitted
-
(* weak-constituent)
- whitespace
-
(* whitespace)
- punctuation-state
-
(seq strong-punctuation (* weak-constituent))
- committed-number
-
(seq digit (* (~ strong-punctuation whitespace)))
- committed-symbol
-
(seq letter (* (~ strong-punctuation whitespace)))
- number
-
(seq (? uncommitted) committed-number)
- symbol
-
(seq (? uncommitted) committed-symbol)
- punctuation
-
(seq (? uncommitted) punctuation-state)
- thing
-
(* (or whitespace number symbol punctuation uncommitted))
- (overall)
-
(* thing)
Operating on promised tails
One thing about EMSIP parsers is that they include transformers that operate on a promise of their tail. This lets us do some important things neatly, such as skip comments and splice character constants into strings. It also implies that read gives us a stream of objects, rather than returning one object and being called repeatedly. We'd use essentially this mechanism for some business in the source parser.
Whitespace objects use this mechanism to remove themselves. Their tail transformers just return the tail.
Punctuation groups will also operate on promises of their tails. This lets them handle all the "funny business" that traditional sexp syntax does with quote and its friends.
Numbers and symbols don't do anything special, they just cons themselves onto the tail.
And that's it!
And that's all we have to do to get a reasonable source parser under EMSIP.
No comments:
Post a Comment