21 November 2010

Handling source in EMSIP

EMSIP source parser

I split this post in two. I was originally going to talk about EMSIP as a whole, which I do here. But right now I want to talk about one of those little parsers, the source code parser, which is designed to handle the main flow of source code.


  1. Simplicity
  2. Follow EMSIP's framework and spirit
  3. Where there exists a compelling "write it without thinking" expression for something, allow it.
    • Exception: infix syntax for algebra, which would conflict with (2)
    • For instance, allow:
      • quoting of expressions
      • distinguishing numbers from symbols just because of how they look.
      • prefixing numbers with minus sign, eg -6.
    • I emphasize that this rationale is limited. If you would have to think even for a moment about how to write an expression, it doesn't fall under this clause. This is only for expressions that people tend to automatically write before thinking. It's not meant for tending-towards-Perl shortcuts.

The objects to express

As I mentioned, essentially all the complexity of source code would be pushed out to little parsers:

  • comments
  • strings
  • radix numbers such as hex numbers
  • complex numbers
  • rationals
  • bytevectors
  • vectors
  • character escapes
  • (etc)

And obviously they're each pretty simple.

Relieved of the burden of representing every strange object, I think we can get by with just a few simple things:

  • decimal numbers, essentially as traditional.
    • Introduced by a digit, or -, + followed by a digit.
    • Continued by a digits, letters, and some punctuation.
    • We'd support just decimal integers and inexact decimal floats. Not much more - maybe exact floats. No attempt to represent complex numbers, rationals, numbers of various precision, etc. Those are to be supported by little parsers.
  • symbols, essentially as traditional.
    • Introduced by a letter. Also by -, +, $, or -> followed by a letter. (We'll extend this group lower down)
    • Continued by a digits, letters, and some punctuation.
  • Punctuation groups
    • Introduced by quote, even in non-whitespace states

      For regularity, I'll say that - or + followed by punctuation can start a punctuation group.

    • Continued by other punctuation. Not continued by quote; quote starts another punctuation group.
    • Purpose: "funnier" manipulations of source, such as quoting the next sexp or dotting a list.
  • Whitespace

So the first character after whitespace, a bracket, or punctuation controls the type, with a few exceptions which must consider their next character.

In table form:

typeBegun byContinued by
numberusually by digitalmost anything
symbolusually by letteralmost anything
punctuationjust quotejust the leftovers

The classes of character

That gives us these classes of character:

(not brackets)
brackets are accounted for by EMSIP main.
a-z, A-Z
strong punctuation
just single-quote '
weak constituents
  • Includes definitely +, -, >, <, $, and probably all the other punctuation.
(illegal characters)
characters that just shouldn't appear.

The parse states

No we have just these major parse states:

Entered by weak constituents.

This state can be terminated by anything except weak constituents.

Becomes part of something else depending on how it's terminated.

  • By digit: becomes part of a number
  • By letter: becomes part of a symbol
  • By strong punctuation: becomes part of a punctuation group
  • By whitespace or bracket: makes a symbol of just itself.
Entered by whitespace.

This state can be interrupted by anything except whitespace.

Makes a whitespace object whose tail transformer obliterates it.

punctuation group
Entered by strong punctuation.

Makes a punctuation group.

This state can be terminated by anything except weak constituents.

This is two similar states, number and symbol.

Entered by digit or letter

This state can be interrupted only by whitespace, strong punctuation, or closing bracket.

Makes a number or symbol

In table form:

StateMakesEntered byInterruptable by
Uncommitted(Depends)Weak constituentsAll but weak cons
WhitespacewhitespacewhitespaceAll but whitespace
PunctuationPunctuation groupStrong punctuationAll but weak cons
Committednumber or symboldigit or letterWhitespace, strong cons

suitable SREs

These SREs1 assume the classes of character as primitives. They assume longest-match parsing so `uncommitted' will be part of a larger construct when possible.

(* weak-constituent)
(* whitespace)
(seq strong-punctuation (* weak-constituent))
(seq digit (* (~ strong-punctuation whitespace)))
(seq letter (* (~ strong-punctuation whitespace)))
(seq (? uncommitted) committed-number)
(seq (? uncommitted) committed-symbol)
(seq (? uncommitted) punctuation-state)
(* (or whitespace number symbol punctuation uncommitted))
(* thing)

Operating on promised tails

One thing about EMSIP parsers is that they include transformers that operate on a promise of their tail. This lets us do some important things neatly, such as skip comments and splice character constants into strings. It also implies that read gives us a stream of objects, rather than returning one object and being called repeatedly. We'd use essentially this mechanism for some business in the source parser.

Whitespace objects use this mechanism to remove themselves. Their tail transformers just return the tail.

Punctuation groups will also operate on promises of their tails. This lets them handle all the "funny business" that traditional sexp syntax does with quote and its friends.

Numbers and symbols don't do anything special, they just cons themselves onto the tail.

And that's it!

And that's all we have to do to get a reasonable source parser under EMSIP.


1 For more on SREs, see SRE or a very similar idea, Irregexes

No comments:

Post a Comment