Tehom: Handling source in EMSIP

EMSIP source parser

I split this post in two. I was originally going to talk about EMSIP as a whole, which I do here. But right now I want to talk about one of those little parsers, the source code parser, which is designed to handle the main flow of source code.

Objectives

1. Simplicity
2. Follow EMSIP's framework and spirit
3. Where there exists a compelling "write it without thinking" expression for something, allow it.
- Exception: infix syntax for algebra, which would conflict with (2)
- For instance, allow:
  - quoting of expressions
  - distinguishing numbers from symbols just because of how they look.
  - prefixing numbers with a minus sign, e.g. -6.
- I emphasize that this rationale is limited. If you would have to think even for a moment about how to write an expression, it doesn't fall under this clause. This is only for expressions that people tend to automatically write before thinking. It's not meant for tending-towards-Perl shortcuts.

The objects to express

As I mentioned, essentially all the complexity of source code would be pushed out to little parsers:

comments
strings
radix numbers such as hex numbers
complex numbers
rationals
bytevectors
vectors
character escapes
(etc)

And obviously they're each pretty simple.

Relieved of the burden of representing every strange object, I think we can get by with just a few simple things:

decimal numbers, essentially as traditional.
- Introduced by a digit, or -, + followed by a digit.
- Continued by digits, letters, and some punctuation.
- We'd support just decimal integers and inexact decimal floats. Not much more - maybe exact floats. No attempt to represent complex numbers, rationals, numbers of various precision, etc. Those are to be supported by little parsers.
symbols, essentially as traditional.
- Introduced by a letter. Also by -, +, $, or -> followed by a letter. (We'll extend this group lower down)
- Continued by digits, letters, and some punctuation.
Punctuation groups
- Introduced by quote, even in non-whitespace states
  For regularity, I'll say that - or + followed by punctuation can start a punctuation group.
- Continued by other punctuation. Not continued by quote; quote starts another punctuation group.
- Purpose: "funnier" manipulations of source, such as quoting the next sexp or dotting a list.
Whitespace

So the first character after whitespace, a bracket, or punctuation controls the type, with a few exceptions which must consider their next character.

In table form:


type	Begun by	Continued by
number	usually by digit	almost anything
symbol	usually by letter	almost anything
punctuation	just quote	just the leftovers
whitespace	whitespace	whitespace

The classes of character

That gives us these classes of character:

(not brackets)

brackets are accounted for by EMSIP main.

letters

a-z, A-Z

digits

0-9

whitespace

whitespace

strong punctuation

just single-quote '

weak constituents

Includes definitely +, -, >, <, $, and probably all the other punctuation.

(illegal characters)

characters that just shouldn't appear.
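
As a rough illustration, the classes above could be written as a small classifier. This is only a sketch in Python: the names `CharClass` and `char_class` are made up here, the set of bracket characters that EMSIP main claims is an assumption, and "all the other punctuation" is read as the remaining printable ASCII punctuation.

```python
import string
from enum import Enum


class CharClass(Enum):
    BRACKET = "bracket"  # accounted for by EMSIP main, not by this parser
    LETTER = "letter"
    DIGIT = "digit"
    WHITESPACE = "whitespace"
    STRONG_PUNCTUATION = "strong punctuation"
    WEAK_CONSTITUENT = "weak constituent"
    ILLEGAL = "illegal"


# Assumption: the post does not say which brackets EMSIP main handles.
BRACKETS = set("()[]{}")
# Assumption: every other ASCII punctuation mark except single-quote is weak.
WEAK = set(string.punctuation) - BRACKETS - {"'"}


def char_class(c: str) -> CharClass:
    """Classify a single character according to the classes listed above."""
    if c in BRACKETS:
        return CharClass.BRACKET
    if c in string.ascii_letters:
        return CharClass.LETTER
    if c in string.digits:
        return CharClass.DIGIT
    if c in " \t\n\r\f\v":
        return CharClass.WHITESPACE
    if c == "'":
        return CharClass.STRONG_PUNCTUATION
    if c in WEAK:
        return CharClass.WEAK_CONSTITUENT
    return CharClass.ILLEGAL
```

With these assumptions, `char_class("-")` is a weak constituent, `char_class("'")` is strong punctuation, and anything outside printable ASCII falls into the illegal class.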

The parse states

Now we have just these major parse states:

uncommitted

Entered by weak constituents.

This state can be terminated by anything except weak constituents.

Becomes part of something else depending on how it's terminated.

By digit: becomes part of a number
By letter: becomes part of a symbol
By strong punctuation: becomes part of a punctuation group
By whitespace or bracket: makes a symbol of just itself.

whitespace

Entered by whitespace.

This state can be interrupted by anything except whitespace.

Makes a whitespace object whose tail transformer obliterates it.

punctuation group

Entered by strong punctuation.

Makes a punctuation group.

This state can be terminated by anything except weak constituents.

committed

These are two similar states, number and symbol.

Entered by digit or letter

This state can be interrupted only by whitespace, strong punctuation, or closing bracket.

Makes a number or symbol

In table form:


State	Makes	Entered by	Interruptable by
Uncommitted	(Depends)	Weak constituents	All but weak cons
Whitespace	whitespace	whitespace	All but whitespace
Punctuation	Punctuation group	Strong punctuation	All but weak cons
Committed	number or symbol	digit or letter	Whitespace, strong punctuation, closing bracket
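
The table can be read as a small hand-written state machine. What follows is a sketch in Python under some assumptions, not EMSIP's actual implementation: the names `classify`, `continues` and `tokenize` are invented, the input is assumed to contain no brackets (EMSIP main would have consumed them already), and weak constituents are taken to be ASCII punctuation other than brackets and single-quote.

```python
import string

# Assumption: weak constituents are ASCII punctuation minus brackets and quote.
WEAK = set(string.punctuation) - set("()[]{}'")


def classify(c):
    if c in string.ascii_letters:
        return "letter"
    if c in string.digits:
        return "digit"
    if c in " \t\r\n":
        return "whitespace"
    if c == "'":
        return "strong"
    if c in WEAK:
        return "weak"
    raise ValueError(f"bracket or illegal character: {c!r}")


# Which state each class of character enters from scratch.
START = {"weak": "uncommitted", "whitespace": "whitespace",
         "strong": "punctuation", "digit": "number", "letter": "symbol"}
# What an uncommitted run becomes, keyed by the class that terminates it.
RESOLVE = {"digit": "number", "letter": "symbol", "strong": "punctuation"}


def continues(state, cls):
    """True if a character of class cls extends the current state."""
    if state in ("uncommitted", "punctuation"):
        return cls == "weak"
    if state == "whitespace":
        return cls == "whitespace"
    return cls not in ("whitespace", "strong")  # committed: number or symbol


def tokenize(text):
    tokens, state, buf = [], None, ""

    def emit():
        # A lone uncommitted run makes a symbol of just itself.
        tokens.append(("symbol" if state == "uncommitted" else state, buf))

    for c in text:
        cls = classify(c)
        if state is not None and continues(state, cls):
            buf += c
        elif state == "uncommitted" and cls in RESOLVE:
            state, buf = RESOLVE[cls], buf + c  # absorbed into what follows
        else:
            if state is not None:
                emit()
            state, buf = START[cls], c
    if state is not None:
        emit()
    return tokens
```

For example, `tokenize("-6 ->foo 'bar")` yields a number `-6`, a symbol `->foo`, a punctuation group `'` and a symbol `bar`, with whitespace tokens in between.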

suitable SREs

These SREs¹ assume the classes of character as primitives. They assume longest-match parsing so `uncommitted' will be part of a larger construct when possible.

uncommitted

(+ weak-constituent)

whitespace

(+ whitespace)

punctuation-state

(seq strong-punctuation (* weak-constituent))

committed-number

(seq digit (* (~ strong-punctuation whitespace)))

committed-symbol

(seq letter (* (~ strong-punctuation whitespace)))

number

(seq (? uncommitted) committed-number)

symbol

(seq (? uncommitted) committed-symbol)

punctuation

(seq (? uncommitted) punctuation-state)

thing

(* (or whitespace number symbol punctuation uncommitted))

(overall)

(* thing)
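
For readers who don't have an SRE implementation handy, here is roughly how the same patterns might look with Python's `re` module. This is a sketch under assumptions: the weak-constituent set is guessed as ASCII punctuation minus brackets and single-quote, `tokens` is a made-up helper, and because Python alternation takes the first match rather than the longest, `uncommitted` is listed last so the larger constructs get tried first.

```python
import re
import string

LETTER = "a-zA-Z"
DIGIT = "0-9"
SPACE = r" \t\r\n"
STRONG = "'"
# Assumption: weak constituents are ASCII punctuation minus brackets and quote.
WEAK = re.escape("".join(sorted(set(string.punctuation) - set("()[]{}'"))))

weak = f"[{WEAK}]"
# (~ strong-punctuation whitespace): anything that does not interrupt a
# committed number or symbol, read literally from the SRE.
rest = f"[^{STRONG}{SPACE}]"

TOKEN = re.compile(
    f"(?P<whitespace>[{SPACE}]+)"
    f"|(?P<number>{weak}*[{DIGIT}]{rest}*)"
    f"|(?P<symbol>{weak}*[{LETTER}]{rest}*)"
    f"|(?P<punctuation>{weak}*{STRONG}{weak}*)"
    f"|(?P<uncommitted>{weak}+)"
)


def tokens(text):
    """Split text into (kind, lexeme) pairs, failing on unmatched input."""
    out, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        out.append((m.lastgroup, m.group()))
        pos = m.end()
    return out
```

Here a trailing lone `+` comes out tagged `uncommitted`; turning it into a symbol of just itself, as the parse states describe, would be a separate step.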

Operating on promised tails

One thing about EMSIP parsers is that they include transformers that operate on a promise of their tail. This lets us do some important things neatly, such as skip comments and splice character constants into strings. It also implies that read gives us a stream of objects, rather than returning one object and being called repeatedly. We'd use essentially this mechanism for some business in the source parser.

Whitespace objects use this mechanism to remove themselves. Their tail transformers just return the tail.

Punctuation groups will also operate on promises of their tails. This lets them handle all the "funny business" that traditional sexp syntax does with quote and its friends.

Numbers and symbols don't do anything special, they just cons themselves onto the tail.
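
To make the idea concrete, here is a toy model of tail transformers in Python. It is a sketch only and makes several assumptions: a "promise" is modelled as a zero-argument function, the resulting stream as a plain list, and the names `atom`, `whitespace`, `quote_group` and `read_all` are invented for the illustration. The quote handling shown (wrapping the next object) is one guess at what a punctuation group might do.

```python
def atom(value):
    """Numbers and symbols: cons themselves onto the forced tail."""
    return lambda tail: [value] + tail()


def whitespace(tail):
    """Whitespace: obliterate itself by returning just the tail."""
    return tail()


def quote_group(tail):
    """A lone quote: wrap the next object of the tail in (quote ...)."""
    rest = tail()
    if not rest:
        raise ValueError("quote with nothing after it")
    return [["quote", rest[0]]] + rest[1:]


def read_all(transformers):
    """Run each transformer on a promise of everything that follows it."""
    def from_index(i):
        if i == len(transformers):
            return []
        # The lambda is the promise: the tail is not parsed until forced.
        return transformers[i](lambda: from_index(i + 1))
    return from_index(0)


# 'foo -6  ==>  [["quote", "foo"], -6]
result = read_all([quote_group, atom("foo"), whitespace, atom(-6)])
```

Because each transformer decides when and whether to force its tail, the same shape would also cover skipping comments or splicing into strings.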

And that's it!

And that's all we have to do to get a reasonable source parser under EMSIP.

Footnotes:

¹ For more on SREs, see SRE or a very similar idea, Irregexes

21 November 2010
