21 February 2011

Emsip sexps redux

A few months ago I blogged about Emsip. I proposed that an Emsip interpreter could neatly handle source code by farming the specialized parts off to special {}-interpreters and handling the core {numbers,symbols,punctuation} with just a few classes and states.

It occurs to me that my design might be made even simpler. Instead of punctuation grouping at the beginning of a sequence of weak constituents, let it group at the end.

That lets the object-making states be almost entirely regular. There are no longer special conditions for what interrupts them, except for the state Uncommitted, which is interrupted by anything other than a weak constituent.

That means that punctuation can appear directly before symbols,

'a   +:'b 
;;quoted symbol "a", "+:'" applied to symbol "b".

but not directly after, because the punctuation constituents will be incorporated into the symbol:

a'   b+:' 
;;Two symbols, "a'" and "b+:'"

This is reasonable, since it covers the common use case and because the purpose of punctuation is to modify following sexps, not preceding ones.

The new design

We have these states:

State         Interruptable by
No object     All but whitespace
Uncommitted   All but weak constituents
Committed     Only whitespace

The classes of character are the same as before, and now have these transition behaviors:

Class                Object begun        New state
Whitespace           Nothing             No object
Weak constituent     (Possible symbol)   Uncommitted
Strong punctuation   Xforms the tail     No object
Letter               Symbol              Committed
Digit                Number              Committed

We make an object when we enter the No-object state. But that is really a reversed reversal. The real rule is to make a new object when we leave the union of the other two states. So to nail down the details:

  • Beginning to read the sexp doesn't make an object, even though we start reading in No-object state.
  • Leaving the sexp makes an object unless state was No-object. So it's as if the sexp read is surrounded by No-object state.
  • Strong punctuation makes an object even if the previous state was No-object, as if we pass through the Committed state momentarily before re-entering the No-object state.
  • Nested reads (which are handled by EMSIP outside of here) enter No-object before they begin, make one object, and leave state as No-object.
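The rules above can be sketched as a small state machine. This is only an illustration in Python, not the EMSIP implementation. The character classes are not restated in this post, so `classify` and `STRONG_PUNCTUATION` are assumptions (quote and backquote are strong, anything that is not whitespace, a letter, a digit, or strong punctuation is weak), and nested reads are left out.

```python
from enum import Enum, auto


class State(Enum):
    NO_OBJECT = auto()
    UNCOMMITTED = auto()
    COMMITTED = auto()


# Assumption: which characters count as strong punctuation is not given here.
STRONG_PUNCTUATION = set("'`")


def classify(ch):
    """Hypothetical character classifier; the real classes may differ."""
    if ch.isspace():
        return "whitespace"
    if ch in STRONG_PUNCTUATION:
        return "strong"
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "weak"


def read_objects(text):
    """Return (kind, text) pairs, making an object whenever we leave
    the Uncommitted or Committed state."""
    objects = []
    state = State.NO_OBJECT
    kind = None
    buf = []

    def finish():
        nonlocal state, kind
        objects.append((kind, "".join(buf)))
        buf.clear()
        state, kind = State.NO_OBJECT, None

    for ch in text:
        cls = classify(ch)
        if state is State.COMMITTED:
            # Committed is interrupted only by whitespace.
            if cls == "whitespace":
                finish()
            else:
                buf.append(ch)
        elif cls == "whitespace":
            # A lone run of weak constituents becomes a symbol.
            if state is State.UNCOMMITTED:
                finish()
        elif cls == "strong":
            # Strong punctuation absorbs any weak prefix and always makes an object.
            buf.append(ch)
            kind = "punctuation"
            finish()
        else:
            buf.append(ch)
            if cls == "weak":
                state, kind = State.UNCOMMITTED, "symbol"
            else:
                state = State.COMMITTED
                kind = "symbol" if cls == "letter" else "number"

    # Leaving the sexp makes an object unless the state is No-object.
    if state is not State.NO_OBJECT:
        finish()
    return objects
```

Under these assumed classes, `read_objects("'a   +:'b")` yields the punctuation `'`, the symbol `a`, the punctuation `+:'`, and the symbol `b`, while `read_objects("a'   b+:'")` yields just the two symbols `a'` and `b+:'`.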

Suitable SREs

I'll update the SREs as well. And I'll trivially add `nested-object', which I left out last time. It would be supplied by the EMSIP framework.

These SREs assume `nested-object' and the classes of character as primitives. They assume longest-match parsing, so `uncommitted' will be part of a larger construct when possible (only the second alternative of `symbol' is at issue).

whitespace
    (+ whitespace-constituent)
uncommitted
    (+ weak-constituent)
number
    (seq (? uncommitted) digit (* (~ whitespace-constituent)))
symbol
    (or (seq (? uncommitted) letter (* (~ whitespace-constituent))) uncommitted)
punctuation
    (seq (? uncommitted) strong-punctuation)
thing
    (or whitespace number symbol punctuation nested-object)
(overall)
    (* thing)
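For readers more at home with string regexps, here is a rough translation of these SREs into Python's `re` syntax. It is a sketch under assumptions: the ASCII character classes below are made up for illustration, `nested-object' is omitted because it comes from the framework, and Python alternation takes the first alternative that matches rather than the longest. So the alternatives are ordered to give the longest-match result, with the bare `uncommitted' case (the hypothetical group `bare`) tried only after the others.

```python
import re

# Assumed character classes; the real ones are defined elsewhere.
WS = r"\s"
LETTER = r"[A-Za-z]"
DIGIT = r"[0-9]"
STRONG = r"['`]"
WEAK = r"[^\sA-Za-z0-9'`]"  # everything that is none of the above

THING = re.compile(
    rf"(?P<number>{WEAK}*{DIGIT}\S*)"
    rf"|(?P<symbol>{WEAK}*{LETTER}\S*)"
    rf"|(?P<punctuation>{WEAK}*{STRONG})"
    rf"|(?P<bare>{WEAK}+)"  # `uncommitted' alone: the second alternative of symbol
    rf"|(?P<whitespace>{WS}+)"
)


def things(text):
    """Return (kind, text) pairs for every non-whitespace thing in text."""
    out = []
    for m in THING.finditer(text):
        kind = m.lastgroup
        if kind == "whitespace":
            continue
        if kind == "bare":
            kind = "symbol"
        out.append((kind, m.group()))
    return out
```

With these classes, `things("'a   +:'b")` gives the punctuation `'`, the symbol `a`, the punctuation `+:'`, and the symbol `b`.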

Notes on numbers

When I say a "number" object is made, I don't mean that "number" must be a primitive type. On the contrary, I expect distinct types including at least integers, exact reals, and inexact reals.

What I mean is that all numbers have the form matched by `number' above. The matched substring is then parsed by another, very specialized regular-expression matcher.

Since we can just farm any misfits off to specialized {}-interpreters, we can afford to be very picky about what we treat here. My thoughts on what to support:

Positive vs negative
Of course.
Decimal point
Of course.
Scientific notation
I think so.
Bases
Maybe just decimal and hex.
The infinities and NaN
No. They are semantically numbers but have no digits. They gain nothing by sharing their syntax with this.
Rationals
Maybe. They are convenient.
Inexact vs exact
Probably. A default to exact integers and inexact reals is reasonable. But one would sometimes want exact reals, and it might be a nuisance to have to switch notations just for that. We can't use prefixes #e and #i since they make us read a symbol instead of a number, and trailing "e" is used by scientific notation. So perhaps trailing "+-" could indicate inexactness.
Uncertainty
Maybe. Having "+-", it's tempting to use "+-" followed by a (positive, exact) number to indicate uncertainty.
Precision
I.e., floats vs doubles vs perhaps bignums. No. That's really about object construction. If that sort of control is wanted, real object construction should be used, not syntax tricks.
Digit group separators
As in "1,000,000". Probably, since it almost falls out of this syntax. Since the comma isn't universally accepted in this role, let's accept the underscore too. But IMO going any further would add a great deal of complexity. So no provisions to treat the comma as decimal point in some locales, etc.
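To make the firmer items concrete, here is a sketch of what the specialized number matcher might look like in Python. It is an assumption-laden illustration, not a settled design: it covers only sign, decimal point, scientific notation, and comma or underscore group separators, and it follows the default of exact integers and inexact reals. The names `NUMBER` and `parse_number` are hypothetical. Bases, rationals, the "+-" marker, and uncertainty are deliberately left out, and group sizes are not checked.

```python
import re

# Hypothetical pattern for the substring already recognized as a number.
NUMBER = re.compile(
    r"""
    (?P<sign>[+-]?)
    (?P<int>[0-9]+(?:[,_][0-9]+)*)   # digits with optional group separators
    (?:\.(?P<frac>[0-9]+))?          # optional decimal part
    (?:[eE](?P<exp>[+-]?[0-9]+))?    # optional scientific-notation exponent
    """,
    re.VERBOSE,
)


def parse_number(text):
    """Return an int for integers and a float for anything with a
    decimal point or exponent; raise ValueError otherwise."""
    m = NUMBER.fullmatch(text)
    if m is None:
        raise ValueError(f"not a supported number: {text!r}")
    digits = re.sub(r"[,_]", "", m["int"])  # drop the group separators
    if m["frac"] is None and m["exp"] is None:
        return int(m["sign"] + digits)
    frac = m["frac"] or "0"
    exp = m["exp"] or "0"
    return float(f"{m['sign']}{digits}.{frac}e{exp}")
```

For example, `parse_number("1,000,000")` and `parse_number("1_000_000")` both return the integer 1000000, and `parse_number("1.5e3")` returns the float 1500.0.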
