Tehom: First notes on implementing EMSIP

Implementing EMSIP in Scheme

What

These are my older notes on implementing EMSIP in Scheme. They give a flavor of my thinking, warts and all.

There's a fair bit of rethinking in here, and many parts are just obsolete.

Some ideas

Iterate over a sequence of characters

This allows all the lazy reading we could want.

The iterated proc mostly just checks for magic characters

Or more carefully, for attention-string followed by any magic character, and for non-empty attention-strings, also wraps characters in a different way that other parts recognize.

It builds a lazy structure as follows:

A left bracket

Calls the respective reader. That's one of 3 predetermined readers, which mutate into the actual readers.
Feeds the current reader the return value of the new reader as an object. Since Scheme supports lazy evaluation, this can be done without order-of-evaluation problems.

A right bracket

(Not immediately needed) Check for bracket agreement; error if bracket does not agree with initial bracket.
Terminates the current reader's input
- And (as above) terminates the current read scope.

???

The comma is not special here, it's just treated specially by 2 of the built-in readers.

???

(Not immediately needed) For a non-magic character following a non-empty attention-string, feed it to the current reader as a semi-magic char.

Any other character

Feed it to the current reader as a simple char.

Mode of iteration

We can't just loop thru the characters, because different conditions apply at different points, and these conditions are hierarchical.

Instead, we are reading with specific state data, and sometimes recursing into another call that "uses up" a stretch of characters.

So each call would:

Take the initial stream
Take the state data
Return the remaining stream
Return the object that was read

So this call corresponds to a nested reader. It gets or immediately sets up a reader and uses that reader for its whole (non-nested) scope.

And we use this by:

When we call a new reader,
- Give it state data
  - probably including the new reader
- replace the stream with its promised return
- Use its object return as the value to feed to the current reader.

To make the laziness work wrt the end-bracket termination, we'll pass the current continuation, and the iterator never stops, it just calls that continuation.

We also want to fill the list in a natural way. That suggests that rather than iterating as such, we are filling a list and making further promises.

Argument structure of the built-in readers

Takes a lazy stream
- Element types can be, all distinct:
  - simple character
  - (Not immediately needed) semi-magic character
  - object, as returned by some other reader.
  - A promise which may be any of the above
  - A discriminated promise, which knows which type it is but not what it is.
  - And (not formally an element) end-of-list.
May also take further arguments, all objects:
- Usually, as when reading a list, takes the next object as a promise
  - Must allow an indication of termination, possibly #f.
  - When possible, should treat that object as opaque; don't force it as a promise. That's not always possible, which is why that object would be an argument.
Returns an object

Flow control

Let them be promises.

Let both object and stream be promises.

Order of evaluation

Wall-clock order of evaluation of a nested call:
- Call nested emsip-read
- Nested emsip-read formally returns
- Current read goes on to the remainder of the read (but may not have the stream)
- Current read forces the stream
- Nested emsip-read tries to finish
- Nested emsip-read finishes stream, makes it available
- Nested emsip-read may ask for some or all of current read's remainder of the read (often just the first object)
- Current read, certainly now knowing stream, reads further,

Pass xformers around.

The returned object becomes a chain of promised xformers.

What nested read returns

Of course its final state of the stream. But what else to return is conflicted:

OT1H, wants to return the object, letting reader treat it any way that it likes as it builds its ultimate returned object.
OTOH, wants to call an xformer on the object and a tail object.

We could let it return either, and reader can use it as an xformer if that's appropriate. So reader must know what it's getting. And reader must supply itself an xformer when none is provided.

Or return a tagged thing, arbitrary tag that other readers should understand.

Or predefine some tags, so it can be `xformer' or `object'. And every reader tells us which it supplies - really, we pass it thru.

It is up to readers to be suitably lazy and to call xformers.

Optimization: Skipping forward

Sometimes we'd like to can skip past characters we know aren't relevant. Some cases:

Non-magic characters, skip to the repective closing bracket. This reader's value remains just a promise and may yet be forced.
General skip-past-chars as dictated by reader, but of course still never skipping past magic characters.
- So the other is a special case of this.

How to communicate it? Partly just by not forcing reads. Perhaps also as a read parameter.

Rethink: May need to allow more nested levels

The simple one that only sees comma as separator. This will be built by `emsip-make-list-reader'. One xformer only applies at single level here. This is basically for "repeat". But there are 2 levels in our basic reader: Objects and characters. Or we could just accumulate characters into lists. What's not nice is that the 2 levels are entangled with each other. Maybe intercept just comma and magicify it, then have really 2 levels. Init the inner xformer with `list->string'. Does the outer level (a) LEANING_TOWARDS pass in a false end, or (b) NOT_CHOSEN act after the inner level passes on something? And it would make an object, as any other level does. But dual levels is just what I hoped to avoid here.

Perhaps reader would tell upper level about separators?

But it's really alternating string-constitutent and non-string-constitutent.

So very likely a new design: Push-down layers which always include the brackets. But comma is just a separator. It pops (like an end bracket), but higher level immediately pushes another. And hopefully the basic readers won't need anything more than that and sequence (which also pushes and makes non-match pop).

Readers forcing downwards

The previous item demonstrates that readers may want to immediately start a nested read. That gives opaques still a chance to organize strings rather than having pieces embedded in them. Do this before end-brackets - repeat-readers do expect to go to a second level first. But they shouldn't eat the forthcoming char

Repeat-readers do this. Separating repeat-readers don't have a way to return zero elements, nor should they.

Checking for matching brackets

Built-ins simply know the end-bracket char they expect. Readers probably shouldn't set up their own end-char expectations, and even if they did, they are passed their starting char which suffices to figure out the end char.

So there's nothing special to be done when starting a nested read.

Maybe something special to be done when ending one. That's most neatly done by passing the final char to `afterwards', the termination communication function.

Raise error when `emsip-object-stream-force1' gets empty stream? No.

Do not raise error when `emsip-object-stream-force1' gets an empty stream, because this condition might occur in innocent places such as a forced tail after a comment.

I may rethink that, though.

How to return the state of the stream

In normal use, `emsip-object-stream-force1' returns all the info it needs to. It returns a stream with the car forced, and the tail promise knows the state of the input stream.

But when it hits an end-bracket which ends the output stream, or hits the end of the input stream, it needs to communicate the new state of the input stream. We can no longer use the return value to communicate it, because that has to be the null stream.

Would be nice to have:

This manages bracket-match checking
This gives error when it is not set up, corresponding to a read that popped past the top level.

Possible approaches:

Call a continuation of something in the caller. emsip-object-stream-force1 would call it just when it hit an end, passing the new state of the input stream.
The problem is that we need control to exit this call by its normal route so that the output stream gets assigned its right (forced) value.
Return two values, where the first is the output stream and the second one (usually redundant) is the input stream.
It's also difficult or impossible to ensure that the calls we're interested in gets the values. In many cases there are other callers, usually delayed intermediate calls.

And it's ugly to always return a value that usually isn't used and is usually redundant to another value.
Set a boxed value to the new state of the input stream.
That's clumsy for the caller to set up.
Call a function that sets the value.

31 August 2010

First notes on implementing EMSIP