20 December 2010

External testing frameworks revisited

Last week I blogged about external testing frameworks. Summary: There are (were) no good ones.

While writing the post, it occurred to me how I could extend my own Emtest testing framework to do this sort of thing. I hadn't designed Emtest for this kind of work, but I had designed it to be very modular and extensible.

Since then, I have gone ahead and added a runner to Emtest that does external testing. So I have to modify my previous position. This one, while it's young, is usable and you don't have to write in TCL.

What it is

It's called "external.el" and it's rather like "expect", but all in Elisp.

Its test cases look almost the same as the existing test cases. They are declared in `deftest-3', same as the others. They take a governor of `external' rather than `nil'. The test clause itself consists of, first, a list of bindings, and then a series of interaction forms. The bindings tell how to launch the external process and give other config information. The interaction forms each start with a string to be sent to the external process. They typically also contain a form that receives the response and tests it. The form can be omitted for prompts whose response is not tested.
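The shape of those interaction forms can be sketched in Python rather than Elisp (all names and data shapes here are illustrative, not external.el's actual API): each form leads with the string to send, optionally followed by a checker for the response.

```python
# A minimal sketch of interaction-form testing. run_interactions and
# the fake transport below are hypothetical stand-ins, not external.el.
def run_interactions(send, interactions):
    """Drive an external process via SEND, a function that takes a
    prompt string and returns the process's response string."""
    results = []
    for form in interactions:
        prompt = form[0]
        checker = form[1] if len(form) > 1 else None
        response = send(prompt)
        # A form may omit its checker when the response is untested.
        results.append(checker(response) if checker else True)
    return results

# A fake transport standing in for the launched external process:
fake_replies = {"(+ 1 2)": "3", "(version)": "klink 0.1"}
send = lambda s: fake_replies[s]

results = run_interactions(send, [
    ("(+ 1 2)", lambda r: r == "3"),
    ("(version)",),          # response deliberately not tested
])
```

In external.el the bindings would supply the real process launch; here the transport is just a dictionary.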

I am already using this for testing Klink. It's much nicer to use than DejaGNU was, and it works, whereas Greg doesn't work here.

org2blog

There's another org2blog

Yesterday I found out that there's another emacs package called org2blog, just like mine. It does roughly the same task, too, except that it publishes to Wordpress instead of to Blogger and other Atom API sites.

It looks like its conventions are almost all different. The only thing really the same is the "#+TITLE: " file property.

I emailed the author, suggesting that we should at least try to co-ordinate. I haven't heard back yet.

16 December 2010

Stow meets design docs

Can the stow model help me write design documents?

The problem

I like to document my designs for my projects, such as Klink and Emtest. In theory writing down the things I decide or do should create a record of the design.

But I always face a problem. The issues, insights and decisions don't come in an organized manner. Quite the opposite. They trickle in a stream-of-consciousness manner, a few at a time, bearing no strong relation to the organization of the design. I want to bring order to this chaos without spending more time on it than it's worth.

I've tried several approaches:

  • Buckling down and putting every new thing into its proper, final place. After doing that for too long, my conclusion is that it's far, far too much work, especially navigating to the right spot all the time.
  • Separate files for major task areas: Design, decisions, progress. Before I switched to org mode, I wrote a meta-design of how to organize and some helper code and tried to stick to it. Again, after too much time and effort trying to keep it on track, I concluded that this just took far more work than it was worth.
  • Tried out John Wiegley's remember / planner / emacs-muse, but it really wasn't what I was looking for.
  • About this time I switched to Carsten Dominik's org-mode. It's wonderful for organizing text, but seemingly possesses no miracle big enough for this task.
  • Accumulating everything as notes and placing them in their final place in the outline later. Problems:
    • Later never comes.
    • Notes get lost in the big amorphous blob of notes.
    • Often I write a note that wants to modify the design, but that I might or might not ultimately accept. It isn't clear what should be done with it. I could move it into the design and maybe have to move it back out and try to restore the design to its old state, or leave it where it is, where it itself becomes an informal focus of later additions.

      I tend to choose the latter, so I am often finding recent notes and adding to them. This corrupts the file's organization too, so either way the design file degenerates.

  • Tagging each incoming note. I hoped that by doing this, I could filter on tags and thereby see all the items that should be moved to a place in the design section. However, that offers little for the pieces that are maybe in, maybe out.

    It's helpful up to a point. Theoretically I can filter on the tags, but in practice that hasn't done much for me. The tagset is simply nowhere near as fine-grained as my design outlines, and there's no reasonable way to make it so.

  • Splitting my design file template into information sections. "Design", "Issues", "Info", "Communication", "Progress", "Notes". Again, it's helpful up to a point, but doesn't solve my problem.

    One major problem is that items often evolve from one category to another. Eg, an issue will evolve into a task, which belongs in "Progress", or a design, which belongs in "Design". Progress items will spawn new issues, often as many as half a dozen. Decision items want to place one alternative into "Design" and archive the rest.

I'd really like to organize my designs even more strongly, separating out such facets as parts and behavior, and "cookbook" docs, etc. But without a much stronger approach, that's unlikely to happen.

What stow does

In order to describe my shiny new idea, I first have to describe how Stow works. Stow uses indirection. The parts are in one place (in subtrees of /stow/ or /usr/local/stow/) and the organized tree (/usr/local/bin/, /usr/local/man/man1/, etc) lives in a different place. Please ignore the fact that /usr/local/stow/ is really a location in the filetree - that's an inescapable bootstrap issue and poses little practical problem.

What stow does is tell the organized tree to hold the parts indirectly, by symlinks. So the executable file /usr/local/bin/my-exec really points at /usr/local/stow/this-package-0.1/bin/my-exec.

It figures out what to hold where from each package's internal organization. Ie, /usr/local/stow/this-package-0.1/bin/ will be symlinked from /usr/local/bin/, because after we get past the stow prefix /usr/local/stow/this-package-0.1/, what we see is bin/.
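That planning step can be sketched like so. This is a Python sketch; plan_stow and its arguments are hypothetical, but the paths match the example above.

```python
# Compute which symlinks the organized tree should hold for one
# package, purely from the package's internal organization.
def plan_stow(stow_prefix, package, target_prefix, package_entries):
    """Map each path inside the package (relative to its root) to the
    symlink the organized tree should hold for it."""
    plan = {}
    for rel in package_entries:
        # e.g. bin/my-exec is held at <target>/bin/my-exec, pointing
        # back into the stow area.
        plan[f"{target_prefix}/{rel}"] = f"{stow_prefix}/{package}/{rel}"
    return plan

plan = plan_stow("/usr/local/stow", "this-package-0.1", "/usr/local",
                 ["bin/my-exec", "man/man1/my-exec.1"])
```

Unstowing is just deleting the links the plan names, which is what makes swapping versions painless.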

Stow's advantages include:

  • One can swap packages in and out very freely
  • Consequently nearly all of the pain of managing package versions disappears. You install them to /usr/local/stow and you stow the version you want. Upgrade? Unstow the old one, stow the new one. Didn't like the upgrade? Reverse the procedure. Don't like the whole package any more? Unstow it and delete the subdirectory.
  • Also, packages' install procedures can be much simplified, though most packages don't take advantage of this because they want to install on non-stow systems too.

A possible solution

Now, this could be another shiny idea that proves to be more work than it's worth. But it works neatly for stow, so maybe it will work here.


My approach would have these parts in analogy to stow:

/stow/ contents
    Notes that mirror the existing structure of a document or subtree.
"Symlinks"
    Items that don't have their own contents but act as if they were other items.
Target documents
    Documents or outline items intended to contain the symlinks as well as some text themselves.
A stow command
    A command relating to a note that causes appropriate "symlinks" to it to be created in the /stow/ contents.

stowable notes

They are like org notes

These might be just ordinary notes, or they could be notes in a specific place. They should be creatable by org-remember, so they should live in a refile target.

Multiple stowables might live in a note

But it shouldn't be the case that each note corresponds to exactly one target location. A single coherent note may well want to add parts to different documents. Also, its target may be so deeply nested that mirroring the entire path to it would be very annoying.

So ISTM best for each stowable item to parameterize on a prefix. Presumably that prefix is stored in a property. ISTM the right interface is a command or commands to move singleton "directories" into and out of the prefix. Binding-wise, ISTM best to overload `org-shiftmetaright' and `org-shiftmetaleft' to do that, since their ordinary function is not useful here.

Having stowables not correspond to notes creates a problem that stow also has: How to recognize a parts directory when there can be more than one. We could just record "stowable", presumably as a tag, and not let new stowables encompass items tagged "stowable".

In stow, items don't need to know whether they are currently stowed or not. Here, for user convenience we'd probably want to at least tag stowed items "stowed".

Indicating the mirror structure

My first thought is to just mirror the headlines. It works but it's fragile for split trees. If the user renames an item in the target document, suddenly we can't find it anymore. Let's also have a backup: store the org-id of any node that our mirror structure splits on.

Creating the mirror structure

But we still haven't talked about creating the mirror structure. It would be a PITA to do manually. Doing it automatically with supervision doesn't seem difficult: it'd be much like the org-refile interaction. Presumably we'd proceed in two stages:

  • Creating a stowable
    • Interactively find a root item
    • Make a fresh item with the same headline
    • Store the root's id.
    • Tag the fresh item "stowable"
  • Adding an item underneath a stowable
    • Choose an item in the respective target directory (whether our item is stowed or not)
    • Make a fresh item with the same headline
    • Store its id.

I'd also like support to repair the mirror structure, if it goes astray. It'd go something like

  • Visiting a possibly broken item:
    • Find where the org-id points.
      • No match, or no id? Uh-oh, that's not good.
        • Is there a headline match under the parent?
          • If so, fix the id.
          • Otherwise proceed as if creating a new item, just it will have old contents.
      • Match found: Fix the headline, if needed.
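The repair walk above could look like this. It's a Python sketch over an illustrative data model (dicts with 'id' and 'headline'), not actual org-mode structures.

```python
# Repair one mirror item against the target document. TARGET_INDEX
# maps org-ids to target nodes; HEADLINES_UNDER_PARENT maps headline
# text to nodes under the expected parent.
def repair_item(item, target_index, headlines_under_parent):
    node = target_index.get(item.get("id"))
    if node is None:
        # No match, or no id: uh-oh. Try a headline match under the parent.
        fallback = headlines_under_parent.get(item["headline"])
        if fallback is not None:
            item["id"] = fallback["id"]          # fix the id
            return ("fixed-id", fallback)
        return ("create-new", None)              # proceed as if new
    if node["headline"] != item["headline"]:
        item["headline"] = node["headline"]      # fix the headline
        return ("fixed-headline", node)
    return ("ok", node)

# The stale id no longer resolves, but the headline still matches:
item = {"id": "stale-id", "headline": "Parsing"}
status, node = repair_item(item, {},
                           {"Parsing": {"id": "abc123", "headline": "Parsing"}})
```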

stow command

As with stow itself, there are a few issues under the surface:

  • Conflicts
  • Tree-splitting.

It's all fine to symlink /foo/bar/x into an empty directory, but what do we do when part of the path already exists in the target? If, eg, the target already contains /foo and not just /.

Same thing that stow does: We split the tree. Sometimes to do so we must turn stowed "symlink" directories into real directories whose components are now symlinks. We may even have to do that recursively. Usually that suffices, because the user's intention was not to make competing, overlapping parts.
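A sketch of the split step, modeling the outline as nested dicts (illustrative, not org-mode's real representation):

```python
# Turn a stowed "symlink" directory into a real directory whose
# components are now individual symlinks, as stow does when two
# packages must share a path. TREE[name] is ('link', target_dict).
def split_link(tree, name):
    kind, target = tree[name]
    assert kind == "link"
    tree[name] = ("dir", {child: ("link", sub)
                          for child, sub in target.items()})

tree = {"bin": ("link", {"x": "pkg1/bin/x", "y": "pkg1/bin/y"})}
split_link(tree, "bin")   # now a second package can add its own bin entries
```

Applied recursively, this is exactly stow's tree-splitting; conflicts only arise when two parts want the very same leaf.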

But sometimes it doesn't suffice. Two stowed parts might want to occupy the same position, and they can't both. Or maybe there is already text there.

Again we do as stow does: refuse to stow, report the conflict(s), and ask the user to resolve them manually. That's acceptable. If the user's design document really has such a structural conflict, there's no magic here to fix it. Of course he will need to actually change whatever is making the conflict.

It might be handy to have a tool to pull existing text that participates in a conflict into a new note or notes. The user could then resolve the conflict just by unstowing that text.

Integration with org-choose

org-choose is my old project to "choose" among org items, where choosing one item makes its siblings non-selected. It would be very nice to automatically stow the item that is selected, and unstow the alternatives (but not in that order). This probably wants its own tag, perhaps "stow-choice".

But what about impure choices, where some items build on others? For instance, in this very post, the item "Do the same in place" builds on "Weave a separate read-only document". Pretend for a moment that they are full stow-choice items.

Stowing "Do the same in place" by itself wouldn't work; what's it the same as?. Stowing "Weave a separate read-only document" would be just wrong. If the two items have no structure, we're just out of luck. The second item will just have to be edited to be stand-alone.

So also pretend for a moment that the user has restructured both items cleanly: ie, each sub-item pertains entirely to one or entirely to both. Now there is something potentially stowable there. We just have to see how to do it right.

One thought is that the derived item is marked to be stowed into the base item, and if chosen, cascades to stow first the base and then itself. But this can't be entirely right. It implies that the derived item also overwrites the base item's text. That would be wrong, because what happens if we then choose the base item? Its text would be gone, replaced by that of the derived item. In any case, I don't contemplate using this mechanism to override text.

One could insist that the user rewrite further to remove conflicts, but the items are in fact alternatives. Sometimes we're lucky and one alternative builds on another in such a way that there is no conflict, but we can hardly rely on that. Making them never structurally conflict seems contrary to the spirit of alternative choices.

So instead, let this base/derived situation inform the choice of what to stow but not cause any base text to be overwritten. We'd recognize such situations just if a "stow-choice" alternative pointed into a "stow-choice" alternative in the same set.

09 December 2010



Yesterday I was reading some fascinating posts on sigfpe's blog A Neighborhood of Infinity. In particular, Fun With Derivatives Of Containers

Fascinating stuff. The rest of this post is going to assume that you have read it, so check out the link.

One part got me blogging: He says "Pity we can't write X.F'[X]-F[X] to mean just the address, but '-' makes no sense in this context". In another place (somewhere) he (or someone) says that neither subtraction nor division makes sense for container types.

I got to thinking, that's not entirely true. Consider the case of a list from which we intend to pop an element. The input can be any list except the empty list. Following his notation, the type of a list of X's is:

1 + X + X^2 + etc

where '+' is understood as a discriminated union. Or equivalently,

sum(N,0,oo,X^N)
The type of the empty list is '1'. That's actually not specific to lists; 1 is the type of which there can be only one instance. Similarly, a boolean flag is of type 2, and an enumerated type of N elements is of type N.

So the possible inputs, a list other than the empty list, could be written:

sum(N,0,oo,X^N) - 1

And the type of `pop' as

sum(N,0,oo,X^N) - 1 -> sum(N,0,oo,X^N)

It's a little disconcerting that `pop' looks like "+1". When we look at it more closely, we see that's not really the case. In type theory, objects don't have identity beyond their type signature, so we can't see that all of the objects have moved one place closer to the list head. We can, however, let the upper limit not be infinity but merely approach infinity. Letting M be that limit, we see that `pop's type signature is actually:

sum(N,0,M,X^N) - 1 -> sum(N,0,M-1,X^N)

which we can stretch to write as:

sum(N,0,M,X^N) - 1 -> (sum(N,0,M,X^N) - 1) / X
So we can have some semblance of division!
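A quick numeric sanity check of that signature, giving X a k-element carrier:

```python
# Lists of length at most M number sum(N,0,M,k^N); removing the empty
# list leaves exactly k times the count of lists of length at most
# M-1, which is the "division by X" above.
def count_lists(k, m):
    """Number of lists of length 0..m over an alphabet of size k."""
    return sum(k ** n for n in range(m + 1))

k, m = 3, 5
nonempty = count_lists(k, m) - 1
```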

What's "e"?

Reading thru his other posts on the topic (highly recommended), I find that amazingly, the Taylor series for exponentiation is applicable to container types, ie

exp(X) = 1 + X/1! + X^2/2! + X^3/3! + etc

where a container of the form

X^N/N!
is interpreted as a bag of N elements of type X. The "N!" comes from the fact that they are insensitive to permutations, ie, there are N! ways to compose a list of N elements of type X from those.

So I asked myself, what's the equivalent of "e" in this system? And the answer seems to be,

exp(1) = 1 + 1/1! + 1^2/2! + 1^3/3! + etc

A bag of any number of elements, each of which is void. Amazing.
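Numerically, the partial sums behave as expected:

```python
import math

# Partial sums of exp(1) = 1 + 1/1! + 1/2! + ...: each bag size
# contributes one "void" inhabitant weighted by 1/N!, and the total
# converges to e.
approx = sum(1 / math.factorial(n) for n in range(20))
```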

02 December 2010

Can junk compaction be done with the Boehm GC?

Can junk compaction be done with the Boehm garbage collector?

When I first started this blog, I blogged about junk compaction, which is akin to garbage collection, but instead of throwing out objects that have no use at all, it transforms objects that have a lot of "scratch space" into a more compact form when they are inactive.

The Boehm garbage collector doesn't give us access to generational information, which was my original idea for implementing junk compaction. It does give us a fair amount of flexibility, though. So let's revise the idea: Don't try to figure out whether the junky object is stale. Just assume that it is stale if referenced only by junk-releasing references - akin to weak pointers.

The Boehm GC does give us an opening for this: Finalization can leave an object alive if it points to itself. And we can make weak links by hiding pointers.

First idea (not so hot)

  • Hold objects to be junk-collected only thru non-junk-releasing pointers and weak links.
  • So it will get finalized when it's no longer useful.
  • During finalization
    • Build the compact version (allocate it - is this OK during finalization in Boehm gc?)
    • Change the weak links into real links to the compact version.

The last is the hard part because it requires finding all links to the object, unhiding them, and changing them to point to the compact version. GC_register_disappearing_link could probably do it "under the hood", but it doesn't expose any such functionality. It would have to be prearranged on every link anyways, which is the reverse of what we want: the non-junk-releasing pointers are special, not the junk-releasing ones.

Second idea: Handles

It could work if junk-collected objects are only held by handles and normal (non-junk-releasing) pointers. The handle would hold it by a weak pointer and it would point to the handle by a normal pointer. Thus if nothing is holding the object directly (not thru the handle), it will soon be finalized and thus compacted.

When finalized, it would make the handle point to the compact object and indicate this. And of course make the pointer to the handle disappear.

Third idea: Handles that become the compact object

This still leaves a superfluous handle. Is there some way to telescope it into an actual reference? We must do so all at once, not piecemeal, so it doesn't mess up pointer comparison.

So instead, realloc the handle to be the size of the compact object and change it into that object.

This seems like it would work.
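Here's the whole protocol modeled in Python, where weakref stands in for the hidden pointer and __del__ for Boehm finalization. This is a sketch of the idea, not the C implementation.

```python
import gc
import weakref

class Compact:
    """The compact form: core data only, no scratch space."""
    def __init__(self, core):
        self.core = core

class Handle:
    """Holds the junky object only weakly; the object points back at
    its handle with a normal pointer."""
    def __init__(self, junky):
        self._weak = weakref.ref(junky)
        self._compact = None

    def become(self, compact):
        # The handle now stands for the compact object.
        self._compact = compact

    def deref(self):
        if self._compact is not None:
            return self._compact
        return self._weak()

class Junky:
    """A junk-collected object: a small core plus lots of scratch."""
    def __init__(self, core):
        self.core = core
        self.scratch = [0] * 1000
        self.handle = Handle(self)

    def __del__(self):
        # "Finalization": build the compact version and install it in
        # the handle, which becomes the object's sole incarnation.
        self.handle.become(Compact(self.core))

j = Junky(42)
h = j.handle
assert h.deref() is j   # while alive, the handle reaches the object
del j                   # nothing holds it directly any more...
gc.collect()            # ...so finalization compacts it
```

The realloc-in-place trick of the third idea has no Python analogue; here the handle simply forwards, which is the second idea's behavior.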

No good external testing frameworks

External testing frameworks, yuck!

Yesterday I looked at external testing frameworks. I didn't find a satisfactory one. Looking back, I'm not glad that I spent my time on that.

Klink is now at the point where it makes sense to script tests for it. But it's not yet ready for internal, self-driven tests. So I looked for an external testing framework. I tried several, and none were satisfactory.


Emtest

Let's get this out of the way first. Emtest is my emacs test harness. It does not do external testing. Making it do so would take more time than I want to spend on this. So I didn't look at it for this purpose.

But maybe I should have

In retrospect, doing so might have had more satisfying results than spending my time exploring the others. Now I'm wondering whether I should go and do that. It could run the external program as an inferior process. The major work would be writing the expect functionality.

But that's another topic. Right now I'm grousing. Er, I mean reviewing the current state of external testing frameworks.

DejaGNU and Expect

DejaGNU does at least use POSIX-conformant test results. That's a format that I don't entirely agree with, but it's at least more reasonable than some formats out there. I could live with it.

Expect scripts are in TCL, and DejaGNU wants you to write tests in TCL. I had written an expect script years ago, back in the days of dialup. I vaguely remembered it being a PITA, but I thought that maybe my memory was wrong. It wasn't wrong, it was understated.

So I gave Greg a try.


Greg

Greg seemed a lot more promising than DejaGNU. It is scripted in Scheme. Scheme vs TCL? No contest!

It is a little bit odd to be using Scheme for a Kernel project, especially since it wants Guile linked in. But I don't think it would be hard to convert it to Kernel.

It's also easier to control than DejaGNU. The command line options are pretty direct, and the documentation is better.

Unfortunately, Greg is badly out of date now[1]. In particular, out of the box, it is not capable of setting up the terminal-pair that it needs.

I inspected the code and made it set up the terminal-pair. After that it semi-ran but got the same error on every single pipe. Clearly there's something else that needs to be updated, but now it's less obvious.

I know that Guile converted the function that gives the error to return a wchar_t some years back, so I suspect that it may be involved. But that's buried deep in Guile's innards and I don't want to fool with it.

The other reason why I'm not fixing Greg is that at the moment Savannah's web interface is down, so I can't see whether there is a more recent version.

DejaGNU/Expect again

So I gave DejaGNU another try.

While there is a fair amount of documentation for DejaGNU, it's all about how to set it up cross-platform. I had to experiment to determine what it wanted. Key trick: it wants to be run in a directory whose subdirectories are named *.t00, *.t05, etc. It will then run all the *.exp files in those subdirectories. I only found that out by looking at an example.

More problems cropped up, related to TCL's stupid syntax. I got it working but I really don't look forward to maintaining this creature.

POSIX-conformant format - not great but tolerable

[This started out as a note where I first talked about DejaGNU, but it grew into a section on its own]

When I say POSIX-conformant test results, I don't mean that there's a section of POSIX 1003.3 specifying test result formats. You won't find one; I checked the TOC[2]. Apparently it's implied thruout.

What POSIX does right:

  • The major taxonomy for test results is almost right: PASS, FAIL, UNRESOLVED, UNSUPPORTED and UNTESTED.

    DejaGNU extensions add four more categories: two for expected failures (XPASS, XFAIL) and two for known bugs (KPASS, KFAIL).

What it does wrong:

  • Merges presentation format and report format. True, one could do further formatting on top of the given format, but that's a nitpick; the problem is that the merger cripples the report format. That leads to the next item.
  • No provision for extra information, just a test name.

    A test report format should allow and frame a fair bit of additional information: diagnostic information, additional taxonomic information (eg what caused a given dormancy), context information (test location, exactly how to rerun the test). This information should be in a very flexible format. The TESTRAL spec is a good model for what should be allowed.

    Some might argue that analytical information has no place in test reporting. I disagree. It's not just convenient, it belongs there. It is entirely reasonable to treat the set of test cases for diagnosis as overlapping with or identical to the relevant set of test cases for regression testing. Indeed it would be wrong to omit relevant diagnostic cases from the regression test suite, and pointless to construct new cases for diagnostics where test cases exist. So obeying DRY, those cases should not be repeated in code, they should be used from where they sit, in test clauses.

  • While the major headings of the taxonomy are nearly correct, POSIX doesn't place all results in the correct category. In particular, dormant results are placed into all the non-boolean categories: UNRESOLVED, UNSUPPORTED and UNTESTED. It also mingles dormant and bad-test results in UNRESOLVED.

    Dormant tests are tests that are present in some form but deliberately not run for any reason. There is a wide spectrum of possible reasons for dormancy. Some are extremely lightweight (user temporarily marked a test dormant), some are heavyweight (system configuration, in all its many guises).

    Splitting up dormant results into extra categories requires the test-runner to understand dormancy. That's not just extra work; in general, it can't. Dormancy can be marked for reasons far beyond what the test-runner could understand, and it's not even its job to understand them.

  • The DejaGNU extensions move analysis and presentation logic into the test-runner. That's wrong.

    However, it does make sense to recognize that a given test corresponds to a bug in a bug database, and to have persisting marks of expected failure, possibly even shared by a team. It shouldn't be the test-runner's job to do those things, however.

    Whose job is it then? I long ago analyzed essentially this question for Emtest. I believe the same holds true for these uses of persistence. My conclusion is that:

    • Persistence belongs on both sides of the test-report-presentation/test-runner interface
    • Ultimate control wrt using persistence belongs on the presentation side.
    • The test launcher has a read-only interest in the information: it can optionally skip running expected failures, but it shouldn't be required to be that smart. The launcher also belongs on both sides of the interface, with the higher-level launcher logic on the presentation side.
    • The test-runner has essentially no interest in persistence. At most, it incidentally calls libraries that know about persistence.
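As a sketch of the report shape argued for here (in Python; the field names are my guesses, not the actual TESTRAL spec):

```python
from dataclasses import dataclass, field
from enum import Enum

# Major categories, with a single category for all dormant results
# (the runner reports the reason but needn't understand it).
class Category(Enum):
    PASS = "pass"
    FAIL = "fail"
    DORMANT = "dormant"       # deliberately not run, whatever the reason
    BAD_TEST = "bad-test"     # the test itself couldn't run properly

@dataclass
class TestResult:
    name: str
    category: Category
    dormancy_reason: str = ""                        # free-form, from the marker
    diagnostics: list = field(default_factory=list)  # analytical information
    rerun_info: dict = field(default_factory=dict)   # exactly how to rerun it

r = TestResult("t1", Category.DORMANT,
               dormancy_reason="user marked it dormant")
```

Presentation-side taxonomy (expected failures, known bugs) then layers on top of this record instead of multiplying runner categories.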


[1] AFAICT, it was updated in an apparent one-time effort in 2006.

[2] I'm not paying umpteen hundred to buy a printed copy of the spec. I'm going by D J Delorie's writeup of it.

24 November 2010

Review of How Pleasure Works

Review of How Pleasure Works by Paul Bloom

This book has some interesting bits, but doesn't live up to its title. Bloom never articulates a coherent theory of how pleasure works. Nor does he really pull the individual chapters together into a unified theory of pleasure.

As he immediately tells you in the preface, he's not talking about simple animal pleasures, food and comfort, that sort of thing. He's talking about uniquely human pleasure, or pleasures with a uniquely human aspect.

The chapters each explore some aspect of how pleasure works. In each case, the chapter's mini-theory or theories explain some pleasure phenomena but can't explain others.

For instance, in chapter 6 (Imagination), the theory is that the pleasure we take in stories is a sort of extra-powerful daydreaming, more vivid than our real daydreams and amplified by the storyteller's skillful choices. But (as he admits) this is far from a complete theory of why we like stories.

When I was done reading this book, I asked myself what in it would explain the pleasure we take in music. My answer: none of its mini-theories. Chapter by chapter:

  1. The Essence Of Pleasure. This whole chapter seemed pretty insubstantial to me. "Essence" is far too elastic a term to carry any real explanatory weight.
  2. Foodies - about food. Not applicable.
  3. Bedtricks - about sex. Not applicable.
  4. Irreplaceable - No, essentially all music is copies of something. Not just copies of recordings of a performance, but performances themselves are often copies, too. The few times I've heard improvisation, while in some cases I appreciated the skill, it didn't strike me as particularly great music.

    You could say that a given opus is irreplaceable: What could replace Beethoven's 6th symphony or Eine Kleine Nachtmusik? But that seems to be stretching the point beyond reason. Were all (say) roses to perish from existence forever, we'd miss them too, but that doesn't mean a rose is irreplaceable.

  5. Performance - the mini-theory here is that our pleasure is an appreciation of skilled performance.

    For me, sure, though appreciating virtuosity is hardly the only pleasure I take from music. But many people who have zero musical sophistication enjoy music. Often they don't even know what instrument they are hearing played. How are they appreciating the performer's virtuosity?

    Furthermore, music is often used to accompany other performances - background music, overture, dance. Here most audiences are unlikely to appreciate the performer's skill or to even notice the music as performance.

  6. Imagination - stories explained as extra-powerful daydreaming. Doesn't seem applicable. One might try to force it to fit by arguing that music sometimes helps you daydream, but I think that's a stretch.
  7. Safety And Pain - Again largely about stories. Here the mini-theory is that we get to experience things we couldn't safely experience in real life. It's hard to see how that could apply to music.
  8. Why Pleasure Matters - this chapter mostly tries to tie pleasure to religion. Not applicable.

So How Pleasure Works did not explain music. It's often an interesting book, but far from a coherent theory of how pleasure works.

21 November 2010

Handling source in EMSIP

EMSIP source parser

I split this post in two. I was originally going to talk about EMSIP as a whole, which I do here. But right now I want to talk about one of those little parsers, the source code parser, which is designed to handle the main flow of source code.


  1. Simplicity
  2. Follow EMSIP's framework and spirit
  3. Where there exists a compelling "write it without thinking" expression for something, allow it.
    • Exception: infix syntax for algebra, which would conflict with (2)
    • For instance, allow:
      • quoting of expressions
      • distinguishing numbers from symbols just because of how they look.
      • prefixing numbers with minus sign, eg -6.
    • I emphasize that this rationale is limited. If you would have to think even for a moment about how to write an expression, it doesn't fall under this clause. This is only for expressions that people tend to automatically write before thinking. It's not meant for tending-towards-Perl shortcuts.

The objects to express

As I mentioned, essentially all the complexity of source code would be pushed out to little parsers:

  • comments
  • strings
  • radix numbers such as hex numbers
  • complex numbers
  • rationals
  • bytevectors
  • vectors
  • character escapes
  • (etc)

And obviously they're each pretty simple.

Relieved of the burden of representing every strange object, I think we can get by with just a few simple things:

  • decimal numbers, essentially as traditional.
    • Introduced by a digit, or by - or + followed by a digit.
    • Continued by digits, letters, and some punctuation.
    • We'd support just decimal integers and inexact decimal floats. Not much more - maybe exact floats. No attempt to represent complex numbers, rationals, numbers of various precision, etc. Those are to be supported by little parsers.
  • symbols, essentially as traditional.
    • Introduced by a letter. Also by -, +, $, or -> followed by a letter. (We'll extend this group lower down)
    • Continued by digits, letters, and some punctuation.
  • Punctuation groups
    • Introduced by quote, even in non-whitespace states

      For regularity, I'll say that - or + followed by punctuation can start a punctuation group.

    • Continued by other punctuation. Not continued by quote; quote starts another punctuation group.
    • Purpose: "funnier" manipulations of source, such as quoting the next sexp or dotting a list.
  • Whitespace

So the first character after whitespace, a bracket, or punctuation controls the type, with a few exceptions which must consider their next character.

In table form:

type         Begun by            Continued by
number       usually by digit    almost anything
symbol       usually by letter   almost anything
punctuation  just quote          just the leftovers

The classes of character

That gives us these classes of character:

(not brackets)
    brackets are accounted for by EMSIP main.
letters
    a-z, A-Z
strong punctuation
    just single-quote '
weak constituents
    Includes definitely +, -, >, <, $, and probably all the other punctuation.
(illegal characters)
    characters that just shouldn't appear.

The parse states

Now we have just these major parse states:

uncommitted
Entered by weak constituents.

This state can be terminated by anything except weak constituents.

Becomes part of something else depending on how it's terminated.

  • By digit: becomes part of a number
  • By letter: becomes part of a symbol
  • By strong punctuation: becomes part of a punctuation group
  • By whitespace or bracket: makes a symbol of just itself.
whitespace
Entered by whitespace.

This state can be interrupted by anything except whitespace.

Makes a whitespace object whose tail transformer obliterates it.

punctuation group
Entered by strong punctuation.

Makes a punctuation group.

This state can be terminated by anything except weak constituents.

committed
This is really two similar states, number and symbol.

Entered by digit or letter

This state can be interrupted only by whitespace, strong punctuation, or closing bracket.

Makes a number or symbol

In table form:

State        Makes              Entered by          Interruptible by
-----------  -----------------  ------------------  ------------------------------
Uncommitted  (depends)          weak constituents   all but weak constituents
Whitespace   whitespace         whitespace          all but whitespace
Punctuation  punctuation group  strong punctuation  all but weak constituents
Committed    number or symbol   digit or letter     whitespace, strong punctuation

suitable SREs

These SREs1 assume the classes of character as primitives. They assume longest-match parsing so `uncommitted' will be part of a larger construct when possible.

uncommitted:       (* weak-constituent)
whitespace:        (* whitespace)
punctuation-state: (seq strong-punctuation (* weak-constituent))
committed-number:  (seq digit (* (~ strong-punctuation whitespace)))
committed-symbol:  (seq letter (* (~ strong-punctuation whitespace)))
number:            (seq (? uncommitted) committed-number)
symbol:            (seq (? uncommitted) committed-symbol)
punctuation:       (seq (? uncommitted) punctuation-state)
whole text:        (* (or whitespace number symbol punctuation uncommitted))
                   = (* thing)
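The SREs above translate directly into ordinary regexes. The following sketch (mine, not the post's; the character classes are simplified assumptions, with brackets excluded from continuation characters) implements the longest-match tokenization:

```python
import re

# Alternatives tried in order; each branch corresponds to one SRE above.
TOKEN_RE = re.compile(r"""
    (?P<whitespace>\s+)
  | (?P<number>[-+$><]*[0-9][^'\s()]*)
  | (?P<symbol>[-+$><]*[A-Za-z][^'\s()]*)
  | (?P<punctuation>'[-+$><]*)
  | (?P<uncommitted>[-+$><]+)
""", re.VERBOSE)

def tokens(text):
    """Yield (type, lexeme) pairs, longest match first."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError("illegal character at %d" % pos)
        yield m.lastgroup, m.group()
        pos = m.end()

print(list(tokens("-12 foo ->bar")))
```

Because `number` and `symbol` branches each absorb a leading run of weak constituents, an `uncommitted` run becomes part of the larger construct whenever the next character allows it, exactly as the prose describes.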

Operating on promised tails

One thing about EMSIP parsers is that they include transformers that operate on a promise of their tail. This lets us do some important things neatly, such as skip comments and splice character constants into strings. It also implies that read gives us a stream of objects, rather than returning one object and being called repeatedly. We'd use essentially this mechanism for some business in the source parser.

Whitespace objects use this mechanism to remove themselves. Their tail transformers just return the tail.

Punctuation groups will also operate on promises of their tails. This lets them handle all the "funny business" that traditional sexp syntax does with quote and its friends.

Numbers and symbols don't do anything special, they just cons themselves onto the tail.
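The tail-transformer idea can be sketched in a few lines. This is my illustration, not EMSIP code: the "promise" is a zero-argument thunk, and each object decides what to do with the stream that follows it.

```python
def read_stream(objs):
    """Turn a list of parsed objects into the final object list,
    letting each object transform the promise of its tail."""
    def tail_from(i):
        def promise():
            if i >= len(objs):
                return []
            return objs[i].transform(tail_from(i + 1))
        return promise
    return tail_from(0)()

class Whitespace:
    def transform(self, tail):
        return tail()                 # obliterate itself: just the tail

class Atom:
    def __init__(self, value): self.value = value
    def transform(self, tail):
        return [self.value] + tail()  # cons itself onto the tail

class Quote:                          # a punctuation group's "funny business"
    def transform(self, tail):
        rest = tail()
        return [["quote", rest[0]]] + rest[1:]  # wrap the next object

print(read_stream([Quote(), Whitespace(), Atom("x"), Atom("y")]))
# → [['quote', 'x'], 'y']
```

The quote object works even though whitespace intervenes, because the whitespace object has already removed itself from the promised tail by the time quote forces it.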

And that's it!

And that's all we have to do to get a reasonable source parser under EMSIP.


1 For more on SREs, see SRE or a very similar idea, Irregexes

Introducing EMSIP


Some motivation

I like s-expressions

I like s-expressions. They don't make you memorize new syntax for every new feature someone added. Editors and other tools can almost always make sense of them - with heavier syntaxes, there's always someone who thumps their chest and claims that they "could write a parser for it in an afternoon, no problem!" but when you look for actual existing tools that work on a given heavy syntax, finding them is a different story.

Not only can editors handle s-expressions, you can write code to process them in new ways without ever having to wrestle with a parser. In both Scheme and Common Lisp, the code to read an s-expression is just `read' - can't get much simpler than that.

And I have philosophical reasons for liking them. S-expressions are regular, regular is clean, and clean is good.

But they could go further

Quick, where does this s-expression end?

#mynewreadermacro a b c () ) () "["]{} 52:abcde

Of course you can't tell. In Lisp (Common Lisp, that is) there are many dozens of standard reader macros, and new ones can be added.

And it's not just reader macros. There are a few built-in extra bumps:

  • Double-quotes for strings
    "A string"
  • Literal symbols in Common Lisp, like
    |One literal symbol|
  • Prefix operators such as quote, backquote, and unquote.

Scheme is better but headed worse

But surely Scheme is clean and small, right? Well, the impetus to write EMSIP up and post it came from Scheme. I've just been looking at the R6RS spec, and I'm stunned at how complex the Scheme syntax has become. Not up to the complexity of Common Lisp's full syntax spec, but headed in that direction.


I propose instead an idea that I've been developing for some time. I call it EMSIP. The idea is to have:

  • brackets (3 types and a comma)
  • A lot of little parsers that cannot violate brackets. They are expected to only parse regular expressions.

Think of it as a case of separation of concerns. No little parser needs to care how any other little parser behaves. It need only interface properly with EMSIP.

It may sound like not enough to implement a proper programming language, but it is. When I get more round tuits, I hope to add EMSIP into Klink, the interpreter I'm writing for Kernel.

Essentially all the complexity would be pushed out to little parsers:

  • comments
  • strings
  • radix numbers such as hex numbers
  • complex numbers
  • rationals
  • bytevectors
  • vectors
  • character escapes
  • (etc)

EMSIP main syntax

In EMSIP there are only seven magic characters:

  • open parenthesis
  • close parenthesis
  • open curly braces
  • close curly braces
  • open square brackets
  • close square brackets
  • comma

Each of them always has the same structural effect, so even a naive processing tool can tell where structures begin and end. All bracket types consume exactly the text from the opening bracket to the matching closing bracket, never more and never less. So even if an editor or other tool knows nothing about the syntax, it knows where a parser's scope ends.

The little parsers have their own internal syntaxes which:

  • apply only to their scope.
  • cannot override brackets, but treat the comma as non-magic.
  • Can only do regular expressions.
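Since no little parser can override brackets, a tool that knows only the seven magic characters can already find where any parser's scope ends. A sketch of that naive scanner (mine, not from the post):

```python
OPEN = {"(": ")", "{": "}", "[": "]"}
CLOSE = set(OPEN.values())

def scope_end(text, start):
    """Given the index of an opening bracket, return the index just past
    its matching close.  Everything else -- including commas and whatever
    the little parsers do internally -- is ignored."""
    stack = [OPEN[text[start]]]
    i = start + 1
    while i < len(text):
        c = text[i]
        if c in OPEN:
            stack.append(OPEN[c])
        elif c in CLOSE:
            if c != stack.pop():
                raise ValueError("mismatched bracket at %d" % i)
            if not stack:
                return i + 1
        i += 1
    raise ValueError("unclosed bracket")

print(scope_end("{a (b [c]) d} e", 0))  # → 13
```

Contrast this with the reader-macro example earlier in the post, where no such tool is possible.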

More to come

I'll write more about EMSIP later, I hope.

20 November 2010

reading and writing

Thoughts on reading and writing in Kernel

This is an email I sent to John Shutt, designer of Kernel, which I am writing an implementation for. I thought I'd put it up for public scrutiny and comments.

The email

I am implementing printing and reading for Klink now, and it made me think about printing and reading objects in general, which I know you were looking for a general solution to.

From the partial state of printing, it's already clear that operatives frequently present themselves to be printed as bare objects and not as symbols evaluating to themselves.

So I'm proposing that as a fallback before we resort to printing an object as "#<whatever>" we do the following:

  • When printing an object:
    • find a visible symbol bound to it (if that's not possible, we can't do this)
      • NB, this is for objects, not for symbols that are intended to be captured, eg symbols bound inside a $let.
    • print an indication of an environment it can be found in and that environment's state. I'm thinking of a hash or a standard library path.

      We print the state to avoid problems mutation could cause. If we printed, rebound the symbol, and read, the result might not be equal? to the object.

      That environment is not necessarily the environment the symbol is bound in, nor necessarily the current environment. It need only be one the symbol is visibly bound to the object in. It could be a standard library because read doesn't bind anything in that environment nor give access to the environment.

      That indication of environment might encompass a sexp of arbitrary size.

    • print an indication of that symbol (not necessarily the same as printing that symbol)
  • When reading:
    • know an environment to find symbols in.
    • when reading the symbol indication above, look it up in that environment
    • In response to environment indications, change to the right reading environment. It's an error if it doesn't exist or it's in a different state.
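To make the read side concrete, here is a heavily hedged sketch of mine — the dict shapes, names, and the idea of identifying an environment's state by a hash key are all my assumptions, not Kernel's or the proposal's:

```python
# Environments the reader knows, keyed by a state hash.
KNOWN_ENVS = {
    "34fe93ab5539d9bc8": {"foo": "<operative foo>"},
}

def read_object(env_decls, obj_path):
    """env_decls maps a local label to an environment-state hash;
    obj_path is (label, symbol).  Look the symbol up in the indicated
    environment; error if that environment state isn't available."""
    label, symbol = obj_path
    env_hash = env_decls[label]
    if env_hash not in KNOWN_ENVS:
        raise ValueError("unknown environment state: " + env_hash)
    env = KNOWN_ENVS[env_hash]
    if symbol not in env:
        raise ValueError("unbound symbol: " + symbol)
    return env[symbol]

decls = {"mod-1": "34fe93ab5539d9bc8"}
print(read_object(decls, ("mod-1", "foo")))
```

Note that nothing here is evaluated; reading only performs lookup, which is the "not an issue" point in the drawbacks list below.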

It has some drawbacks:

  • Con: It introduces reverse-association into environments.
    • Mitigating: Introduces only what list-bindings would introduce, because one could just try every binding.
  • Not an issue: Finding multiple bindings for the same object. We can just use the most convenient one.
  • Con: Naively implemented, requires a lot of searching to find an object's symbol.
    • Mitigating: Need not be naively implemented.
  • Con: We need to control what counts as the visible environment for read and print.
    • Mitigating: That's easy.
    • Pro: OTOH that gives us some extra control.
    • Pro: G1 implies eliminating the special case of one predetermined reading environment.
  • Con: We need to calculate and represent an environment's state uniquely.
    • Calculating it is doable, but it's work.
    • Mitigating: There are ways to represent it: hashes, standard libraries.
  • Not an issue: Causing evaluation during read. This only causes symbol lookup.
  • Con: There'd be 2 ways of printing what are now printed as symbols. For someone writing forms, that's an error opportunity and violates G3.
    • Mitigating: That's only an error opportunity for manually written sexps, which generally means forms. But in that case the forms are evalled, so bare symbols will be looked up. This usually gives the same object, and when it doesn't, there really were 2 distinct objects by that name so there's no getting around needing 2 distinct representations.

      Further, the easier form to write (bare symbol) and the more likely meaning (local binding of that symbol) coincide.

That's a lot of cons, but ISTM each one is mitigated fairly well.

I'd say this strategy is a lot like paths (mymod:foo) and should be generalized in that direction. So:

  • the "indication of symbol" would be a sort of path.
  • the "indication of environment" would be contained in something like a $let, except purely syntactic and allowing only environments.

So as a first sketch, it might look something like:

   ((mod-1 (id 34fe93ab5539d9bc8))
      (mod-2 (kernel devel)))
   ;;The object bound to `foo' in `mod-1'
   #object (mod-1 foo)
   ;;Object bound to `bar' in (kernel devel testhelp mocks filebuf)
   #object (mod-2 testhelp mocks filebuf bar)

I'm certainly not committed to this syntax, but it gives the idea. I don't like the prefixuality of the sharps.

That gives us this overall strategy for printing objects:

  • For literals - Unvarying print logic that does only superficial inspection of data
    • I include symbols as honorary literals because when read and written, they have no relation to an environment.
  • For containers (at least for transparent ones) - print logic that includes printing their contents, read logic includes calling a constructor that reconstructs them from their contents. The results of write-then-read should satisfy equal?
  • For non-containers or opaque containers: The method described above.
  • Last resort method: "#<opaque-type ID>" style.

What do you think?

Tom Breton (Tehom)

17 November 2010



I've been thinking about library paths in relation to Klink, the Kernel implementation I've been writing.

Some lessons from the past

  • Being familiar with Elisp's horrid treatment of library naming, I'd like to do library-finding right, and not hobble for years with a suboptimal system.
  • Scheme's use of lists as library specs is the right way to do it.
  • SRFI-103 suffered from a lack of encoding for identifiers that don't map cleanly to file names. Avoid that.
  • SRFI-103's attempt to escape some identifiers was overly special, resulting in a non-trivial knowledge burden. Don't do that; use something well-known.
  • It is really nice to allow a "main" file in order to group a file with its supporting packages.
    • From SRFI-103, though its "main^" syntax was awkward.
    • From my experience creating Elisp packages, especially Emtest.
  • Want to support at least these parameters:
    • An ordered list of directories to search.
    • An ordered list of file-name extensions
      • From SRFI-103 "to distinguish dialects, formats, and extensions"
  • Want to at least optionally choose among available versions of a library.

    I don't have to support that in the near future, but it should at least be specified.

  • It's nice to allow sub-features in "require" type statements
    • From emacs >= 22 `require' statements, and from Redhat rpm.
    • This clashes with the previous lesson.
  • Traditional major.minor.release version numbers have some problems, especially wrt project branching.
    • Git among others.
    • This clashes yet further with the previous two.
  • I'd really really prefer versioning that doesn't push the version-update treadmill to go any faster than it has to.
    • Too many times on the version-update treadmill.
  • Reading the first syntactic datum is a wart on the procedure, and I may be able to solve it.
  • It is nice and convenient to have a way to abbreviate file locations, something more transparent than calling expand-file-name everywhere.
    • From SWI Prolog and from emacs' default-directory
  • Init files should be ordinary source files, implying that they should fit into this library scheme. They come in more flavors than you would expect.
    • Learned from many places, especially TinyScheme and emacs.
  • It's often convenient to be able to associate "support" files with a source file. Their extension should differ so that you can tell them apart.
    • Learned from Mercury and from C/C++
  • The normal source file extension should be unique. So use resources like Filext to check it.
    • From Prolog and Perl's clash over *.pl (Prolog was there first)
    • "KRN" is taken. Could take it anyways since it's unrelated, a font for a video game
    • "KRL" seems available.
  • Sometimes one would like to define multiple submodules within the same file.
    • Learned from Mercury and from an experiment I did in Prolog.

Answer 1: Just borrow URL encoding.

It's very simple and it's in widespread use.

Now, what to do about clashes? Eg, what if files "too%6Dany" and "toomany" both exist?

Several answers present themselves:

  • Prefer them in alphabetical order?
  • Prefer them in almost alphabetical order, but sort the escape sequence to before/after other files?
  • Check for the existence of clashing filenames and raise error.
  • Don't allow clashes. Enforce a one-to-one mapping of strings to filename encodings.

I like the last answer best. Encodings that are not mandatory are forbidden. In other words, for any file name it must be the case that:

(encode (decode filename))

gives the original filename again.
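With URL (percent) encoding, that rule is easy to check mechanically. A sketch of mine, using Python's standard percent-encoding functions (the safe-character set here is an assumption):

```python
from urllib.parse import quote, unquote

def encode(name):
    # Percent-encode everything that isn't an unreserved character.
    return quote(name, safe="")

def canonical(filename):
    """A filename is legal iff encode(decode(filename)) == filename,
    i.e. it contains no optional escape sequences."""
    return encode(unquote(filename)) == filename

print(canonical("toomany"))    # True
print(canonical("too%6Dany"))  # False: %6D is an optional encoding of "m"
```

Under this rule the clash in the example above cannot arise: "too%6Dany" is simply not a legal file name, because "m" never needs escaping.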

Answer 2: Encode "main" with a similar escape sequence.

That is, some escape sequence that would otherwise be illegal means "main". Let's use "%main" for that.

Answer 3: A file is a list of sexps

This will probably be the most controversial answer. Instead of reading the first syntactic datum in the source file, let the whole source file be a list. That is, when you start to load a file, pretend you just read an open bracket, and the matching close bracket is EOF. Each sexp in between is one element in the list.

That even handles the case where the file is just a script: Put Kernel's `$sequence'1 as the first symbol in the file.
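The "pretend you just read an open bracket" idea can be sketched with a toy reader. This is my illustration (a minimal ()-only sexp parser, not Klink code):

```python
def read_file_as_list(text):
    """Parse TEXT as if wrapped in one pair of brackets: the result is
    the list of all top-level sexps, and EOF acts as the close bracket."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    def parse(seq, i):
        items = []
        while i < len(seq) and seq[i] != ")":
            if seq[i] == "(":
                sub, i = parse(seq, i + 1)
                items.append(sub)
            else:
                items.append(seq[i])
                i += 1
        return items, i + 1   # step past the ")" -- or past EOF
    sexps, _ = parse(tokens, 0)
    return sexps

# An ordinary library file: the whole file is a list of definitions.
print(read_file_as_list("(define x 1) (define y 2)"))
# A "script" file: $sequence is the first symbol in the file.
print(read_file_as_list("($sequence (display 1) (display 2))"))
```

Loading then just interprets that one list, whatever its head turns out to be.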

Tentative answer 4: Library namespaces for special purpose

These would include at least:

  • Standard preludes for an implementation
    • Say the implementation is called "acme", the prefix would be:
      (implementation acme) 
  • System rc files for various purposes
  • Personal rc files for various purposes

Tentative answer 5: Resolving the treatment of versioning.

Recapping, the things I wanted were:

  1. Want to be able to choose among available versions
  2. Use sub-features for version requirement.
  3. Version IDs need not be major.minor.release
  4. Don't make the version-update treadmill go faster than it has to.

If we were to use traditional major.minor.release version numbers, that would satisfy #1 but scuttle #2 and #3.

My tentative answer is to eschew the false economy of not reading a file before you need it. Our code should just partially evaluate the file and see what it places in a "subfeatures" area2. If that's ambiguous, then we have a problem - but we always would have in that case.

If efficiency is required, do it behind the scenes. Possibly pre-compile data for individual files or even to index the entirety of code available on a given system.

For #4, I have a tentative scheme involving automated tests and code coverage tools, but that's for another time.

Answer 6:

If a library path resolves partway thru to a file rather than a directory, look in that file for submodules. IIRC this is a direct clone of how Mercury does it.


1 equivalent to Lisp `progn' or Scheme `begin'.

2 Kernel is basically re-inventing these things with first-class environments. So very likely subfeatures will be an environment accessed by key from the main environment that the library provides.

05 November 2010




Klink (as Tinyscheme) has two representations of its stack:

  • As a list.
    • Pro: It allows reusable continuations
    • Con: It's very inefficient.
  • As a resizable array
    • Pro: It's much more efficient
    • Con: It does not allow reusable continuations.

Neither choice is very appealing.

The answer

Thinking about the problem, I invented what I call a copy-on-write stack. It's really a set of arbitrarily many stacks. The multiple stacks can share their deeper elements, while cheaply pushing and popping their shallow elements.

I'm not actually going to add COW-stacks to Klink any time soon, but I thought about how they would work and I wrote it down. Stray references below to Klink and continuations are due to the context in which I was thinking about COW-stacks.

How: COW stacks

Entities, briefly:

Stack element
Linear section (of stack elements)
Slider
The part that's visible from outside; the stack as a container.

Operations, briefly:

  • push
  • pop
  • capture a continuation
  • finalize a continuation

Overall idea

We hold a tree of linear sections of stack elements. The sliders, representing states of execution, can move along the tree, and can be cloned for continuation capture. This is to work as if each continuation capture had copied the entire stack.

Desired properties
  • The most common operations go very fast. In particular, when there's only one slider, this should do almost nothing that an ordinary stack would not do
  • Allow arbitrarily many partly-shared stacks.
  • Minimize copying the stack.
  • Each stack works as if the entire stack had been copied
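Before the full design, the core sharing idea can be shown in miniature. This sketch of mine uses shared immutable tails (the list representation the post calls inefficient) rather than the array-backed linear sections described below, so it shows only the semantics: cheap capture, shared deeper elements.

```python
class Stack:
    """Each Stack plays the role of one slider: push and pop are O(1),
    and capturing a continuation just copies the head reference --
    deeper elements are shared, never copied."""
    def __init__(self, cell=None):
        self.cell = cell              # (element, parent-cell) or None

    def push(self, elt):
        self.cell = (elt, self.cell)

    def pop(self):
        elt, self.cell = self.cell
        return elt

    def capture(self):
        return Stack(self.cell)       # the whole "copy of the stack"

s = Stack()
s.push("a"); s.push("b")
k = s.capture()          # capture a continuation
s.pop(); s.push("c")     # the original keeps running
print(s.pop(), k.pop())  # → c b
```

The linear-section machinery below exists to get back the efficiency of an array while preserving exactly this as-if-copied behavior.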

Entities in detail

Stack element
  • Essentially as a dump stack element now
  • Parts:
    • Next instruction (ie, to be continued to)
    • Arguments
    • Possibly data
      • Rethinking data function types: "data" may just be a special argument that it passes to itself, that our special data-function type gets from its car. Do we ever need to actually store that?
    • List of dynamic bindings
      • Could cache the identity of the next SE, if it never moves in memory.
Linear sections
  • A linear section of the tree. Generally associated to one or more sliders. It keeps a list of these sliders roughly in order from shallowest to deepest. Ie, highest index to lowest.
  • Parts:
    • Reference to its parent linear section
      • For the root, this is NULL
    • Index into its parent linear section
    • A growable array of stack elements
    • Length of that array
    • Pointer to a slider, understood as the head of a list.
      • Can be NULL (for an empty list) if this linear section has become unoccupied.
Slider
  • The object representing a state of execution.
  • Owners:
    • Captured continuations
    • The interpreter
  • Parts:
    • Reference to PO
    • Index into the PO's array of stack elements.
    • Reference to the "next" slider, understood to be used as a cell in a list.
      • For the last element, this is NULL
      • Would be nice for this to be a weak reference, since it shouldn't keep sliders alive.
    • States:
      surely-shallowest
        This element is the head of the list.
      maybe-shallowest
        This element, while not the head of the list, may well be equally as shallow. It's not guaranteed to be, since we mostly don't conform marks when we move other sliders.
      deeper-than-this
        This element has been moved deeper than its "proper" position in the list. It can remain out of position indefinitely.
      dead
        This node represents an unreferenced slider and can safely be removed at any time. That's just if we choose strategy (2) for removing sliders.
      normal
        This node is none of the above.

Two strategies for removing a slider

  • Traverse the list and when we find it, perform list surgery.
  • Mark it dead and when it's encountered, discard it.

I have no strong opinion on which is better.

Operations in detail

  • pop
    • These are the possible situations:
      • Decrement within linear section
        • When: index > 0
        • Action:
          • Decrement the index.
          • If state is:
            surely-shallowest
              Make new list of sliders, keeping this slider on list.
            (other states)
              Set it deeper-than-this.
            dead
              Shouldn't happen.
      • Decrement out of linear section to previous LS.
        • When: index == 0 and LS has a parent
        • If state is:
          surely-shallowest
            Make new list of sliders, removing this slider from the list.
          (other states)
            Do nothing special for this step.
          dead
            Shouldn't happen.
        • Remove it from the list
          • Clone a new slider and mark this one dead?
          • Extract it from the list?
        • Add it to the parent LS
          • Index is the index that the child LS had into its parent.
          • State: maybe-shallowest
      • Decrement out of root linear section
        • When: index == 0 and LS has no parent
        • Stack is empty. Interpreter is done.
  • push
    • These are the possible situations:
      • State is surely-shallowest
        • Remarks: This state does nothing special
      • State is maybe-shallowest
        • Compare to the actual head's index. If this slider is:
          • As deep (or deeper):
            • Mark that one maybe-shallowest
            • Remove ours from the list.
            • Mark ours (or its clone) surely-shallowest
            • Place ours (or its clone) at the beginning of the list.
            • Continue as surely-shallowest.
          • Not as deep
            • (We needn't mark ours normal because we're about to remove it)
            • Proceed as for normal.
      • State is normal or deeper-than-this
        • Make a new LS
          • Its parent is the old LS
          • Its index into parent is our old index
        • Remove this slider from the list.
    • In each case, having made sure we are safe, we add another SE:
      • If the array isn't large enough, enlarge it.
      • Write an element to PO at our index
      • Increment our index by 1
  • capture a continuation
    • Make a new slider. Insert it behind the old slider, same index.
      • Index: Same as old one
      • State
        • If old slider is surely-shallowest, make the new one maybe-shallowest.
        • Otherwise give the new one the same state as the old one.
  • finalize a continuation
    • Remove its slider from the list
  • Initialization
    • Make one linear section
      • Initialize its array of SEs.
    • Make one slider, surely-shallowest, index 0, pointing at that linear section.

Supporting operations

Conform sliders
Conform a first element that moved or was removed.
  • Remarks
    • This was just intended for when the first element moves. Other elements mostly stay in place, and are only reconsidered if first element happens to see them while it's rebuilding a list.
    • We'll examine the next elements, collecting a new list that we'll give to the LS. We'll also conform the portion of the list that we see, and we'll make sure that that portion extends as far as the next-shallowest index.
  • Action:
    • (Simple case, for comparison) If there's no next element, we're done. New list is this element or nothing.
    • If an element is this slider itself, ignore it. We'll insert it when we have surely decided its position.
    • If there is one and its state is:
      surely-shallowest
        This shouldn't ever happen.
      dead
        Ignore this slider. It won't be part of the new list.
      (other states)
        We'll put it in rough order in the collected list.

      Possibilities of its index relative to our slider:

      • Its index is shallower (presumably only by 1)
        • Set our slider's state normal since we can rule out the other states.
        • Do we know a definite head yet?
          • No:
            • This is now the definite head.
            • Mark it surely-shallowest
            • Record min-index
            • Record reference to its cdr, the tail cdr.
          • Yes:
            • Leave it in its place in the list
            • If its index is:
              • Same as min-index
                • Mark it maybe-shallowest
              • Larger than min-index
                • Mark it normal
              • Smaller than min-index
                • This shouldn't ever happen
            • Record reference to its cdr, the new tail cdr.
      • Its index is the same as our slider's.
        • Remarks: We may or may not want to also mark other nodes, since they may tie for shallowest.
        • Construct an intermediate list consisting of:
          • Our slider, if we're leaving it on the list
          • The new slider
          • Append the deeper-than-this list, if any.
          • Append the new slider's tail.
        • Remark: If there's no deeper-than-this list, this construction amounts to consing our slider to the new slider.
        • Do we know a definite head yet?
          • No:
            • Mark our slider surely-shallowest
            • Mark the other slider maybe-shallowest
            • Return the intermediate list
          • Yes:
            • Add it as the tail of the collected list
            • Return the collected list
      • Its index is larger than our slider's:
        • If other node is deeper-than-this:
          • Queue it to a deeper-than-this list
            • We'll add that list when we construct an intermediate list just before returning.
            • It's a queue so nodes stay in order.
          • Continue (ie, examine next node)
          • Remarks:
            • There's no point trying to re-mark it.
              • It won't match min-index
              • It won't match our slider's index.
        • Otherwise:
          • Same as for same index.
    • If there's no next element (the general case this time),
      • Construct an intermediate list consisting of:
        • Our slider, if we're leaving it on the list
        • Append the deeper-than-this list, if any.
        • No tail
      • Treat it as in maybe-shallowest, same index.

30 October 2010

Kernel can't have and-let

Kernel is strict about boolean types

Kernel is the Scheme variant by John Shutt that I've been blogging about.

One thing about Kernel is that it does not let non-booleans be used as tests in conditionals. Ie, it doesn't use what other Lisps call the extended boolean - every object other than false is deemed true.

In a sense, I agree with this. There is virtue in keeping the types straight. But OTOH, it blocks some Scheme conveniences, which then need to be reinvented.

So it can't use #f as an out-of-band object

A common Scheme (and Lisp) idiom is to use the false object as an out-of-band object. For instance, a search that finds nothing returns #f.

This is made easy by the latent typing that Scheme has. The #f object can go anywhere unless you explicitly check for type, and still it is easily checked for. But that's never been a perfect plan. There's always the question of what you do when the search (or w/e) gives the #f object itself. Sometimes there's also the question of how to distinguish different non-fatal failures.

But in Kernel that's not available, because #f and #t aren't supposed to mix with other objects, lest they cause $if to exhibit hard-to-find bugs.

Kernel does provide a special object called #inert for "no information" returns. John discusses this in the rationale for assoc, contemplating making assoc return a pair of a boolean and either the answer or #inert. For instance, a successful search could return:

(#t . The-Result)

and a failed search

(#f . #inert)

These are easily treated because Kernel makes multiple binding available everywhere.

But then he goes on to argue that under the hood, assoc will have to deal with a result object that is () on failure, and that the user can easily re-represent it as bool+result if desired. He therefore concludes that assoc should just return either () or the found object.

I'm not entirely sure I agree. It seems like this situation is particular to assoc (and assq and a few others), not generally available. Won't other search functions still have to return bool+result? One could argue that it's more important to make assoc uniform with other searchers, and the nil-on-failure version could be provided also.

Also, it has always struck me that assoc was often more difficult than it needed to be, because instead of returning the value that's usually sought, which is the cdr of the cell, it always returned the whole cell. It does this for two reasons:

  • Sometimes one wants the whole cell.
  • It needed to have a means of indicating failure. If it returned the cdr, that would sometimes be (), the same as the failure result.

Kernel's multiple bindings almost solves this problem. It solves it in the success case. But if assoc's failure value is (), multiple bindings won't help. One can't bind the result's cdr, because () doesn't have one. So there's another argument for assoc returning bool+result.
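The contrast between the two conventions can be shown with tuples standing in for Kernel's multiple-value bindings. A sketch of mine, with Python's `None` standing in for #inert:

```python
def assoc_bool(key, alist):
    """bool+result convention: return (found?, value) -- and return the
    cdr, the value usually sought, not the whole cell."""
    for k, v in alist:
        if k == key:
            return (True, v)
    return (False, None)            # None standing in for #inert

pairs = [("a", 1), ("b", ())]       # note: a stored *value* can be ()
print(assoc_bool("b", pairs))       # → (True, ())  -- unambiguous
print(assoc_bool("c", pairs))       # → (False, None)
```

With nil-on-failure, returning the cdr would make a stored () value indistinguishable from "not found"; the boolean removes that collision, which is exactly the argument above.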

So no and-let

One piece of convenience that this denies is and-let*. And-let* is a sort of hybrid of and and let*. It binds symbols successively to values, like let* does, but if any of those values is #f it terminates, like and does. and-let* makes it very convenient to alternate between getting values and testing that those values are suitable. I find that this situation comes up a lot.

But since one can't (or shouldn't) mix booleans and non-booleans, there won't be many #f results to terminate it. They could only come from functions that return booleans, and binding their results makes little sense, since if and-let* succeeds the values can only be #t.

What to replace and-let* with

IMO the best replacement for `and-let*' is a pattern-matching `let*'. I wrote something like that as part of "match.el" in Emtest. It would go beyond what Kernel's multiple bindings can do and be a true pattern language. Constructor functions such as cons and list would have matching destructurers. It'd be something like:

   (pattern-let*  ;;`pattern-let*' is a placeholder; the opening line was lost
      (;;This clause has the same interpretation as in let
      (a value1)
      ;;Same interpretation as in Kernel's let but not Scheme's
      ((b c) value2)
      ;;Like ((d e) value3) in Kernel, because ~list~'s matcher
      ;;would accord with ~list~.
      (,(list d e) value3)
      ;;Call a matcher associated with the constructor ~foo~ to
      ;;destructure value4 and bind its parts to f and g
      (,(foo f g) value4)
      ;;Result is not bound but must be 12
      (12 value5)
      ;;Result is not bound but must be #t
      (#t value6))
   ;;In the body, a thru g are all bound.
   (do-stuff a b c d e f g))

This is clearly quite powerful. It probably belongs in a pattern-matching module.

No type signatures

The non-mixing of booleans and non-booleans suggests to me that there should be some sort of a type signature in Kernel, at least to distinguish booleans from everything else. But there isn't.

The name #inert

Kernel also has a special object called #ignore. ISTM #ignore and #inert are too similar in name and role, and would be easily confused. So I have some suggestions for renaming #inert:

  • #n/a
  • #not-applicable
  • #not-available
  • #no-info

Kernel: rationale for not listing bindings

Understanding Kernel's rationale for not listing bindings

As I blogged, I have been reading about Kernel, a better Scheme.

One thing Kernel will not do is to list all of an environment's bindings. At first glance this seemed wrong, because the first rationale that I read for it seemed to contradict itself:

From (§6.7.1)

Presenting [$binds] as a library feature drives home the point that it doesn't introduce any capability that wasn't already provided by the language. In particular, for purposes of type encapsulation (§3.4), there is still no way for a Kernel program to generate a complete list of the variables exhibited in an arbitrary environment. Predicate $binds?, or the techniques used to derive it, could be used to formally enumerate such a list, by sequentially generating all possible variables (an infinite sequence) and testing each one; but there would be no way to know when all variables in the environment had been enumerated. This nontermination is critical because it means that no Kernel program can prove in finite time that two non-eq? environments are equivalent up to mutation; therefore, the rules governing predicate equal? (§4.3.1) do not require it to return false in any cases where predicate eq? returns true.

I read it thru 3 times, and he seems to have accidentally flipped polarity in his argument: He wants to avoid a situation where (called on two objects) eq? would give false and equal? true. But the thing that he fears will give rise to it is being able to completely compare non-eq? environments.

When I read this I got the impression that his rationale was mistaken. But then when I read the paper again, I saw a better rationale elsewhere.
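For reference, $binds? (as derived in §6.7.1) answers one query at a time, which is exactly why the enumeration he describes can never be shown complete:

```scheme
;; $binds? evaluates its first operand to an environment; the
;; remaining operands are unevaluated symbols to test.
($binds? (get-current-environment) foo)  ; #t iff foo is bound here
```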

Better rationale

When introducing environments, he gives a different rationale for linking equal? and eq? on environments:


The behavior of equal? is tied to that of eq? to forestall the possibility of an implementation compromising the encapsulation of the type by allowing a program to determine, in finite time, that all bindings for one environment are the same as those for another. (Cf. the rationale discussion for the derivation of library predicate $binds?, §6.7.1.)

And (§4.9) reinforces it:

There is no need for this module to assume the Equivalence under mutation module, because environments are eq? iff they are equal?.

So the discussion in §6.7.1 was misleading. Now it seems that:

  • He fears that listing bindings will compromise the encapsulation of the environment type.
  • So he wants equal? and eq?, called on two environments, to always give the same answer.

Why does he want the environment type to be encapsulated in this way? Apparently to prevent mutation of parent environments. Also from (§4.8):

First-class environments offer a tremendous amount of control over what can be accessed from where - but only if there are limitations carefully placed on what can be done without explicit permission. In particular, whenever a combiner is called, it has the opportunity, in principle, to mutate the dynamic environment in which it was called. This power is balanced by omitting any general facility for determining the parents of a given environment, and also omitting any other general facility for mutating the parents, or other ancestors, of a given environment. (For an example of articulate environment-access control, see $provide!, §6.8.2.)

But what's the threat?

This unfortunately does not make it clear what the threat is. I just can't see any exploit that could use `list-bindings' to do something bad.

All I can guess is that it's that misbehaving code could get a list of all current bindings and comprehensively rebind all of them and thus fake the original environment, parents and all. But this is not the same thing as mutating that parent environment. And since one can sometimes correctly guess all the bindings, it is not clear how one can expect this to always work.

Why I suspect this prohibition isn't needed

Here's a positive argument that this prohibition is wrong: Containers such as alists can hold the same data that environments do: A set of bindings. Yet there is no problem being able to list all the elements of an alist or to compare alists `equal?' (and Shutt doesn't say there is).
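For instance, in ordinary Scheme nothing stops a program from comparing two alists structurally:

```scheme
;; Two non-eq? alists holding the same bindings compare equal?,
;; and nobody considers this a breach of encapsulation.
(equal? '((x . 1) (y . 2))
        '((x . 1) (y . 2)))  ; => #t
```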


I'm writing a Kernel interpreter

Kernel is the Scheme variant by John Shutt that I've been blogging about.

For the past two and a half days, I've been adapting the Tinyscheme interpreter to run Kernel. Right now it doesn't compile. Some of the choices I'm making:

New eval functionality

The first thing I did was to write a new eval loop and support functions to make and evaluate vaus, which are basically Kernel's version of lambdas. Like in Tinyscheme, it's separated into parts, and in between the parts the interpreter does other stuff such as nested evals.

C functions instead of new opcodes

Rather than grossly extend the OP_ list, I've opted to add the capability for the interpreter to call a C function instead.

This is distinct from the old foreign function type. All these functions have the signature:

typedef pointer (*kernel_func)(scheme * sc, pointer args, pointer env);

They are suitable for use as operatives, so a lot of the base Kernel functionality can directly be C functions.

Initially I coded some of this as new opcodes, because that's the way TS has always done things, but I was never too thrilled with that approach.

To support this, I:

  • Changed the interpreter to include a field for the next function to call.
  • Changed the stack frame code to save and restore that field.
  • Added one opcode, OPKERNELCALL, which calls that function (or complains if it's NULL).

So right now both opcodes and C functions are supported and opcodes are primary.

New code for registering foreign functions

Since I have the new supported function type, I created a way of registering those functions. That is, knowing their C name and Kernel name (as a string), add them to a Kernel environment.

I wrote that facility for Tinyscheme in the first place, so it was pretty easily adapted. I corrected one thing I wish I'd done earlier and allowed this registering to apply to an arbitrary environment. Before, it always applied to the global environment (what Kernel calls the ground environment).

Added pseudo-object fields to interpreter

Tinyscheme creates some gotta-be-present objects as fields in the interpreter, such as #t, #f, and nil. Kernel has some new objects that I made similarly: #ignore and #inert.

One thing I did differently: For some obscure historical reason, TS has both objects and pointers for those fields. It seems pointless; the pointer must always point at the same object, so one could just take the object's address. So for the new ones, I left out the pointer field.


Indentation style

Tinyscheme code really hasn't had a consistent indentation style, etc. On request, I had added a style marking to the file local variables, which was "k&r" and seemed to nearly match the original style. But I like GNU style better, so I switched the file local variable to that.

28 October 2010

Kernel: A better Scheme


I have been reading Revised-1 Report on the Kernel Programming Language 1 about Kernel, a variant Scheme defined by John Shutt.

I found it an interesting read:

  • I like his language ideas.
  • I like his mechanisms for them.
  • He scrupulously explains his motivations for each choice.

About Kernel

There is little else out there about Kernel. There's this paper, and an implementation in Scheme called SINK. Also a Wikipedia stub and some commentary on Lambda: The Ultimate. That's about it.

FWIW, the name Kernel is very hard to search for. Googling "Kernel Scheme" generated mostly false hits. I'd suggest a change of name.

Everything is first-class / fexprs done right

Kernel's primary claim to goodness is that it makes everything a first-class object, even the things that Scheme wants to treat specially such as `define', `lambda', `and', etc. It does this by having its evaluator distinguish between "applicatives" which are essentially Scheme functions, and "operatives", which are somewhat like macros except2 that they take an explicit environment argument.
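Two small examples from this vocabulary (the report itself derives list this way):

```scheme
;; An operative gets its operands unevaluated, plus the dynamic
;; environment; $quote ignores the environment and returns its
;; operand as-is.
($define! $quote ($vau (x) #ignore x))

;; wrap turns an operative into an applicative, so list is just a
;; wrapped identity-like operative.
($define! list (wrap ($vau x #ignore x)))
```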


Continuations

Kernel approaches continuations a little differently than Scheme. Unlike the rest of the paper, continuations were presented in a difficult way3, and it took me some careful reading to make sense of what he was saying.

Kernel has these primitives about continuations:

  • call/cc: the familiar Scheme construct.
  • extend-continuation: The purpose of this wasn't clear to me. It's not merely building a lambda form that will call the continuation. It seems to have to do with building trees of continuations, which guard-dynamic-extent treats specially.
  • guard-dynamic-extent: a more powerful, safer form of dynamic-wind. Each guard knows a continuation and only intercepts children of that continuation.
  • continuation->applicative: makes a continuation into essentially a function. Apparently in Kernel continuations are not simply applied but must be converted first.

It would have been nice to see some examples for how these constructs would be used.
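My guess at minimal usage, since the paper gives none (take this as a sketch, not as the paper's own idiom):

```scheme
;; Applying a continuation requires converting it first.  This
;; should escape from the call/cc with 41 rather than computing 42.
(call/cc
   ($lambda (c)
      (+ 1 ((continuation->applicative c) 41))))
```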

These extensions seem intended largely for handling exceptions, but he doesn't specify an error format. (I like a fat error format myself, but that's a topic for another time.)

Multiple-value bindings

Kernel has multiple-value bindings, and they are available everywhere, not just in a special construct (as in Common Lisp). So you can write something like:

(let (((a b c) (list 1 2 3)))
   (list c b a))   ; => (3 2 1)

And you can bind not only lists but trees:

(let (((a (b c)) (list 1 (list 2 3))))
   (list c b a))   ; => (3 2 1)

I agree with this feature. I keep wanting this in Lisp and Scheme. I've built it with macros more than once, but it's nice to have it properly in the language.

But ISTM full pattern-matching is a fairly short jump away from this. As the writer of Emtest, I find pattern matching nearly indispensable.

Possible issues with full pattern-matching:

  • "What happens when a variable occurs twice in the pattern?"

    The occurrences must be eq?4 or the pattern fails to match.

  • "What if the pattern doesn't match?"

    Nothing new. It's essentially the same situation as when a list is too long or too short.

  • "So how do you match objects that are not lists?"

    Answer: Every constructor-like combiner5 should be able to have an associated pattern-matcher.

    NB, for encapsulations created by make-encapsulation-type (8.1.1), e should not be considered constructor-like. If we were to try to use e with a symbol argument as a matcher, there would be no correct return.

    • Always returning false would be wrong because the ctor might be instantiated with the same object, which should match.
    • Returning true would let the contents as returned by d bind to the symbol, which would break encapsulation and incorrectly make the returns of different calls to e eq?.
  • "If you have pattern-matchers, you need to indicate them, presumably by symbols. How do you distinguish those symbols from symbols to be bound?"

    I suggest a quasiquote-style syntax, where low-quote indicates a non-literal sexp.

  • "What about unexpected problems?"

    Having implemented more or less what I described for Emtest, I didn't find this especially problematic. However, I did not implement the quasiquote syntax because of how Elisp works.

  • "Is it really needed?"

    Maybe not in the core, but then again, maybe. ISTM one should be able to define functions that have patterns as argument lists.
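Putting the answers above together, a hypothetical binding form with quasiquote-style patterns might read like this (pattern-let* is an illustrative name, not a real form):

```scheme
;; Low-quote marks a matcher pattern; bare symbols bind.
(pattern-let* ((,(cons a b) (cons 1 2))   ;destructure a pair
               (,(list x x) (list 5 5)))  ;repeated var: parts must be eq?
   (list a b x))
```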

More to come

I will probably post about Kernel again when I have more thoroughly digested the paper.

[fn:4] (Footnote from nowhere, I just thought this was interesting) I was surprised to find that John has, among his weblinks, a link to my old conlang Allnoun. Apparently he is interested in conlanging. Maybe conlanging and programming language conlanging go together.


1 I suppose we should abbreviate it R-1RK. So should the next report be R0RK or just RK?

2 There's much more to be said, but this is the main thing. Read the paper (ps or pdf) to get the full story.

3 For instance, he refers to `$let/cc' and `apply-continuation' before he defines them, and the descriptions of extend-continuation and guard-dynamic-extent required two readings.

4 In Emtest's "match.el" I used `eql', but Kernel behaves differently than Elisp and has no `eql', so `eq?' is correct.

5 "Combiner" is Kernel's term for what you may think of as a function's actual workings. It differs from a function in that it does not cause its arguments to be evaluated. Applicatives (which you may think of as "normal functions") can be generated from combiners with "wrap".