Wednesday, October 31, 2007

Javascript Packrat Parser

I've made some changes to my parser combinator library written in Javascript that I wrote about previously.

I was extending the example code from that post to parse more of the ECMAScript 3 grammar to test it out and was hitting performance issues. I ended up modifying the combinator library to be a Packrat style parser. The Parsing Expression Grammar Wikipedia page has a description of PEGs and what a Packrat parser does. Basically results of each parse step are memoized to prevent reparsing after backtracing, sacrificing memory for speed:
Any parsing expression grammar can be converted directly into a recursive descent parser[citation needed]. Due to the unlimited lookahead capability that the grammar formalism provides, however, the resulting parser could exhibit exponential time performance in the worst case.

By memoizing the results of intermediate parsing steps and ensuring that each parsing function is only invoked at most once at a given input position, however, it is possible to convert any parsing expression grammar into a packrat parser, which always runs in linear time at the cost of substantially greater storage space requirements.

A packrat parser[1] is a form of parser similar to a recursive descent parser in construction, except that during the parsing process it memoizes the intermediate results of all invocations of the mutually recursive parsing functions. Because of this memoization, a packrat parser has the ability to parse many context-free grammars and any parsing expression grammar (including some that do not represent context-free languages) in linear time.
I changed a few other things in the library to more closely map to the vocabulary of PEGs. 'alternative' is now called 'choice' for example. There are still quite a few loose ends to tidy up and documentation of course.

The updated library can be retrieved from my git repository:
git clone git://double.co.nz/git/jsparse.git
It can run in a browser but I've mostly been testing it using Mozilla Rhino.

I've included three basic examples. They all operate on the example expression grammar from the wikipedia article:
Value   := [0-9]+ / '(' Expr ')'
Product := Value (('*' / '/') Value)*
Sum := Product (('+' / '-') Product)*
Expr := Sum
The first, example1.js, is a direct translation of that grammer. It produces a pretty ugly default Abstract Syntax Tree however:
var Value = choice(repeat1(range('0','9')), Expr);
var Product = sequence(Value, repeat0(sequence(choice('*', '/'), Value)));
var Sum = sequence(Product, repeat0(sequence(choice('+', '-'), Product)));
var Expr = Sum;
The Expr parser can be called by passing it a string to be parsed wrapped in a 'ParserState' object. This object is used to keep track of the current parse position and the memoized results. A helper function, 'ps', can be used to construct it:
var result = Expr(ps("1+2*3"));
The second example, example2.js adds to this to produce a better AST. It also uses the 'chainl' parser combinator to handle grouping correctly. A quick online page demonstrating this example is here. Enter an expression matching the grammar (there is no error checking yet), and press the button to see the AST in JSON format.

The third example, example3.js, evaluates the expression as it parses instead of generating an AST. This is also available online to try.

I've also included in the repository the work in progress of the ECMAScript 3 grammar. It is not complete or correct yet but I use it for testing the library.

Based on what I've learnt from doing this I plan to revisit the way I did Parser Combinators in Factor.

Categories:

Labels:

10 Comments:

Blogger Roberto Saccon said...

Chris, are going to make the combinator parser fully Ecmascript3 compliant ?

1:08 AM  
Blogger Chris Double said...

I plan to get it to at least a reasonable level of compliance as part of working on the combinator library. I don't have a timeframe though. It's a bit of a 'backburner' project for me at the moment.

12:27 PM  
Blogger Alex Grigorovich said...

> var Expr = function(state) { return Expr(state); }

Neat, thanks!

8:30 PM  
Blogger Chris Double said...

Thanks, it was the best way I could think of to work around the mutually recursive parser problem.

Another approach would have been to make parsers referenced by a string, and looked up from a table of some sort:

registerParser("Expr", ...)
parse("Expr", "1+2*3");

I'm looking at good ways to generate the AST now. Currently 'action' is the way I do it:

action(sequence(Number,"+",Number), function(ast) { return ..newast... });

But inside the function it's hard to destructure the default ast. I'm thinking of doing something like:

bound(sequence(bind(Num,"x"),"+",bind(Num,"y")), function() { return { op: "+", lhs: this.x, rhs: this.y });

So you can assign var names to productions and reference them in the function to return the ast. It still has the verbosity of the 'function' syntax though.

I have a simple pattern matcher I could use for another approach. It lets me do stuff like:

match([ 1 2 [ 3 4 ] ], [ 1, 2, [ pat("a"), 4 ] ] )

which returns { a: 3 }

So it destructures and matches at the same time.

8:53 PM  
Anonymous Bart said...

Could you please make a snapshot available through other means than through Git? I currently have no hope of getting Git even to work on this platform... while I'm sure the code will most likely just work.

12:56 AM  
Blogger Chris Double said...

Sorry Bart, I didn't do a normal download since it's still in such a state of development. You can get a compressed tar file from:

jsparse.tar.gz

Hope that helps!

1:19 AM  
Blogger ToolmakerSteve said...

Chris, what remains to be a full ES3 parser? I am considering developing a PEG parser for Adobe's ActionScript 3, as part of a project to use that subset of ES4 outside of Flash [and be well prepared to switch to ES4/Javascript 2 down the road].
Considering your work as a possible starting point.

7:03 PM  
Blogger Chris Double said...

It just needs the grammar to be completed and tested. A lot of it is there but I only implemented bits I needed for testing.

10:36 AM  
Blogger Adam C. said...

If you use this in IE there's a bug in ParseState's at method, where it uses array syntax to access a character in a string. (The array syntax is a Mozilla extension, but not part of ECMAScript.) If you replace that with charAt, it works.

12:05 PM  
Blogger Chris Double said...

Thanks Adam, I'll fix that. I also have support for left recursion that I need to tidy up and commit as well.

1:50 PM  

Post a Comment

<< Home