Javascript Parser Combinators
Cleaning up my hard drive I came across some old libraries I'd written. One of them was a simple set of parser combinators written in Javascript. I put it in a git repository in case they prove useful to someone:
A parser in this library is a function that takes an input string and returns a result object. The result object contains three fields. They are:
'token' is a combinator that takes a string and returns a parser that will successfully parse an instance of that string:
'range' returns a parser that parses single characters within the range of the upper and lower character bounds (inclusive) given:
'negate' takes an existing single character parser, and returns once which parses anything but that which the original parser parsed. For example, 'negate(range("a", "z"))' will return a parser which parses anything except the letters from a to z inclusive:
'sequence' takes any number of parsers as arguments and returns a parser which suceeds if all the given parsers succeed in order. The AST it returns is an array of the results of each of the parsers.
'alternate' provides choice between parsers. It takes any number of parsers as arguments and will try each of them in order. The first one that succeeds results in a successful parse, and its result is the AST:
'repeat0' does the equivalent of '*' in regular expressions. It takes a parser and returns a parser which will parse zero or more occurrences of the original parser. The AST is an array containing the AST result of the original parser for each successful occurrence:
'repeat1' does the equivalent of '+' in regular expressions. It takes a parser and results one which will parse one or more occurences of the original parser:
'optional' takes a parser and returns one which matches exactly zero or one instances of the original. The AST result is 'false' for the case where there is no match or the result of the original parser if there is a match:
The AST produced by some of the generated parsers can be non-optimal. For example, a simple parser will produce an array of strings for each digit:
There are other combinators provided in the library but these basics do most of what is needed.
The 'example1.js' file shows a translation of some of the grammar productions in the EcmaScript grammar:
I'd like to extend the library a bit and provide more examples. Any comments or ideas would be appreciated.
Categories: javascript
git clone git://double.co.nz/git/jsparse.gitThe library pulls ideas from parser combinators written in various languages and is pretty simple. But even though it is a small amount of code it works quite well. The repository includes an example of parsing numbers and strings based on the grammar in the EcmaScript specification.
git clone http://double.co.nz/git/jsparse.git
A parser in this library is a function that takes an input string and returns a result object. The result object contains three fields. They are:
- remaining: the remaining part of the input string to be parsed
- matched: the part of the input string that was successfully parsed by the parser
- ast: The abstract syntax tree produced by the parsr
'token' is a combinator that takes a string and returns a parser that will successfully parse an instance of that string:
js> load("jsparse.js")
js> var p = token("begin")
js> uneval(p("begin ... end"))
({remaining:" ... end", matched:"begin", ast:"begin"})The AST produced by parser is the string that it parsed. 'range' returns a parser that parses single characters within the range of the upper and lower character bounds (inclusive) given:
js> var p = range("0", "9")
js> uneval(p("5"))
({remaining:"", matched:"5", ast:"5"})
js> uneval(p("a"))
false'negate' takes an existing single character parser, and returns once which parses anything but that which the original parser parsed. For example, 'negate(range("a", "z"))' will return a parser which parses anything except the letters from a to z inclusive:
js> var p = negate(range("a", "z"))
js> uneval(p("g"))
false
js> uneval(p("5"))
({remaining:"", matched:"5", ast:"5"})'sequence' takes any number of parsers as arguments and returns a parser which suceeds if all the given parsers succeed in order. The AST it returns is an array of the results of each of the parsers.
js> var p = sequence(token("a"), token("b"), token("c"))
js> uneval(p("abcdef"))
({remaining:"def", matched:"abc", ast:["a", "b", "c"]})
js> uneval(p("abdef"))
false'alternate' provides choice between parsers. It takes any number of parsers as arguments and will try each of them in order. The first one that succeeds results in a successful parse, and its result is the AST:
js> var p = alternate(token("a"), token("b"), token("c"))
js> uneval(p("a123"))
({remaining:"123", matched:"a", ast:"a"})
js> uneval(p("b123"))
({remaining:"123", matched:"b", ast:"b"})
js> uneval(p("c123"))
({remaining:"123", matched:"c", ast:"c"})
js> uneval(p("d123"))
false'repeat0' does the equivalent of '*' in regular expressions. It takes a parser and returns a parser which will parse zero or more occurrences of the original parser. The AST is an array containing the AST result of the original parser for each successful occurrence:
js> var p = repeat0(range("0", "9"))
js> uneval(p("12345abcd"))
({remaining:"abcd", matched:"12345", ast:["1", "2", "3", "4", "5"]})
js> uneval(p("123abcd"))
({remaining:"abcd", matched:"123", ast:["1", "2", "3"]})
js> uneval(p("abcd"))
({remaining:"abcd", matched:"", ast:[]})'repeat1' does the equivalent of '+' in regular expressions. It takes a parser and results one which will parse one or more occurences of the original parser:
js> var p = repeat1(range("0", "9"))
js> uneval(p("12345abcd"))
({remaining:"abcd", matched:"12345", ast:["1", "2", "3", "4", "5"]})
js> uneval(p("abcd"))
false'optional' takes a parser and returns one which matches exactly zero or one instances of the original. The AST result is 'false' for the case where there is no match or the result of the original parser if there is a match:
js> var p = sequence(optional(alternate("+", "-")), repeat1(range("0", "9")))
js> uneval(p("1234"))
({remaining:"", matched:"1234", ast:[false, ["1", "2", "3", "4"]]})
js> uneval(p("-1234"))
({remaining:"", matched:"-1234", ast:["-", ["1", "2", "3", "4"]]})
js> uneval(p("+1234"))
({remaining:"", matched:"+1234", ast:["+", ["1", "2", "3", "4"]]})
js> uneval(p("*1234"))
falseYou'll notice in this example that I pass "+" and "-" directly instead of token("+") and token("-"). The parsers in this library will automatically convert strings to parsers where needed to make for terser and more readable code.The AST produced by some of the generated parsers can be non-optimal. For example, a simple parser will produce an array of strings for each digit:
js> var p = repeat1(range("0", "9"))
js> p("123").ast
1,2,3
js> uneval(p("123").ast)
["1", "2", "3"]The 'action' combinator takes a parser and a function. The function is called with the result of the AST produced by the parser, and the result of the function becomes the new AST. For example, compare these two:js> var p = action(range("0", "9"), function(ast) { return parseInt(ast) })
js> uneval(p("1"))
({remaining:"", matched:"1", ast:1})
js> var p = range("0", "9")
js> uneval(p("1"))
({remaining:"", matched:"1", ast:"1"})In the first the result is an actual number, in the second it is a string. Any object can be returned and used for the AST. A abstract syntax tree for the parsed language for example. There are other combinators provided in the library but these basics do most of what is needed.
The 'example1.js' file shows a translation of some of the grammar productions in the EcmaScript grammar:
var zero
= action("0", function(ast) { return 0; });
var decimal_digit
= action(range("0", "9"), function(ast) { return parseInt(ast); });
var non_zero_digit
= action(range("1", "9"), function(ast) { return parseInt(ast); });
var decimal_digits
= repeat1(decimal_digit);
var decimal_integer_literal
= alternate(zero, sequence(non_zero_digit, optional(decimal_digits)));
var signed_integer
= alternate(decimal_digits,
sequence("+", decimal_digits),
sequence("-", decimal_digits));
var exponent_indicator
= alternate("e", "E");
var exponent_part
= sequence(exponent_indicator, signed_integer);
var decimal_literal =
alternate(sequence(decimal_integer_literal,
".",
optional(decimal_digits),
optional(exponent_part)),
sequence(".",
decimal_digits,
optional(exponent_part)),
sequence(decimal_integer_literal,
optional(exponent_part)));
var hex_digit
= alternate(range("0", "9"),
range("a", "f"),
range("A", "F"));
var hex_integer_literal
= sequence(alternate("0x", "0X"),
repeat1(hex_digit));
var numeric_literal
= alternate(hex_integer_literal, decimal_literal);
var single_escape_character
= alternate("'", "\"", "\\", "b", "f", "n", "r", "t", "v");
var non_escape_character
= negate(single_escape_character);
var character_escape_sequence
= alternate(single_escape_character, non_escape_character);
var hex_escape_sequence
= sequence("x", hex_digit, hex_digit);
var unicode_escape_sequence
= sequence("u", hex_digit, hex_digit, hex_digit, hex_digit);
var escape_sequence
= alternate(hex_escape_sequence,
unicode_escape_sequence,
character_escape_sequence);
var single_string_character
= alternate(negate(alternate("\'", "\\", "\r", "\n")),
sequence("\\", escape_sequence));
var double_string_character
= alternate(negate(alternate("\"", "\\", "\r", "\n")),
sequence("\\", escape_sequence));
var single_string_characters
= repeat1(single_string_character);
var double_string_characters
= repeat1(double_string_character);
var string_literal
= alternate(sequence("\"", optional(double_string_characters), "\""),
sequence("'", optional(single_string_characters), "'"));
var null_literal
= token("null");
var boolean_literal
= alternate("true", "false");
var literal
= alternate(null_literal,
boolean_literal,
numeric_literal,
string_literal);
I'd like to extend the library a bit and provide more examples. Any comments or ideas would be appreciated.
Categories: javascript

5 Comments:
Hi!
I have toyed with a same idea some time ago, here's what I came up with:
- parsers spent quite a lot of time and memory constructing new strings on a successful parse. Wrapped input in a list-like object, backed by original string, providing necessary operations:
function Stream(str, index) {
this.str = str;
this.index = index || 0;
}
Stream.prototype = {
str: '',
index: 0,
head: function() {
return this.str[this.index];
},
tail: function() {
return new Stream(this.str, this.index + 1);
},
toString: function() {
return this.str.slice(this.index);
}
}
- Each parser received two additional arguments, a success and a failure continuations. This has two benefits: first, it makes parsers even more composable and "pure".
Second, and more important, it allows us to prevent stack overflows on long input strings by converting the whole program into trampolined style: instead of calling each continuation directly, wrap its call in a lambda and return it to a top-level driver routine. This way you can even build a 'non-deterministic' parser, evaluating each branch in parallel.
Unfortunately, I quickly ran into problems like:
* You can't have two mutually-recursive parsers defined as variables, you must wrap them into functions instead. For example, the following doesn't work:
var additive = alt(seq(multitive, '+', multitive), multitive);
var multitive = alt(seq(primary, '*', multitive), primary);
var primary = alt(repeat1(digit), seq('(', additive, ')'));
Where's lazy evalution when you need it?
* Writing non-trivial programs in CPS without a type-checker is hard. At some point a most innocent-looking change used to bring the program to a halt because either a required level of lambdas was missing, or there were too many of them. Lack of concise syntax for lambdas doesn't help, too.
Hope you do better than me, although I suspect that javascript isn't really suited for full-on functional programming, after all :)
Alex.
PS. You can't have nice formatting in comments, can you?
Chris you always come up with interesting stuff I am somehow involved with ! I recently also toyed with Javascript (subset, for now) parsers. My approach was based on Top Down Operator Precedence as described by Douglas Corckford: http://javascript.crockford.com/tdop/tdop.html
The generated AST I turned into a JSON object for sending to a server and possible further processing. An example of it can be seen here: http://www.rsaccon.com/2007/08/erlang-based-javascript-compiler-there.html
Because parsing is a time consuming operation and e.g. in an online code editor (also working on that, at very early stage) you do it repeatedly) and it would block the UI, I am considering of doing that in the background with the help of google gears worker pool threads.
Roberto
Alex, my first iteration had something similar to your Stream object (Also called Stream!). I also tried using lazy evaluation and list of successes by simulating Scheme's delay and force.
Unfortunately Javascript's function syntax is too verbose to make that code look nice. In the end I ripped it all out and went for a simple approach. I'll probably redo something like Stream though as you're right that lots of copies are made. Maybe a simple 'string' lookalike that just uses index, length and the original string to be more lightweight.
I'm not sure if the combinator approach is a good for Javascript but we'll see how it goes.
The comment facilities in blogger are pretty basic - I couldn't find a nice way of formatting code either.
Roberto, that Javascript->Erlang compiler is great! I'd read the top down parsing article before and did consider taking a similar approach. But in the end I decided to explore how well combinators fit.
One thing I miss is pattern matching and destructuring operations like you get in ML.
OMeta could also be an interesting source of ideas.
Hello,
Since this post was about parser combinators in JS and Chris mentioned OMeta, I thought some of you might be interested in taking a look at the new implementation of OMeta, written in JS:
http://www.cs.ucla.edu/~awarth/ometa/ometa-js/
Cheers,
Alex
Very cool, thanks allesandro!
Post a Comment
<< Home