Wednesday, October 31, 2007

Javascript Packrat Parser

I've made some changes to my parser combinator library written in Javascript that I wrote about previously.

I was extending the example code from that post to parse more of the ECMAScript 3 grammar to test it out and was hitting performance issues. I ended up modifying the combinator library to be a Packrat style parser. The Parsing Expression Grammar Wikipedia page has a description of PEGs and what a Packrat parser does. Basically results of each parse step are memoized to prevent reparsing after backtracing, sacrificing memory for speed:
Any parsing expression grammar can be converted directly into a recursive descent parser[citation needed]. Due to the unlimited lookahead capability that the grammar formalism provides, however, the resulting parser could exhibit exponential time performance in the worst case.

By memoizing the results of intermediate parsing steps and ensuring that each parsing function is only invoked at most once at a given input position, however, it is possible to convert any parsing expression grammar into a packrat parser, which always runs in linear time at the cost of substantially greater storage space requirements.

A packrat parser[1] is a form of parser similar to a recursive descent parser in construction, except that during the parsing process it memoizes the intermediate results of all invocations of the mutually recursive parsing functions. Because of this memoization, a packrat parser has the ability to parse many context-free grammars and any parsing expression grammar (including some that do not represent context-free languages) in linear time.
I changed a few other things in the library to more closely map to the vocabulary of PEGs. 'alternative' is now called 'choice' for example. There are still quite a few loose ends to tidy up and documentation of course.

The updated library can be retrieved from my git repository:
git clone git://double.co.nz/git/jsparse.git
It can run in a browser but I've mostly been testing it using Mozilla Rhino.

I've included three basic examples. They all operate on the example expression grammar from the wikipedia article:
Value   := [0-9]+ / '(' Expr ')'
Product := Value (('*' / '/') Value)*
Sum := Product (('+' / '-') Product)*
Expr := Sum
The first, example1.js, is a direct translation of that grammer. It produces a pretty ugly default Abstract Syntax Tree however:
var Value = choice(repeat1(range('0','9')), Expr);
var Product = sequence(Value, repeat0(sequence(choice('*', '/'), Value)));
var Sum = sequence(Product, repeat0(sequence(choice('+', '-'), Product)));
var Expr = Sum;
The Expr parser can be called by passing it a string to be parsed wrapped in a 'ParserState' object. This object is used to keep track of the current parse position and the memoized results. A helper function, 'ps', can be used to construct it:
var result = Expr(ps("1+2*3"));
The second example, example2.js adds to this to produce a better AST. It also uses the 'chainl' parser combinator to handle grouping correctly. A quick online page demonstrating this example is here. Enter an expression matching the grammar (there is no error checking yet), and press the button to see the AST in JSON format.

The third example, example3.js, evaluates the expression as it parses instead of generating an AST. This is also available online to try.

I've also included in the repository the work in progress of the ECMAScript 3 grammar. It is not complete or correct yet but I use it for testing the library.

Based on what I've learnt from doing this I plan to revisit the way I did Parser Combinators in Factor.

Categories:

Labels:

Monday, October 29, 2007

Kapono International Ukulele Festival in Auckland this weekend

New Zealand's first ukulele festival is being held this weekend in Auckland. kiwiukulele.com has the details. It features international & local artists:
  • James Hill (Canada)
  • Matt Heena (USA)
  • Azo Bell (Australia)
  • Sione Aleki (Tonga)
  • Pi`ikea Clark(Hawaii)
  • Black Strings (Auckland)
  • Pacific Curls (Auckland)
  • Bill Sevesi. (New Zealand)
  • Chuck Upu (Cook Islands)
  • Big Muffin Serious Band (Waikato)
  • International Ukulele Orchestra (Wellington)
  • The Kiwileles (Combined Schools Orchestra)
Those of you working in the Mozilla Auckland office are sure to have heard the sounds of James Hill as I used some of his videos for testing <video> support!

The festival is at Mount Roskill Intermediate School, Auckland on Saturday, 3rd November. See the schedule for details. There's even a MySpace page dedicated to it.

Categories:

Wednesday, October 24, 2007

Cyclone: A Safe Dialect of C

Robert O'Callahan's recent post about ownership types, stack allocation and abstraction penalties has a comment pointing to the Cyclone programming language.

This is an interesting looking language in that it has many of the properties of C which make it popular for system level development:
Cyclone is like C: it has pointers and pointer arithmetic, structs, arrays, goto, manual memory management, and C’s preprocessor and syntax.
But it also includes higher level constructs and safer memory management:
Cyclone adds features such as pattern matching, algebraic datatypes, exceptions, region-based memory management, and optional garbage collection
The developers are promoting it as a safer dialect of C:
Cyclone is safe: pure Cyclone programs are not vulnerable to a wide class of bugs that plague C programs: buffer overflows, format string attacks, double free bugs, dangling pointer accesses, etc.
The user manual has a good description of the various features.

Categories:

Tuesday, October 23, 2007

ECMAScript 4 Overview Paper

An overview paper describing the ECMAScript 4 language features was announced on the es4-discuss mailing list:
I'm pleased to present you with an overview paper describing ES4 as the language currently stands. TG1 is no longer accepting proposals, we're working on the ES4 reference implementation, and we're expecting the standard to be finished in October 2008.
...
This paper is not a spec, it is *just* a detailed overview. Some features may be cut, others may be changed, and numerous details remain to be worked out, but by and large this is what TG1 expects the language to look like. Your comments on the paper are very much welcome. Please send bug reports directly to me, everything else to the list.
The document is available on the ECMAScript 4 language site in overview.pdf.

Categories:

Labels:

Wednesday, October 03, 2007

Theora needs an MSVC compatible inline assembler expert

The Theora project needs someone knowledgeable in the inline assembler syntax used in the Microsoft Visual C++ toolset.

Theora has some assembler portions written using the GNU assembler syntax. Some way of having these same optimizations working when building the library with the Microsoft toolset would be very useful.

Categories:

Tuesday, October 02, 2007

Javascript Parser Combinators

Cleaning up my hard drive I came across some old libraries I'd written. One of them was a simple set of parser combinators written in Javascript. I put it in a git repository in case they prove useful to someone:
git clone git://double.co.nz/git/jsparse.git
git clone http://double.co.nz/git/jsparse.git
The library pulls ideas from parser combinators written in various languages and is pretty simple. But even though it is a small amount of code it works quite well. The repository includes an example of parsing numbers and strings based on the grammar in the EcmaScript specification.

A parser in this library is a function that takes an input string and returns a result object. The result object contains three fields. They are:
  • remaining: the remaining part of the input string to be parsed
  • matched: the part of the input string that was successfully parsed by the parser
  • ast: The abstract syntax tree produced by the parsr
Parser combinators combine parsers together to enable parsing more complex grammars. A number of standard parsers and combinators are provided.

'token' is a combinator that takes a string and returns a parser that will successfully parse an instance of that string:
js> load("jsparse.js")
js> var p = token("begin")
js> uneval(p("begin ... end"))
({remaining:" ... end", matched:"begin", ast:"begin"})
The AST produced by parser is the string that it parsed.

'range' returns a parser that parses single characters within the range of the upper and lower character bounds (inclusive) given:
js> var p = range("0", "9")
js> uneval(p("5"))
({remaining:"", matched:"5", ast:"5"})
js> uneval(p("a"))
false

'negate' takes an existing single character parser, and returns once which parses anything but that which the original parser parsed. For example, 'negate(range("a", "z"))' will return a parser which parses anything except the letters from a to z inclusive:
js> var p = negate(range("a", "z"))
js> uneval(p("g"))
false
js> uneval(p("5"))
({remaining:"", matched:"5", ast:"5"})

'sequence' takes any number of parsers as arguments and returns a parser which suceeds if all the given parsers succeed in order. The AST it returns is an array of the results of each of the parsers.
js> var p = sequence(token("a"), token("b"), token("c"))
js> uneval(p("abcdef"))
({remaining:"def", matched:"abc", ast:["a", "b", "c"]})
js> uneval(p("abdef"))
false

'alternate' provides choice between parsers. It takes any number of parsers as arguments and will try each of them in order. The first one that succeeds results in a successful parse, and its result is the AST:
js> var p = alternate(token("a"), token("b"), token("c"))
js> uneval(p("a123"))
({remaining:"123", matched:"a", ast:"a"})
js> uneval(p("b123"))
({remaining:"123", matched:"b", ast:"b"})
js> uneval(p("c123"))
({remaining:"123", matched:"c", ast:"c"})
js> uneval(p("d123"))
false

'repeat0' does the equivalent of '*' in regular expressions. It takes a parser and returns a parser which will parse zero or more occurrences of the original parser. The AST is an array containing the AST result of the original parser for each successful occurrence:
js> var p = repeat0(range("0", "9"))
js> uneval(p("12345abcd"))
({remaining:"abcd", matched:"12345", ast:["1", "2", "3", "4", "5"]})
js> uneval(p("123abcd"))
({remaining:"abcd", matched:"123", ast:["1", "2", "3"]})
js> uneval(p("abcd"))
({remaining:"abcd", matched:"", ast:[]})

'repeat1' does the equivalent of '+' in regular expressions. It takes a parser and results one which will parse one or more occurences of the original parser:
js> var p = repeat1(range("0", "9"))
js> uneval(p("12345abcd"))
({remaining:"abcd", matched:"12345", ast:["1", "2", "3", "4", "5"]})
js> uneval(p("abcd"))
false

'optional' takes a parser and returns one which matches exactly zero or one instances of the original. The AST result is 'false' for the case where there is no match or the result of the original parser if there is a match:
js> var p = sequence(optional(alternate("+", "-")), repeat1(range("0", "9")))
js> uneval(p("1234"))
({remaining:"", matched:"1234", ast:[false, ["1", "2", "3", "4"]]})
js> uneval(p("-1234"))
({remaining:"", matched:"-1234", ast:["-", ["1", "2", "3", "4"]]})
js> uneval(p("+1234"))
({remaining:"", matched:"+1234", ast:["+", ["1", "2", "3", "4"]]})
js> uneval(p("*1234"))
false
You'll notice in this example that I pass "+" and "-" directly instead of token("+") and token("-"). The parsers in this library will automatically convert strings to parsers where needed to make for terser and more readable code.

The AST produced by some of the generated parsers can be non-optimal. For example, a simple parser will produce an array of strings for each digit:
js> var p = repeat1(range("0", "9"))
js> p("123").ast
1,2,3
js> uneval(p("123").ast)
["1", "2", "3"]
The 'action' combinator takes a parser and a function. The function is called with the result of the AST produced by the parser, and the result of the function becomes the new AST. For example, compare these two:
js> var p = action(range("0", "9"), function(ast) { return parseInt(ast) })
js> uneval(p("1"))
({remaining:"", matched:"1", ast:1})
js> var p = range("0", "9")
js> uneval(p("1"))
({remaining:"", matched:"1", ast:"1"})
In the first the result is an actual number, in the second it is a string. Any object can be returned and used for the AST. A abstract syntax tree for the parsed language for example.

There are other combinators provided in the library but these basics do most of what is needed.

The 'example1.js' file shows a translation of some of the grammar productions in the EcmaScript grammar:
var zero 
= action("0", function(ast) { return 0; });
var decimal_digit
= action(range("0", "9"), function(ast) { return parseInt(ast); });
var non_zero_digit
= action(range("1", "9"), function(ast) { return parseInt(ast); });
var decimal_digits
= repeat1(decimal_digit);
var decimal_integer_literal
= alternate(zero, sequence(non_zero_digit, optional(decimal_digits)));
var signed_integer
= alternate(decimal_digits,
sequence("+", decimal_digits),
sequence("-", decimal_digits));
var exponent_indicator
= alternate("e", "E");
var exponent_part
= sequence(exponent_indicator, signed_integer);
var decimal_literal =
alternate(sequence(decimal_integer_literal,
".",
optional(decimal_digits),
optional(exponent_part)),
sequence(".",
decimal_digits,
optional(exponent_part)),
sequence(decimal_integer_literal,
optional(exponent_part)));

var hex_digit
= alternate(range("0", "9"),
range("a", "f"),
range("A", "F"));
var hex_integer_literal
= sequence(alternate("0x", "0X"),
repeat1(hex_digit));

var numeric_literal
= alternate(hex_integer_literal, decimal_literal);

var single_escape_character
= alternate("'", "\"", "\\", "b", "f", "n", "r", "t", "v");
var non_escape_character
= negate(single_escape_character);
var character_escape_sequence
= alternate(single_escape_character, non_escape_character);
var hex_escape_sequence
= sequence("x", hex_digit, hex_digit);
var unicode_escape_sequence
= sequence("u", hex_digit, hex_digit, hex_digit, hex_digit);
var escape_sequence
= alternate(hex_escape_sequence,
unicode_escape_sequence,
character_escape_sequence);
var single_string_character
= alternate(negate(alternate("\'", "\\", "\r", "\n")),
sequence("\\", escape_sequence));
var double_string_character
= alternate(negate(alternate("\"", "\\", "\r", "\n")),
sequence("\\", escape_sequence));
var single_string_characters
= repeat1(single_string_character);
var double_string_characters
= repeat1(double_string_character);
var string_literal
= alternate(sequence("\"", optional(double_string_characters), "\""),
sequence("'", optional(single_string_characters), "'"));

var null_literal
= token("null");
var boolean_literal
= alternate("true", "false");

var literal
= alternate(null_literal,
boolean_literal,
numeric_literal,
string_literal);


I'd like to extend the library a bit and provide more examples. Any comments or ideas would be appreciated.

Categories:

Monday, October 01, 2007

Standard ML to Javascript compiler

smltojs is a compiler from Standard ML to Javascript. According to the page it has support for all of Standard ML.

Since the reference implementation of Ecmascript 4 is written in Standard ML it would be interesting to see if it can be built using this compiler. That would provide an es4 implementation that runs in the browser based off the reference implementation.

Categories:

Labels:

The Poker Rollercoaster

It's been about six months since I last updated on my Poker progress. As I mentioned in that post I'd switched to mainly playing cash games rather than tournaments. This has had a bit of a rollercoaster effect with up and down swings. Thankfully more ups and downs. I've not had to deposit into my account since my first initial deposit loss about 18 months ago.

May, June and July were all profitable months playing cash games, while August was a small loss. The reason for the loss was me experimenting with playing different stakes and a slightly different type of game (Shorthanded 6 max No Limit instead of my usual 9 handed full ring No Limit). I struggled with the change, dropped a few buyins, and switched back to full ring to recover in September.

September started off very well, winning back my August loss and doing better than my previous two months combined - all within the first two weeks! I moved up in stakes since my online bankroll could handle it and again struggled with the change in game. I dropped a few buy-ins before I adjusted but things have settled back down again to a steady profit at the new stakes. The graph on the right shows the ups and downs over last 6 months or so.

In the graph you'll see a large jump up in September. This was thanks to coming 2nd in Full Tilt Poker's 'Midnight Madness' tournament a couple of weeks ago, wining $1,800 for an $11 entry fee. Thankfully the 'Midnight Madness' actually started in NZ time at a more reasonable 4pm, which makes it possible to play on the weekend. It took about 4 hours to play through. That makes for a nice hourly rate, but if you take into account the times I've entered and not got anywhere...Still I'm way ahead in profits for that tournament thanks to that placing.

Categories:

Firefox Video Element Patch Version 5

Version 5 of the patch to add Video element support to Firefox is up. This version rewrites a lot of the code to match the pseudocode in the WHATWG specification.

As well as bug fixes it contains support for the 'autoplay' attribute and loads the first frame of the video when the element is first displayed. This version also runs the SVG demo I made a video of in an earlier post. I tested the demo under Windows, Linux and Mac OS X and it seemed to work fine. That demo, and others, is available on my video patch test page. Binaries are also available on that page.

Don't forget you can track progress using the git repository if you don't want to wait for the patches:
git clone git://double.co.nz/git/video.git


Categories: ,

Labels: