Tuesday, June 16, 2015

The joys of parsing a toy programming language

In my personal time, I've been playing at building a toy programming language. Progress has been slow, but things are finally starting to come together. Last night, I ended up finding an interesting bug, and I wanted to document it here.

The syntax of my language isn't super complicated. Here's an example of a function:

def foo(a, b) = a + b

Because this language is a so-called modern language, it has to support lambdas as well. Here's an example:

val bar = (a, b) => a + b

You can call functions and lambdas using the same syntax:

foo(5, 7)
bar(5, 7)

Naturally, since lambdas are first-class, we need to support more complicated call syntax as well. In particular, this should be allowed:

((a, b) => a + b)(5, 7)

In order to support that, here's the section of the grammar that describes function calls (you can imagine the rest of the grammer):

FnCall    :=  Callable lparen Args rParen
Callable  :=  identifier
          :=  lParen Expression rParen
Args      :=  identifier MoreArgs
          :=  epsilon
MoreArgs  :=  comma identifier MoreArgs
          :=  epsilon

My compiler was able to successfully compile and run this extremely simple program:

val bar = (a, b) => a + b
bar(5, 7)

But imagine my surprise when it failed to compile this program:

val bar = (a, b) => a + b
(bar)(5, 7)

Looking at the parse tree, it was obvious what had happened. It turns out that my grammar was ambiguous, and the parser had chosen this interpretation:

val bar = (a, b) => a + (b(bar))(5, 7)

Rather than parsing this as a definition and a corresponding expression, it instead parsed it as a single definition. It assumed that b was a function of one parameter, which returned a function of two parameters. Even this simpler grammar has the same problem:

FnCall    :=  identifier lparen Args rParen
Args      :=  identifier MoreArgs
          :=  epsilon
MoreArgs  :=  comma identifier MoreArgs
          :=  epsilon

Given this program:

val bar = (a, b) => a + b
(5 + 7) * 2

This could (and would) get parsed as:

val bar = (a, b) => a + b(5 + 7) * 2

(It's worth noting that, if I had been using a shift / reduce parser, this would have been detected as a S/R conflict. But I'm not using a S/R parser; I'm using the Scala parser combinator library, mostly because it was easy to get started and I haven't outgrown it yet.)

Looking at Scala, from which I stole a lot of my syntax, I found that newline handling is a bit complicated, though the rules are laid out plainly in section 1.2 of the Scala language spec. Essentially, a newline can be treated either as plain whitespace or as a statement terminator, depending on the context. Within a parenthesized expression, newlines are always treated as whitespace. Outside an expression, newlines are treated as statement terminators if they appear between a token that could end a statement and a token that could begin a statement. One one hand, having written a fair amount of Scala, the rules feel pretty natural and intuitive. However, you do end up with strange behavior at the edges, as the following example demonstrates:

(succ
  (5))   => 6

{succ
  (5)}   => 5

On the other hand, Haskell (with which I'm admittedly not very familiar) appears to use indentation level to determine expression grouping. That is, while this is valid:

main = putStrLn "hi"

and this is also valid:

main = putStrLn 
 "hi"

this is not:

main = putStrLn 
"hi"

Context is not considered, so unlike Scala, this still fails:

main = (putStrLn 
"hi")

Both of these approaches have merit. Haskell's approach is natural enough, though there are some unambiguous representations that it would reject. Scala's approach is more permissive, at the cost of more complicated implementation rules and more surprising behavior. I can't tell which is more appropriate for my language.

For now, I took neither approach. I realized that, as far as I know, I have but one case where expressions are ambiguous - when the parameters to a function appear on a separate line from the Callable itself. It was possible for me to simply require that the whitespace between the Callable and its parameters not include any newlines. So an identifier at the end of a line will NEVER be considered to be a function call. I don't think this will remain this way forever, but I think it will get me unstuck.

No comments: