April 23rd, 2013 by Hao Lian

Dive into parser combinators: parsing search queries with F# and FParsec in Kiln

We open on: the past

The year: 2012. The problem: search. With a new release of Kiln, search is now forefront and center. You can zip around repositories or code with a simple tap of the keys, and boy is the future bright.

Powering search was our search engine. And powering it was our search-query parser, a couple hundred lines of code that parsed a query into a list of keywords and filters. For example, if you asked of Kiln

foo bar project:Eggs date:yesterday..now author:Tyler

Kiln finds all the commits, by people named “Tyler,” to a repository in projects named “Eggs” since yesterday with the words “foo” or “bar” in the commit message.

But try to search "foo bar" and you would be disappointed. The unspoken rule of the internet is that surrounding two words in quotation marks should make a search engine look for both words as one phrase instead of two separate words. So "foo bar" should match the string “boy I had a lot of foo bar pie” but not the string “foo and bar are two friends from way back when.” Pretty goofy rule, but the internet is a goofy place.

It’s 2012, and Kiln does not have phrase search. We left it on the cutting floor to make room for everything else we wanted, and we regret it. Life moves on.

And we cut to: the present

The year: 2013. Not having phrase search: more and more irritating. Having migrated our full-text indexing to elasticsearch, phrase searches are not only possible but easy. So you, being a developer on the Kiln team, don glasses and open the .cs file containing the query parser. Written in C# and presented for your consideration is a jumble of grammar rules and intermediate parse trees, a jungle of loops and state. A flock of crows take off from a nearby tree. You close the file.

“This seems like the ideal intern project,” you think to yourself. “It would be a shame to not allow someone else to rewrite this.”

Just then, Andrew Pritchard walks by your office. Andrew Pritchard was our summer 2012 intern who worked on a dazzling array of Kiln features, including phrase search. We will borrow a hypothetical version of him. Look at him, walking with the smooth confidence of a man not yet burdened by string parsing.

“Help us, Hypothetical Andrew Pritchard,” we said. “What do you know about parsing?”

H.A.P. points you at FParsec, a parser combinator library for F#. He begins erasing your whiteboard and drawing diagrams while you wonder what he is talking about.

“Hold on,” you say, slapping the multi-colored markers out of his hand. “I have many reservations about what’s happening right now but here’s the biggest one. F# is that that new functional-programming language from Microsoft right? Kiln is a giant ASP.NET MVC application that uses C#. There is no room for F#, Hypothetical Andrew Pritchard, you crazy lovable human being you.”

“No,” he replies. You two stare at each other for a while.

It turns out that .NET’s Common Language Runtime, plus increasingly better F# support in Visual Studio, lets you create an F# library inside your solution and reference it from a C# project. There are some quirks: ReSharper support for F# is ongoing, F# files have to be sorted in the solution tree in the order you want them to be compiled, and F# collection types map awkwardly from and to C#—to name three big ones. Overall though the experience is surprisingly pleasant. I say in the year 2013 you can (and should) alternate between F# and C# depending on the problems you are solving.

We created an F# project with the source code in this blog post if you would like to follow along. If you are not familiar with F#, fear not! By and large the F# syntax can be intuited; for a look-see, Wikipedia also has a buffet of code snippets. On my part I’ll use highly descriptive variable names and mention C# analogues to F# features when possible.

“FParsec is great, but we need F#. No biggie,” H.A.P says, shrugging his shoulders. “Besides, F# is functional, which means it’s ideal for a self-contained, computer-science-y project like string parsing.”

“It is fun.”

“You will like it.”

You are sort of convinced. In any case, he has covered your whiteboard in figures and symbols. He looks at you, then looks at the board. He walks over and gently pushes you out of your chair. You get up, brush yourself off, and read the notes on the whiteboard as he begins typing into your computer. Which notes are:

Parsers

In the world of parser combinators, a parser is a function that takes an initial state s and returns a final state s' in addition to an output object:

p .>>. q

If the output is of type 't, we say the parser is of type 't. The black arrow shows the output.

The state is a set of facts about the world.

Let’s give ourselves a parser p that finds the string hello inside other strings. And let’s run p on the string “hello, world”. Then the state of the world prior to running p (the initial state s) as indicated by the first blue arrow has these facts:

  • Our string is “hello, world”;
  • Our current position is index 0;
  • No errors so far;
  • etc.

The state of the world after running p (the final state s') as indicated by the second blue arrow has these other facts:

  • String: still “hello, world”;
  • Current position: now index 5 (the comma);
  • Errors: still none;
  • etc.

We say that the parser has consumed the string “hello”, which means the parser has read the string “hello” and has adjusted the current position accordingly.

A parser can also fail. For example, if we run p on the string “salutations, world” an exception is raised as soon as the parser realizes there is no “hello”, and the whole thing is called off.

FParsec comes with a smörgåsbord of helper functions for parsing common chores. Some that are relevant to us:

These higher-order functions take arguments and return a parser. (By the way, fun c -> c <> ' ' is just a lambda. The C# equivalent, which is actually terser!, would be c => c != ' '.)

pchar '"' and manySatisfy (fun c -> c <> ' ') are parsers of type string, as indicated by their type Parser<string, unit>, which makes sense since they consume a string and then immediately output it. (You may be wondering what that second generic type argument unit is for. [unit is the same as void in C#.] We will leave the resolution to that mystery to the excellent FParsec documentation.)

Parser combinators

A parser combinator is function that takes multiple parsers and returns a new one. Every parser you’ll ever write in FParsec will be a pyramid with small simple parsers at the bottom building up to something big and wonderful at the top.

FParsec comes with a zoo of parser combinators. Some are functions, but most are operators. In F#, you can define custom operators and FParsec is nothing if not a testament to that feature. This lets us write, for example, p <|> q instead of (say, hypothetically) alternate p q. There is a trade-off though: the library can be intimidating at first. It may seem like a place of tall runes and cold magic, where the flesh and soul are tested, and light is dim, joyless. But armed with logic and courage, you will soon be able to write fairly complicated parsers in only a few terse lines.

So, using these primitives, little itty parsers get glued together into a big ol’ parser, which is what we will be doing for Kiln. You can take apart the pieces when you are reading and understanding it, and you can build up the pieces when you are writing and debugging it. You might even dare to reuse a parser or two. The same cannot be said for the kind of parsers out there in the harsh imperative world—try refactoring that while-loop with the big switch statement into modular components without messing up the big global state object.

And also: no preprocessor.

Some combinators that are relevant to us:

Sequencing with >>. and .>>

val (>>.):   Parser<'a,'u> -> Parser<'b,'u> -> Parser<'b,'u>

Here’s what you do if you’re >>.: First run p and throw away its output (black arrow). Then, following the state (blue arrows), run q and take its output.

p >>. q

The cousin of >>. is .>>, which throws away the second output and takes the first:

val (.>>):   Parser<'a,'u> -> Parser<'b,'u> -> Parser<'a,'u>

p .>> q

Tuple sequencing with .>>. (or “two eyes, full hearts”)

val (.>>.):  Parser<'a,'u> -> Parser<'b,'u> -> Parser<('a * 'b),'u>

First run p. Then run q. Take both outputs and tuple them, which we denote by multiplying the two types together, just as F# does. (For example, a tuple of an integer and a string in F# has a type int * string, which is just shorthand for Tuple<int, string> like you might see in C#.)

p .>>. q

Alternation with <|>

val (<|>): Parser<'a,'u> -> Parser<'a,'u> -> Parser<'a,'u>

Say you have two parsers of the same type 't. Run p first. If it fails without altering the state, run q instead.

p <|> q

If you recall from compilers, you can write down a grammar in a hoity-toity syntax that looks like

compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt
if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]
while_stmt: 'while' test ':' suite ['else' ':' suite]
for_stmt: 'for' exprlist 'in' testlist ':' suite ['else' ':' suite]
try_stmt: ('try' ':' suite
           ((except_clause ':' suite)+
            ['else' ':' suite]
            ['finally' ':' suite] |
           'finally' ':' suite))
with_stmt: 'with' with_item (',' with_item)*  ':' suite
with_item: test ['as' expr]

You can think of the <|> operator from FParsec as a rough correspondence to the | operator from Backus-Naur syntax. Note that <|> will fail if p alters the parser state. If backtracking (trying a parser without actually altering the state) is what you are looking for, seek FParsec’s excellent documentation on parsing alternatives.

Separation with sepBy

val sepBy: Parser<'a,'u> -> Parser<'b,'u> -> Parser<'a list,'u>

The combinator to turn to when parsing a list of things:

sepBy

  1. Allocate a list of type 'p list.
  2. On failure, stop and output the list.
  3. On success, append the output of type 'p to the list. Then run sep and then go to step 2.

In F#, generic types’ arguments go before the generic type. So the type of the output is 'a list instead of, in C#, List<'a>. Also in F#, as you might have noticed, type variables are prefixed with a single quote because single quotes go out on Friday night and have all sorts of fun.

Function application with |>>

val (|>>): Parser<'a,'u> -> ('a -> 'b) -> Parser<'b,'u>

p |>> f

Run p once, which outputs an object of type 't. Pass said object to f: 'a -> 'b. Result: output of type 'b.

The code

“OK,” you say. “This is snazzy, but now what?”

H.A.P. is kneading his temples back and forth with his fingers. “This is what I am thinking,” he says. You wonder if he could think less dramatically.

  • A keyword is a phrase or just a word.
    • A phrase is a bunch of characters surrounded by quotation marks.
    • A word is a bunch of non-space characters.
  • A filter is a word (from a whitelist of approved words) followed by a colon followed by an argument (which is a word or phrase).
  • Our input is a list of keywords and filters.

Keywords

Since elasticsearch differentiates between phrase and word searches, we need to encode the difference using types so that the code that calls into our query parser knows what to do. In F#, the way we model this is with a union type, which is analogous to union in the C programming language. So you push H.A.P. out of the way and type:

type Keyword = Word of String | Phrase of string

In F#, you should read this as “Keyword is a type with two constructors. Constructor number one is Word, which takes a string and returns a Keyword. (Constructors are just functions.) Constructor number two is Phrase, which takes a string and returns an Keyword.”

A word is a string of one or more characters that are not spaces, so we pull out the built-in many1Satisfy from earlier:

let notSpace c = c <> ' '
let word = many1Satisfy notSpace

A phrase is trickier. A phrase is a string of characters (let’s call them innards) between two quotation marks. Innards cannot contain a quotation mark unless it is escaped by a backslash. For example: "Maria said, \"I love you,\" to Mark. Mark gasped." would be a single valid phrase.

So a rigorous definition: one innard is either (1) parse a non-quotation-mark character or (2) parse a backslash followed by a quotation mark. This calls for another <|>.

let phraseEscape = pchar '\\' >>. pchar '"'
let phraseInnard = phraseEscape <|> noneOf "\""

And a phrase is just many innards surrounded by quotation marks:

let phraseInnards = manyChars phraseInnard
let phrase = between (pchar '"') (pchar '"') phraseInnards

between left right middle is another built-in parser. As the documentation for between notes, it is really just shortand for left >>. middle .>> right.

We now have two parsers of type string: word and phrase. All that remains is to lift them into the Keyword type:

let keyword = (phrase |>> Phrase) <|> (word |>> Word)

“Quiz:”, announces H.A.P. “This will not work if we switched around the arguments to <|> i.e. (word |>> Word) <|> (phrase |>> Phrase). Why?”

Who just gives out quiz questions? What kind of person does that? Quiz: What kind of person does that?

Filters

Here are all the filters we have in Kiln search:

type Filter =
    | Author of string
    | File of string
    | Project of string
    | Repo of string

Remember, the right-hand side of a type definition lists the constructors for that type and also the arguments of the constructor. For example, we can infer from this that Author has type string -> Filter.

There are three parts to a filter in the Kiln search-query syntax:

  1. The name of the filter. Like author.
  2. The colon.
  3. The filter’s argument(s). Like Tyler in author:Tyler.
  4. We also want to allow phrase searches like author:"Tyler" or author:"Tyler H". Because phrases are cool.

The code:

let filterName = (* to be written *)
let findFilter = function
    | "author" -> Author
    | "file" -> File
    | "project" -> Project
    | "repo" -> Repo
    | _ -> raise <| Exception "Invalid filter name"
let pipeline (f, x) = f x
let filter =
    (filterName |>> findFilter)
        .>> (pchar ':')
        .>>. (phrase <|> word)
        |>> pipeline

The strategy:

  1. Parse the filter name. For example, given author:"Tyler", consume author.
  2. Find the corresponding Filter constructor (findFilter). For example, author means Author: string -> Filter. If the filter name is not one of the ones we recognize, the parser should fail.
  3. Parse a colon, throw it away (.>>). In our example, this now leaves us with just "Tyler".
  4. Parse the filter argument, which can be a word or, now, a phrase.
  5. With .>>., tuple the Filter constructor and the string argument together.
  6. Pass the tuple to pipeline, which applies the filter constructor (string -> Filter) to the filter argument (string) and returns a Filter.
  7. We now have a Parser<Filter, unit>!

Which looks like:

filter

Remember that the blue arrows track how the string is consumed, which in this case is fairly simple, and the black arrows track how the output is manipulated, which is more complicated.

FParsec’s operators sure are a double-edged sword: the code for filter is hard to read without knowing what the operators do. But we did just compose a smart complicated parser without breaking a sweat. Overall I would say the library hurts the bright-eyed beginner but benefits the adept novice—but look at you guys, you guys are all adept novices already and I am proud of you.

Speaking of getting advanced in F#, H.A.P. has taken your wireless keyboard and typed:

let filterName =
    ["author"; "file"; "project"; "repo"]
    |> Seq.map pstring
    |> choice

“Get out of here, H.A.P. Do you ever drink water?” He stares out the window, licking his lips.

You should know that x |> f in F# is the same as f x. Futhermore, Seq.map f maps a function over a list and returns the new list, and choice [p, q, r] is roughly the same as p <|> q <|> r. While you are puzzling this all out, H.A.P. has placed one hand gently on the window, the way people do at inmate visitations.

Putting keywords and filters together

We want to store keywords and filters together in one list for simplicity’s sake, so let’s introduce another union type:

type Atom = KeywordAtom of Keyword | FilterAtom of Filter

Example: in the query

"foo bar" project:Eggs date:yesterday..now

The first atom is a phrase foo bar. The second and third atoms are filters.

And so we turn to the <|> operator one last time:

let atom = (filter |>> FilterAtom) <|> (keyword |>> KeywordAtom)

This has the type Parser<Atom, unit>, as desired. “Always satisfying to have the type of a function match the name of the function, isn’t it?” H.A.P. says as he reclines in your office chair. You were not aware your office chair had reclining capabilities.

It’s a magical world, Hobbes, ol’ buddy

H.A.P. stands up from the nap he was taking on the floor. “I am proud as well,” he says. He takes out an “A+” sticker and puts it on your hand. With a flourish of his hat—which was your hat, that he took—he leaves your office.

There is not much left to do. Now that we can parse an atom, we can parse a bunch of atoms separated by spaces, which is the parser we wanted all along:

let spaces = many1Satisfy (fun c -> c = ' ')
let parser = sepBy atom spaces

And that’s it. To run the parser, FParsec comes with a library function called run:

exception ParseError of string

let parse input =
    match run parser input with
    | Success (atoms, _, _) -> atoms
    | Failure (error, _, _) -> raise (ParseError error)

run returns a success or an error (another union type!). On success, atoms is of type atom list and we simply return it. On error, error is of type string and, since this is a toy parser, we simply raise an exception. (With match, we are using pattern matching, which alone is reason enough to try out F#. It’s the same trick that made findFilter so short.)

And what’s the type of parse? parse: string -> Atom list. That’s jazz, baby.

Sure there are some i’s to dot and t’s to cross:

  • Parsing date-range filters. The date filter constructor will have the type DateTime -> DateTime -> Filter so it is not as simple as just extending filterName and findFilter.
  • Part of that is also translating human date descriptions like “yesterday” and “today” into usable System.DateTime objects.
  • Better error reporting, about which there is much to say.
  • Using lookahead and backtracking to make the parser more robust.
  • Writing a blog post, to which you commit not knowing that it requires you to learn more Photoshop than you ever wanted to.

But it is nice outside and you do love your family.

I pushed the source code to a public Kiln repository, which contains not only the code in this blog post but also some of the tests we use for our in-production parser; we are a big fan of unit testing, but who isn’t these days? (You can run the tests by hitting Test -> Run -> All Tests in Visual Studio.) I took a crack at implementing a date-range parser as a quick one-hour project but it is fairly fragile—you can do better! Play around! It’s what H.A.P. would want, with those big sad eyes of his.

And afterward, sign up for Kiln Harmony to see all the other cool stuff we put into our last release.

Hao Lian is a programmer on the Kiln team. Did you know you can create custom vibrations in iOS?