Architecturally Elegant - lahteenmaki.net

Applicative Parsing

Jyri-Matti Lähteenmäki

2018-02-18

Tags: haskell

Years ago at university I learned something about creating parsers. Not much, though, and it all seemed fairly difficult with Flex/Bison and friends. Later in life, while getting familiar with Scala and Haskell, I learned about combinatorial parsing, which finally made parsing feel easy.

I’d been waiting for some spare time to learn a monadic parser combinator library, since they feel like the state of the art, and a good thing to spend some time on. Then I learned about monads, and that most parsing doesn’t actually need monadic actions.

Then I heard about Applicative Parsing, and learned that even the state-of-the-art monadic parser combinator libraries in Haskell actually come with applicative interfaces.

So, what’s going on? What does parsing really have to do with abstractions like Applicative and Monad?

Let’s try to parse a small language without any kind of parsing library to see what we can come up with. Everybody seems to like lisps, so let’s parse this:
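The original snippet isn’t preserved here, but a program in this toy language, consistent with the grammar described below, could look like:

```
(print (+ "Hello" " " "world"))
```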

First the regular headers, without implicit Prelude to see what we actually use:
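Something along these lines; the exact import list is my assumption:

```haskell
{-# LANGUAGE NoImplicitPrelude #-}
module AppParsing where

-- import explicitly everything we use, instead of relying on the implicit Prelude
import Prelude (Char, String, Show, (==), (/=), undefined, print)
```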

What is the syntax like for our toy language? A term is either a plain string, or a concatenation or a printing inside parentheses. Let’s make up an imaginary operator <|> to represent alternatives:
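As a sketch (str, parens, concatenation and printing are still left undefined at this point):

```haskell
-- a term is a string, or a concatenation or a printing in parentheses
term = str <|> parens (concatenation <|> printing)

(<|>) = undefined -- imaginary "or" operator
```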

“in parentheses” means that there’s a single term between opening and closing parenthesis character. Let’s also make up an operator <*> to represent sequential composition:
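A sketch using this imaginary operator (char, a parser of one given character, is implemented later):

```haskell
-- an opening parenthesis, followed by a single term, followed by a closing one
parens x = char '(' <*> x <*> char ')'

(<*>) = undefined -- imaginary sequential composition
```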

What is left is to define the syntax for our three kinds of terms. A string is just zero or more characters between opening and closing double quotes. Let's ignore any kind of escaping, and just forbid using double quotation marks inside strings:
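Assuming a primitive charWhen, accepting a single character matching a predicate (defined later), and a many combinator for "zero or more":

```haskell
-- a double quote, any number of non-quote characters, and a closing double quote
str = char '"' <*> many (charWhen (/= '"')) <*> char '"'
```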

A concatenation is just the + character followed by zero or more terms:
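In the same style:

```haskell
concatenation = char '+' <*> many term
```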

Printing is expressed with the string print followed by the term to print:
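Assuming a string combinator matching a given literal string (implemented near the end):

```haskell
printing = string "print" <*> term
```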

Parsing should eventually give us a data structure called an Abstract Syntax Tree, which we then could process further. A Haskell type for the nodes of our tree would be:
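The constructor names below are my reconstruction:

```haskell
data Term = Str String    -- a plain string literal
          | Concat [Term] -- concatenation of zero or more terms
          | Print Term    -- printing of a single term
```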

Now we need a way to convert our syntax to a tree of Terms. Let’s make up another operator <$> that converts a parsing result to a data type of our choice:
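Applied to the earlier definitions, something like:

```haskell
str           = Str    <$> (char '"' <*> many (charWhen (/= '"')) <*> char '"')
concatenation = Concat <$> (char '+' <*> many term)
printing      = Print  <$> (string "print" <*> term)

(<$>) = undefined -- imaginary operator converting the parsing result
```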

If you follow the imaginary types of these expressions, you’ll notice that if our expressions were parsers that produced the data they parsed, the types would actually match. Except that our sequential composition would produce too many results. For example, when parsing print we wouldn’t actually be interested in receiving the text print as long as it’s present in the syntax; we’d only be interested in the right-hand side of the composition. Let’s solve this by defining two variants of our sequential composition, which ignore one side and only return the other:
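That is:

```haskell
(<*) = undefined -- sequence two parsers, but keep only the left-hand result
(*>) = undefined -- sequence two parsers, but keep only the right-hand result
```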

Now we can improve our definitions:
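Roughly like this:

```haskell
str           = Str    <$> (char '"' *> many (charWhen (/= '"')) <* char '"')
concatenation = Concat <$> (char '+' *> many term)
printing      = Print  <$> (string "print" *> term)
parens x      = char '(' *> x <* char ')'
term          = str <|> parens (concatenation <|> printing)
```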

In the final syntax, the terms are often separated by white space. One way to handle this would be to define that a term can always start with some white space:
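For example (handling only plain spaces, for simplicity):

```haskell
whitespace = many (charWhen (== ' '))

term = whitespace *> (str <|> parens (concatenation <|> printing))
```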

To actually be able to parse a string of characters and produce something else, we’ll need to implement the most primitive of our undefineds, namely char. For this we need to think about what our Parser would actually be like. One definition is something which takes a string of characters and turns it into a list of possible results, each paired with the remaining characters:
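In Haskell, wrapped in a newtype so that we can later give it class instances:

```haskell
-- a parser of 'a's: a function from input to a list of possible results,
-- each paired with the input remaining after the match
newtype Parser a = Parser { runParser :: String -> [(a, String)] }
```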

Now we can define a primitive parser which either accepts or rejects a single character based on a given predicate, which can be used to implement the parsers accepting a single character:
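Something like:

```haskell
charWhen :: (Char -> Bool) -> Parser Char
charWhen f = Parser $ \s -> case s of
  (c:cs) | f c -> [(c, cs)] -- predicate holds: consume the character
  _            -> []        -- otherwise: fail with no results

char :: Char -> Parser Char
char c = charWhen (== c)
```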

At this point we have actually implemented the whole logic needed to parse our toy language. And we haven’t used a parsing library, monads or applicatives or pretty much anything! The only thing missing is a couple of undefineds: our five operators <|>, <*>, <*, *> and <$>, as well as two combinators, many and string.

We could implement these ourselves, but if we take a close look at the names and semantics, we might recognize these as functions from Functor, Applicative and Alternative. Let’s see if we can use these fundamental abstractions to implement the remaining pieces.

First we need a Functor instance, which we can derive:
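With the DeriveFunctor extension enabled, the instance comes for free:

```haskell
{-# LANGUAGE DeriveFunctor #-}

newtype Parser a = Parser { runParser :: String -> [(a, String)] }
  deriving Functor
```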

Then instances for Applicative and Alternative, which we have to write manually. (See One more thing for a way to automatically derive these):
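A sketch of the instances. The Alternative here returns the results of both branches; a left-biased variant that only tries the right branch when the left fails would also work:

```haskell
instance Applicative Parser where
  pure a                  = Parser $ \s -> [(a, s)] -- succeed, consuming nothing
  Parser pf <*> Parser pa = Parser $ \s ->
    [ (f a, s'') | (f, s') <- pf s, (a, s'') <- pa s' ]

instance Alternative Parser where
  empty                 = Parser $ \_ -> []
  Parser p <|> Parser q = Parser $ \s -> p s ++ q s
```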

And now the missing implementations are available from the Functor and Applicative modules:
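That is, roughly these imports:

```haskell
import Data.Functor ((<$>), (<$))
import Control.Applicative (Applicative(..), Alternative(..), many)
```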

The remaining string we'd have to implement ourselves, or use Traversable:
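Using traverse, a String being just a list of characters:

```haskell
import Data.Traversable (traverse)

string :: String -> Parser String
string = traverse char -- parse each character of the literal in sequence
```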

Now we can implement the main parsing function and derive a Show instance to get a printable AST out:
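The driver could be as simple as this (the example input here is my own):

```haskell
data Term = Str String | Concat [Term] | Print Term
  deriving Show

parse :: String -> [(Term, String)]
parse = runParser term

main = print (parse "(print (+ \"Hello\" \" \" \"world\"))")
```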

which actually works!

This is Applicative Parsing.

The Applicative interface (together with Alternative and Functor) happens to provide most of what is needed to perform parsing. As Applicative is a really abstract and general-purpose interface, it really makes me wonder why combinatorial applicative parsing libraries aren’t more popular across various programming languages.

I mentioned in the beginning that many existing monadic parser combinator libraries implement the applicative interface. This means that if (or when) I would like to utilize a performant, battle-tested parser implementation providing nice error messages with line numbers, instead of this kind of self-made junk, I can switch pretty much just by modifying the import statements. For example, to make this example work with Megaparsec (6.x), I can do it like this:
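A sketch of the changed imports, from my recollection of the Megaparsec 6.x API (details may need adjusting):

```haskell
import Control.Applicative ((<|>), many)
import Data.Void (Void)
import Text.Megaparsec (Parsec, runParser)
import Text.Megaparsec.Char (satisfy, char, string)

type Parser = Parsec Void String

-- our primitive maps directly onto Megaparsec's satisfy
charWhen :: (Char -> Bool) -> Parser Char
charWhen = satisfy
```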

This is the whole implementation.

The big point is, that in order to do parsing, you don’t actually need to learn a parser generator or a parser combinator library. You only need to learn Functor and Applicative interfaces, which should already be (but unfortunately are not) taught in every university program related to software development.

Monads are needed only when building a context sensitive parser, where a step requires some information from previous steps. Using the Applicative interface leaves (at least in theory) more optimisation possibilities for the implementation. If you want to build an understanding of monads in general, I’d recommend browsing through my own presentations since unfortunately, I believe, most Monad tutorials are missing the point even more than I am ;)

One more thing

Recently I ran into a related blog post, which defined a way to reduce code by deriving the Applicative instances:
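The linked post isn’t reproduced here, but one well-known trick goes like this: since String -> [(a, String)] is exactly StateT String [] a, the instances can be borrowed from the state monad transformer with GeneralizedNewtypeDeriving (my sketch, not necessarily the post’s exact approach):

```haskell
{-# LANGUAGE GeneralizedNewtypeDeriving #-}

import Control.Applicative (Alternative)
import Control.Monad.Trans.State (StateT(..))

newtype Parser a = Parser (StateT String [] a)
  deriving (Functor, Applicative, Alternative)

runParser :: Parser a -> String -> [(a, String)]
runParser (Parser p) = runStateT p
```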

Please leave any questions and suggestions in the comments! Especially if you think I have completely missed some point, and have a chance to learn something :)

The (compiling and runnable) code examples are available on GitHub: https://github.com/jyrimatti/app-parsing