Applicative Parsing
2018-02-18
Years ago in the university I learned something about creating parsers. Not much, though, and it all seemed fairly difficult with Flex/Bison and friends. Later in life, when getting familiar with Scala and Haskell, I learned about combinatorial parsing, which finally made parsing feel easy.
I’ve been waiting for some spare time to learn some monadic parser combinator library, since they feel like state of the art, and a good thing to spend some time on. Then I learned about monads, and that most parsing doesn’t actually need monadic actions.
Then I heared about Applicative Parsing, and learned that even the state of the art monadic parser combinator libraries in Haskell actually come with applicative interfaces.
So, what’s going on? What does parsing really have to do with abstractions like Applicative and Monad?
Let’s try to parse a small language without any kind of parsing library to see what we can come up with. Everybody seems to like lisps, so let’s parse this:
+ "Hello " "World!”)) (print (
First the regular headers, without implicit Prelude
to
see what we actually use:
{-# LANGUAGE NoImplicitPrelude, LambdaCase, StandaloneDeriving, DeriveFunctor #-}
module Parser where
import Prelude (undefined,String,Char,Bool,($),(.),(==),(/=),Show,Maybe(Just,Nothing))
What is the syntax like for our toy language? A term is either a
plain string, or either a concatenation or printing inside parentheses.
Let’s make up an imaginary operator <|>
to represent
alternatives:
= str <|> inparens (concat <|> print)
term <|>) = undefined (
“in parentheses” means that there’s a single term between opening and
closing parenthesis character. Let’s also make up an operator
<*>
to represent sequential composition:
= char '(' <*> t <*> char ')'
inparens t = undefined
char <*>) = undefined (
What is left is to define the syntax for out three kinds of terms. A string is just zero or more characters between opening and closing double quote. Let's ignore any kind of escaping, and just forbid using double quotations marks inside strings:
= char '"' <*> many (notChar '"') <*> char '"'
str = undefined
notChar = undefined many
A concatenation is just the +
character followed by zero
or more terms:
concat = char '+' <*> many term
Printing is expressed with the string print
followed by
the term to print:
print = string "print" <*> term
= undefined string
Parsing should eventually give us a data structure called an Abstract Syntax Tree, which we then could process further. A Haskell type for the nodes of our tree would be:
data Term = Str String | Concat [Term] | Print Term
Now we need a way to convert our syntax to a tree of Terms. Let’s
make up another operator <$>
that converts a parsing
result to a data type of our choice:
<$>) = undefined
(
= Str <$> (char '"' <*> many (notChar '"') <*> char '“')
str concat = Concat <$> (char '+' <*> many term)
print = Print <$> (string "print" <*> term)
If you follow the imaginary types of these expressions, you’ll notice
that if our expressions would be parsers that produced the data they
parsed, the types would actually match. Expect that our sequential
composition would produce too much results. For example, in parsing
print
we wouldn’t actually be interested in receiving the
text print
as long as it’s present in the syntax, we’d only
be interested in the right hand side of the composition. Let’s solve
this by defining two variants for our sequential composition, which
ignore one side and only return the other:
-- "ignoring left and taking right"
*> b = undefined
a
-- "taking left and ignoring right
<* b = undefined a
Now we can improve our definitions:
= char '"' *> many (notChar '"') <* char '"'
str concat = char '+' *> many term
print = string "print" *> term
= char '(' *> t <* char ')' inparens t
In the final syntax, the terms are often separated by white space. One way to handle this would be to define that a term can always starts with some white space:
= space *> (str <|> inparens (concat <|> print))
term = many $ char ' ' space
To actually be able to parse a string of characters and produce
something else, we’ll need to implement the most primitive of our
undefineds, namely char
. For this we need to think about
what our Parser would actually be like. One definition is something
which takes a string of characters and turns it into a list of things
and remaining characters:
-- "a parser for things is a parser from strings to list of things and strings"
newtype Parser thing = Parser { parse :: String -> [(thing,String)] }
Now we can define a primitive parser which either accepts or rejects a single character based on a given predicate, which can be used to implement the parsers accepting a single character:
satisfy :: (Char -> Bool) -> Parser Char
pred = Parser $ \case
satisfy :xs | pred x -> [(x,xs)]
x-> []
_
= satisfy . (==)
char = satisfy . (/=) notChar
At this point we have actually implemented the whole logic to
actually parse our toy language. And we haven’t used a parsing library,
monads or applicatives or pretty much anything! The only thing missing
is a couple of undefineds: our five operators <|>
,
<*>
, <*
, *>
,
<$>
as well as two combinators many
and
string
.
We could implements these ourselves, but if we take a close look at the names and semantics, we might recognize these as functions from Functor and Applicative. Let’s see if we can use these fundamental abstractions to implement the remaining pieces.
First we need a Functor instance, which we can derive:
import qualified Data.Functor as F
deriving instance F.Functor Parser
Then instances for Applicative
and
Alternative
, which we have to write manually. (See One more thing for a way to automatically
derive these):
import qualified Control.Applicative as A
instance A.Applicative Parser where
pure x = Parser $ \input -> [(x, input)]
Parser af <*> Parser aa = Parser $ \input ->
| (f, input1) <- af input, (a, input2) <- aa input1]
[(f a, input2)
instance A.Alternative Parser where
= Parser $ \_ -> []
empty Parser p) <|> (Parser q) = Parser $ \input ->
(case p input of
-> q input
[] -> r r
And now the missing implementations are available from the
Functor
and Applicative
modules:
<$>) = (F.<$>)
(<|>) = (A.<|>)
(<*> b = a A.<*> b
a *>) = (A.*>)
(<*) = (A.<*)
(= A.many many
The remaining string
we'd have to implement ourselves,
or use Traversable:
import qualified Data.Traversable as T
= T.traverse char string
Now we can implement the main parsing function and derive a
Show
instance to get a printable AST out:
deriving instance Show Term
= case parse term s of
parseProgram s "")] -> Just t
[(t,-> Nothing _
which actually works!
> parseProgram "(print (+ \"Hello \" \"World!\"))"
-- Just (Print (Concat [Str "Hello ",Str "World!"]))
This is Applicative Parsing.
The Applicative
interface (together with
Alternative
and Functor
) happens to provide
most what is needed to perform parsing. As the Applicative
is a really abstract and general purpose interface, it really makes me
wonder why combinatorial applicative parsing libraries aren't more
popular through various programming languages.
I mentioned in the beginning that many existing monadic parser combinator libraries implement the applicative interface. This means, that if (or when) I would like to utilize a performant, battle-tested parser implementation providing nice error messages with line numbers, instead of this kind of self made junk, I can do it pretty much by just modifying the import statements. For example, to make this example work with MegaParsec (6.x), I can do it like this:
{-# LANGUAGE NoImplicitPrelude, FlexibleContexts, DeriveFunctor #-}
module ParserUsingMegaParsec where
import Prelude (String,($),(.),(/=),Show,show,putStrLn)
import Data.Functor ((<$>))
import Control.Applicative (many,(<|>),(*>),(<*))
import Control.Applicative.Combinators (between)
import Text.Megaparsec (parse)
import Text.Megaparsec.String (Parser)
import Text.Megaparsec.Char (char,notChar,space,satisfy,string)
data Term = Str String | Concat [Term] | Print Term
deriving Show
term :: Parser Term
= space *> (str <|> inparens (concat <|> print))
term
= Str <$> (char '"' *> many (notChar '"') <* char '"')
str concat = Concat <$> (char '+' *> many term)
print = Print <$> (string "print" *> term)
= between (char '(') (char ')') inparens
This is the whole implementation.
The big point is, that in order to do parsing, you don’t actually
need to learn a parser generator or a parser combinator library. You
only need to learn Functor
and Applicative
interfaces, which should already be (but unfortunately are not) taught
in every university program related to software development.
Monads
are needed only when building a context
sensitive parser, where a step requires some information from
previous steps. Using the Applicative
interface leaves (at
least in theory) more optimisation possibilities for the implementation.
If you want to build an understanding of monads in general, I’d
recommend browsing through my own presentations
since unfortunately, I believe, most Monad
tutorials are
missing the point even more than I am ;)
One more thing
Recently I ran into a related
blog post, which defined a way to reduce code by deriving the
Applicative
instances:
import qualified Control.Monad.Trans.State.Strict as ST
newtype Parser thing = Parser { parse :: ST.StateT String Maybe thing }
deriving (Functor, Applicative, Alternative)
= ST.runStateT . parse
parseProgram
pred = Parser . ST.StateT $ \case
satisfy :xs | pred x -> pure (x,xs)
x-> empty _
Please leave any questions and suggestions in the comments! Especially if you think I have completely missed some point, and have a chance to learn something :)
The (compiling and runnable) code examples are available in Github: https://github.com/jyrimatti/app-parsing