Headline

A parser combinator library in Haskell

Illustration

Parsec-based parsers are built from parser combinators. For instance, the following trivial expressions denote parser for a digit or a letter, respectively. Such parsers for character classes are readily provided by Parsec:

digit

letter

The following expression denotes a parser for a non-empty sequence of digits; the many1 combinator corresponds essentially to "+" in EBNF notation for context-free grammars:

many1 digit

We will look at other combinators shortly, but let us first run the composed parsers. Parsec provides a runP function. For instance, we can attempt to parse a digit:

> runP digit () "" "1"
Right '1'

The input string for the parser digit is "1". The remaining arguments resolve some parameterization of Parsec which we skip here. The run returns the successfully parsed character in the right summand of an either type; the left operand is reserved for error handling. We see an unsuccessful parse, indeed, in the next example:

> runP digit () "" "x"
Left (line 1, column 1):
unexpected "x"
expecting digit

That is, we receive an error message with line and column information about the discrepancy between actual and expected input. Clearly, the input "x" cannot be parsed as a digit. Let us also run the parser for non-empty sequences on a few inputs:

> runP (many1 digit) () "" "42"
Right "42"
> runP (many1 digit) () "" "42x"
Right "42"
> runP (many1 digit) () "" "x42"
Left (line 1, column 1):
unexpected "x"
expecting digit

The first is successful because the input string "42" is exactly a sequence of digits. The second parse is also successful because the input string "42x" does at least have a sequence of digits as a prefix. The third parse fails because it does not start with a non-empty sequence of digits.

We can compose parser sequentially and by choice:

> runP (letter >> digit) () "" "a1"
Right "1"
> runP (many1 (letter <|> digit)) () "" "a1"
Right "a1"

The first parser parses a sequence of a letter and a digit. The second parser parses any non-empty sequence of letters or digits ("<|>"). Consider the parse tree returned for the first parsers. It is evident that the first component of the sequence does not contribute to the resulting parse tree. This is because the simple form of sequential composition (">>") indeed ignores the result of the first operand. We would need to leverage a more complex form of sequential composition (">>=") to explicitly capture the intermediate results for both operands and return their composition. Thus:

> runP (letter >>= \l -> digit >>= \d -> return [l,d]) () "" "a1"
Right "a1"

This form of sequential composition passes the result from the first operand to the second so that the latter can capture the result with a lambda. We also see how the sequential composition is finished off with a trivial parser with simply returns a value. We can also use the value-passing form of sequential composition to improve the earlier example of a parser for a digit sequence such that the parser returns an actual int rather than a list of characters:

> runP (many1 digit >>= \s -> return (read s :: Int)) () "" "42"
Right 42

That is, we compose many1 digit with a function which converts the parsed string to an int and returns it as the final result. The function return is also a parser combinator, which is used when a given value should be returned as opposed to invoking an actual parser on the input. (If you are familiar with monads, then you realize that Parsec leverages a monad with its operations return, ">>", and ">>=" for parsing, but if you are not aware of monads, then this should not be any problem.)

In the most general case, parsers are of this polymorphic type:

data ParsecT s u m a

The type parameters serve these roles:

  • s: the stream type for the input
  • u: a type for user state, e.g., for a symbol table
  • m: an extra monad to add effects to parsing
  • a: the type of the parse tree
When actual parsing does not involve any underlying monad, then the identity monad is used:

type Parsec s u = ParsecT s u Identity

In simple applications of Parsec, the stream type is String]] and no user state is used. This results in the following simplification; we also provide a simplified variation on runP:

type Parsec' = Parsec String ()

Here is another sample parser. It models names as they are similarly defined in many language syntax. That is, names should start with a letter and proceed with any number of letters or digits:

name :: Parsec' String
name = letter
       >>= \l -> many (letter <|> digit)
       >>= return . (l:)

The interesting bit is how we (need to) compose the initial letter with the remaining sequence. That is, we need to "cons up" the first letter with the remaining sequence. For instance:

> runP' name "a42 b88"
Right "a42"

See Contribution:haskellParsec for an illustration of using Parsec.


johanneshaertel edited this article at Fri, 30 Jun 2017 15:33:10 +0200
Compare revisions Compare revisions

User contributions

    This user never has never made submissions.

    User edits

    Syntax for editing wiki

    For you are available next options:

    will make text bold.

    will make text italic.

    will make text underlined.

    will make text striked.

    will allow you to paste code headline into the page.

    will allow you to link into the page.

    will allow you to paste code with syntax highlight into the page. You will need to define used programming language.

    will allow you to paste image into the page.

    is list with bullets.

    is list with numbers.

    will allow your to insert slideshare presentation into the page. You need to copy link to presentation and insert it as parameter in this tag.

    will allow your to insert youtube video into the page. You need to copy link to youtube page with video and insert it as parameter in this tag.

    will allow your to insert code snippets from @worker.

    Syntax for editing wiki

    For you are available next options:

    will make text bold.

    will make text italic.

    will make text underlined.

    will make text striked.

    will allow you to paste code headline into the page.

    will allow you to link into the page.

    will allow you to paste code with syntax highlight into the page. You will need to define used programming language.

    will allow you to paste image into the page.

    is list with bullets.

    is list with numbers.

    will allow your to insert slideshare presentation into the page. You need to copy link to presentation and insert it as parameter in this tag.

    will allow your to insert youtube video into the page. You need to copy link to youtube page with video and insert it as parameter in this tag.

    will allow your to insert code snippets from @worker.