### Introduction

In the previous post we created a lexer for our simplified formula syntax. Now we are going to dive into the core of the tutorial - building the actual parser - by defining and composing grammar rules that are used to convert the list of tokens `List[FormulaToken]` to an abstract syntax tree (AST). This is a data structure that represents the calculation expressed in the formula and can be evaluated recursively by executing the operations depicted by its nodes.

For the types of formulas we are considering, an AST is a very straightforward representation of an algebraic expression. Each of the operators `+`, `-`, `*`, `/` is a binary operator, acting on a left and right operand. Thus, we can view the operation it represents as a node in a tree and its operands as children of that node. For example, the formula `a + b` can be depicted as follows:

``````  +
/ \
a   b
``````

The same idea applies recursively to more complicated constructs, such as `(a + b) * (c - d)`:

``````     *
/   \
+     -
/ \   / \
a   b c   d
``````

Here we have used braces to delimit factors in the formula (braces don't need to be represented explicitly). The variables are always tree leaves, so our evaluation has to always start by computing the highest-level nodes first. We can imagine this process by reducing the tree sequentially as follows:

1. Initial Tree:
``````     *
/   \
+     -
/ \   / \
1   2 3   4
``````
1. Compute Level 2 Nodes:
``````     *
/ \
3  -1
``````
1. Compute Level 1 Nodes (root node)
``````-3
``````

Given this brief overview of the parsing process, we can proceed with the implementation next.

### Parser Implementation

We start by extending the `Parsers` class of the Scala parser combinators library. This is a more generic version compared to `RegexParsers` that we used in the lexer. Let us familiarize with the workings of this class before we start implementing the actual logic.

First, we need to read the tokens `List[FormulaToken]` that we have generated in the lexing step. Since the tokens are specific for our application, the first step is to create a token reader that understands these tokens:

``````class FormulaTokenReader(tokens: Seq[FormulaToken]) extends Reader[FormulaToken] {
def atEnd: Boolean = tokens.isEmpty
def pos: Position = NoPosition
}
``````

We can quickly check that the token reader works with a simple test:

``````class TestFormulaTokenReader extends FunSuite with Matchers {
val SIMPLE_FORMULA = "(\$1 + \$2) / (\$3 - \$4) + 100 * \$5"

FormulaLexer
.tokenize(SIMPLE_FORMULA)
}
}
``````

#### Formula Parser

The formula parser inherits from the generic `Parsers` class of the Scala combinators library, since we are dealing with our own tokens and need to define the parsing logic as well. An initial skeleton class definition:

``````object FormulaParser extends Parsers {
// This is required for the token reader
override type Elem = FormulaToken
}
``````
##### Defining Tree Nodes

We have mentioned several times that we would like to build an abstract syntax tree for a particular formula by parsing a list of tokens. This tree consists of several types of nodes that we have to define before using them in the parser. We need a base node type that we are going to call `FormulaAST`, because operator nodes have to hold references to their child operand nodes. Constants and variables are leave nodes and therefore only hold a value or an id. A summary of the various node types is given in the table below:

Operation / Data Type Node Name Left Operand Right Operand Value
Subtract OperatorSubtract FormulaAST FormulaAST -
Multiply OperatorMultiply FormulaAST FormulaAST -
Divide OperatorDivide FormulaAST FormulaAST -
Constant Value Constant - - Double
Variable Id Variable - - Int

How should we implement these in Scala? A straightforward translation to case classes can be found in `nodes.scala`:

``````sealed trait FormulaAST
case class Constant(value: Double) extends FormulaAST
case class Variable(value: Int) extends FormulaAST
case class OperatorAdd(left: FormulaAST, right: FormulaAST) extends FormulaAST
case class OperatorSubtract(left: FormulaAST, right: FormulaAST) extends FormulaAST
case class OperatorMultiply(left: FormulaAST, right: FormulaAST) extends FormulaAST
case class OperatorDivide(left: FormulaAST, right: FormulaAST) extends FormulaAST
``````
##### Handling Constants

We are going to build the formula grammar step by step. We start by handling a single constant in the formula expression, something like `-23.5`. No multiple constants will be allowed, just one. This means that the token list that we are going to process will always have be of the form `List(CONSTANT(number))]`. No other tokens are allowed for now, neither multiple constants like `-23.5 5.32 94`. Anything besides a single `CONSTANT(number)` in the expression will be considered a syntax error.

Let us start by converting the `CONSTANT(number)` token to a node `Constant(number)` that is actually a valid `FormulaAST` tree with a single root node. We consume the token explicitly by using the `accept` method as follows:

``````private def constant = accept("Constant", { case CONSTANT(value) => Constant(value) })
``````

In preparation of handling more complex grammars, we also define a top-level function that should produce a complete AST by consuming the whole list of tokens:

``````private def formula: Parser[FormulaAST] = phrase(constant)
``````

Finally, we just need to add a function that ties everything together by lexing the formula expression string, creating a token reader, and producing the formula syntax tree:

``````object FormulaParser extends Parsers {
// This is required for the token reader
override type Elem = FormulaToken

private def constant = accept("Constant", { case CONSTANT(value) => Constant(value) })
private def formula: Parser[FormulaAST] = phrase(constant)

def parse(formulaExpression: String): Either[String, FormulaAST] =
// Unfortunately ParseResult does not provide flatMap
FormulaLexer.tokenize(formulaExpression) match {
case FormulaLexer.Success(tokens, _) =>
case Success(formulaAST, _) => Right(formulaAST)
case Failure(msg, _) => Left(msg)
case Error(msg, _) => Left(msg)
}
case FormulaLexer.Failure(msg, _) => Left(msg)
case FormulaLexer.Error(msg, _) => Left(msg)
}
}
``````

Don't worry, we are going to improve on this monstrosity when we finish the first version of the parser. For now you can note that `ParseResult` does not implement a `flatMap` function, therefore we can not compose the lexer easily. Even worse, both `FormulaLexer` and `FormulaParser` define their own `Success`, `Failure`, and `Error` objects as you can see above.

In any case, let us test if the parser functions correctly:

``````import de.argodis.tutorial.scalaparser.parser.nodes.Constant
import de.argodis.tutorial.scalaparser.parser.FormulaParser
import org.scalatest.{FunSuite, Matchers}

class TestFormulaParser extends FunSuite with Matchers {
val SIMPLE_FORMULA = "(\$1 + \$2) / (\$3 - \$4) + 100 * \$5"

test("parser can produce a constant") {
FormulaParser
.parse("-23.5")
.map(formulaAST => formulaAST shouldBe Constant(-23.5))
}

test("parser should fail to parse expression '5 10'") {
FormulaParser.parse("5 10") shouldBe a ('Left)
}
}
``````

It looks good! We can now add more functionality by recognizing variables.

##### Handling Variables

The formula grammar rules can be extended to support a single constant or a variable in the formula expression quite easily. To do this, we need to make sure that we have the `Variable` node defined in `nodes.scala`:

``````case class Variable(value: Int) extends FormulaAST
``````

The new grammar rule in `FormulaParser` for the variable should be obvious:

``````private def variable = accept("Variable", { case VARIABLE(id) => Variable(id) })
``````

The last thing that we need to do is to update the top-level function so that it can recognize a constant or a variable as the root node:

``````private def formula: Parser[FormulaAST] = phrase(constant | variable)
``````

And now let's add some new tests:

``````  test("parser can produce a variable") {
FormulaParser
.parse("\$7")
.map(formulaAST => formulaAST shouldBe Variable(7))
}

test("parser should fail to parse expression '\$2 \$9'") {
FormulaParser.parse("\$2 \$9") shouldBe a ('Left)
}

test("parser should fail to parse expression '4.35 \$12'") {
FormulaParser.parse("4.35 \$12") shouldBe a ('Left)
}
``````
##### Handling One Operator

Now we are stepping into the fun part of the tutorial, because here we will have to start to think about "language design" of our formulas which takes into account operator precendence, etc.

First, let's define a more complex grammar for our formulas. We should be able to handle expressions of the form `a + b`, where `a` and `b` can be either constants or variables. Effectively this will create a parse tree of depth 1:

``````  +
/ \
a   b
``````

We would also like to build up on the previous section by still allowing a tree with a single node that can be either a constant or a variable:

`````` constant | variable
``````
###### Terminals

The first step would be to unify the grammar rule for handling constants and variables, because they will always be the leaf nodes of our ASTs. Let's invent a new term called 'terminals' and define the following rule:

``````private def terminal: Parser[FormulaAST] = constant | variable
``````

If we update the top-level grammar rule as follows

``````private def formula: Parser[FormulaAST] = phrase(terminal)
``````

then we should be able to run all tests successfully, so the case of a single node is now taken care of.

###### Operator Rule

Next, let's make sure that we have the addition operator defined in `nodes.scala`:

``````case class OperatorAdd(left: FormulaAST, right: FormulaAST) extends FormulaAST
``````

As you can see, the `add` operator is a binary operator on two expressions, left and right, so we need to reference these as subtrees of the `OperatorAdd` node. At this stage these expressions will simply be terminals, but later on we will be able to use as expressions more complicated constructs such as factors in the formula.

So how do we match a binary operator? The Scala parser combinators library gives us the `~` sequencing operator that can be used to match two sequential tokens as follows:

``````"a b" -> a ~ b
``````

In the case of the operator, we obviously have three tokens that we can map like this:

``````"a + b" -> a ~ + ~ b -> terminals ~ OperatorAdd ~ terminals
``````

So let's define our first operator grammar rule

``````  private def expression: Parser[FormulaAST] =
terminal ~ OPERATOR_ADD ~ terminal ^^ { case left ~ OPERATOR_ADD ~ right => OperatorAdd(left, right) }
``````

In order to use it, we need to add it to the top-level rule and run the tests:

``````// private def formula: Parser[FormulaAST] = phrase(expression | terminal)
private def formula: Parser[FormulaAST] = phrase(expression)
``````

The terminal single-node formulas are still getting recognized, so let's add some new tests for the new syntax:

``````test("parser can produce the expression 3.2 + 7.8") {
FormulaParser.parse("3.2 + 7.8") shouldBe Right(OperatorAdd(Constant(3.2), Constant(7.8)))
}

test("parser can produce the expression -0.75 + \$3") {
FormulaParser.parse("-0.75 + \$3") shouldBe Right(OperatorAdd(Constant(-0.75), Variable(3)))
}

test("parser can produce the expression \$8 + \$9") {
FormulaParser.parse("\$8 + \$9") shouldBe Right(OperatorAdd(Variable(8), Variable(9)))
}

test("parser should fail to parse the expression '\$1 + '") {
FormulaParser.parse("\$1 + ") shouldBe a ('Left)
}

test("parser should fail to parse the expression '+ \$3'") {
FormulaParser.parse("+ \$3") shouldBe a ('Left)
}
``````

Finally, let's add the subtraction operator for completeness:

``````case class OperatorSubtract(left: FormulaAST, right: FormulaAST) extends FormulaAST
``````
``````  private def expression: Parser[FormulaAST] =
terminal ~ (OPERATOR_ADD | OPERATOR_SUBTRACT) ~ terminal ^^ {
case left ~ OPERATOR_SUBTRACT ~ right => OperatorSubtract(left, right)
}
``````
``````  test("parser can produce the expression 3.2 - 7.8") {
FormulaParser.parse("3.2 - 7.8") shouldBe Right(OperatorSubtract(Constant(3.2), Constant(7.8)))
}

test("parser can produce the expression -0.75 - \$3") {
FormulaParser.parse("-0.75 - \$3") shouldBe Right(OperatorSubtract(Constant(-0.75), Variable(3)))
}

test("parser can produce the expression \$8 - \$9") {
FormulaParser.parse("\$8 - \$9") shouldBe Right(OperatorSubtract(Variable(8), Variable(9)))
}

test("parser should fail to parse the expression '\$1 - '") {
FormulaParser.parse("\$1 - ") shouldBe a ('Left)
}

test("parser should fail to parse the expression '- \$3'") {
FormulaParser.parse("- \$3") shouldBe a ('Left)
}
``````
##### Handling Multiple Operators

So far we have confined ourselves to expressions of type `a + b` that can be parsed into a simple tree like:

``````  +
/ \
a   b
``````

Now we need to extend this approach to more generic formulas that can have multiple operators, for instance: `a + b - c`. How does the corresponding parse tree look like? Since we are consuming from the left, we can imagine this also as `a + (b - c)` where `(b - c)` is a sub-expression. This corresponds to the following tree:

``````  +
/ \
a   -
/ \
b   c
``````

In order to match such expressions, we need to observe that any sub-parts of the complete expression are also expressions - terminals are expressions, as well as `(b-c)` from above. This means that an operator can take not only fixed values, but also expressions as its operands.

Generally, such constructs will require recursive definition of the corresponding grammar rule, but we need one more idea in order to implement it. Since we allow for an arbitrary length of the formula, we need to match on optional parts. Schematically, we can represent this idea as follows: `a Option(+ b)`. If there is a `+`, then we match the whole expression `a + b`. Otherwise, we just match `a`.

The corresponding Scala parser combinator operator is simply called `opt` and returns `Some(expression)` if it matches or `None` if it doesn't. With this information in hand, we can define a more generic recursive matcher as follows:

``````  private def expression: Parser[FormulaAST] =
terminal ~ opt((OPERATOR_ADD | OPERATOR_SUBTRACT) ~ expression) ^^ {
case left ~ None => left
case left ~ Some(operator ~ right) => operator match {
case OPERATOR_SUBTRACT => OperatorSubtract(left, right)
}
}
``````

This is a very general approach that we are going to follow when implementing more complex syntactic features, so let's understand this better:

The left operand can be generally an expression or a terminal. Notice, however, that in expressions like `a + b + c + d`, the left-most operand is always a terminal one because we process the right operand recursively. This can be easily understood if we use brackets to represent how the parser consumes the tokens: `a + (b + (c + d))`. This also allows us to quickly sketch the parse tree:

``````  +
/ \
a   +
/ \
b   +
/ \
c   d
``````

The termination condition of the recursion is when the optional matches fails when there is no operator.

Let's add a test to check our implementation:

``````test("parser produce the expression '\$1 + \$2 + \$3'") {
FormulaParser.parse("\$1 + \$2 + \$3") shouldBe Right(tree)
}
``````

Unfortunately, we encounter a small hiccup: so far all token objects inherit from the sealed trait `FormulaToken`. Our latest version of the expression tries to match both operators `(OPERATOR_ADD | OPERATOR_SUBTRACT)` but the subsequent pattern match potentially fails because it tries to match exhaustively all `FormulaToken` types. In order to avoid this problem, we are going to specialize the operators a bit, by building a new hierarchy and seal only the immediate supertype:

``````trait FormulaToken
case object BRACE_LEFT extends FormulaToken
case object BRACE_RIGHT extends FormulaToken
case class VARIABLE(id: Int) extends FormulaToken
case class CONSTANT(value: Double) extends FormulaToken
// Arithmetic Operators
trait ArithmeticOperator extends FormulaToken
// Sum Operators
sealed trait SumOperator extends ArithmeticOperator
case object OPERATOR_SUBTRACT extends SumOperator
// Product Operators
sealed trait ProductOperator extends ArithmeticOperator
case object OPERATOR_MULTIPLY extends ProductOperator
case object OPERATOR_DIVIDE extends ProductOperator
// Modulo Operator
// case object OPERATOR_MODULO extends ArithmeticOperator
``````

Let's finish up by adding a few more tests:

``````test("parser produce the expression '\$1 - \$2 - \$3'") {
val tree = OperatorSubtract(Variable(1), OperatorSubtract(Variable(2), Variable(3)))
FormulaParser.parse("\$1 - \$2 - \$3") shouldBe Right(tree)
}

test("parser produce the expression '\$1 + \$2 - \$3'") {
val tree = OperatorAdd(Variable(1), OperatorSubtract(Variable(2), Variable(3)))
FormulaParser.parse("\$1 + \$2 - \$3") shouldBe Right(tree)
}

test("parser should fail to parse the expression '\$1 + \$2 \$3'") {
FormulaParser.parse("\$1 + \$2 \$3") shouldBe a ('Left)
}

test("parser should fail to parse the expression '\$1 \$2 - \$3'") {
FormulaParser.parse("\$1 \$2 - \$3") shouldBe a ('Left)
}
``````
##### Dealing with Operator Precedence

The next piece of the puzzle is adding product operators to the mix, let's take the formula `a * b + c / d - e` as an example. Here we encounter the concept of operator precedence: if we just mix all four arithmetic operators (`+`, `-`, `*`, `/`), then the parser will naively process the above expression like this: `a * (b + (c / (d - e)))`. Obviously this is wrong, we don't want to divide `c` by `d - e`. The product operators need to be evaluated first. The following parse tree corresponds to the correct data structure:

``````       +
/ \
/   \
/     -
/     / \
*    [/]  e
/ \   / \
a   b c   d
``````

So how do we prioritize the product operations? As you have seen so far, we are always trying to match expressions. Remember that we are trying to match sum operators as follows:

``````terminal ~ operator matcher ~ expression
``````

If the left operand is a terminal (variable or a constant), the match is successful, otherwise we recursively try to match any remaning sum operators until we have consumed the whole phrase. As a tree data structure, this looks like:

``````    operator
/        \
terminal expression
``````

If we compare this graphical representation with the tree for the expression `a * b + c / d - e`, then we can recognize that we can substitute `terminal` with the example multiplication as follows:

``````   operator
/       \
*         \
/ \         \
a   b     expression
``````

This leads us to the idea of creating an expression of the type (pseudocode):

``````product expression ~ sum operators ~ sum expression
``````

where we label the matcher above `sum expression` so that we can reference it recursively.

We can obviously need to split it up into the following two expressions:

``````product expression ~ sum operators ~ sum expression
terminals ~ product operators ~ product expression
``````

We can also think of the whole process as the following sequence, where we omit the recursive steps:

``````match formula -> match sum operators -> match product operators -> match terminals
``````

First, let's add the new operator matcher

``````private def operator_product: Parser[FormulaAST] =
terminal ~ opt((OPERATOR_MULTIPLY | OPERATOR_DIVIDE) ~ operator_product) ^^ {
case left ~ None => left
case left ~ Some(operator ~ right) => operator match {
case OPERATOR_MULTIPLY => OperatorMultiply(left, right)
case OPERATOR_DIVIDE => OperatorDivide(left, right)
}
}
``````

We can now update and rename `private def expression: Parser[FormulaAST]` as:

``````private def operator_sum: Parser[FormulaAST] =
operator_product ~ opt((OPERATOR_ADD | OPERATOR_SUBTRACT) ~ operator_sum) ^^ {
case left ~ None => left
case left ~ Some(operator ~ right) => operator match {
case OPERATOR_SUBTRACT => OperatorSubtract(left, right)
}
}
``````

Finally, we can update the top-level matching function:

``````private def formula: Parser[FormulaAST] = phrase(operator_sum)
``````

Since we are building on the existing syntax, all tests so far should be successful, but we need to add new ones for the multiplication:

``````  test("parser should produce the expression '\$1 * \$2' ") {
val tree = OperatorMultiply(Variable(1), Variable(2))
FormulaParser.parse("\$1 * \$2") shouldBe Right(tree)
}

test("parser should produce the expression '\$1 / \$2' ") {
val tree = OperatorDivide(Variable(1), Variable(2))
FormulaParser.parse("\$1 / \$2") shouldBe Right(tree)
}

test("parser should produce the expression '\$1 * \$2 + \$3' ") {
val tree = OperatorAdd(OperatorMultiply(Variable(1), Variable(2)), Variable(3))
FormulaParser.parse("\$1 * \$2 + \$3") shouldBe Right(tree)
}

test("parser should produce the expression '\$1 - \$2 / \$3' ") {
val tree = OperatorSubtract(Variable(1), OperatorDivide(Variable(2), Variable(3)))
FormulaParser.parse("\$1 - \$2 / \$3") shouldBe Right(tree)
}

test("parser should produce the expression '\$1 * \$2 + \$3 / \$4 - \$5' ") {
val mul = OperatorMultiply(Variable(1), Variable(2))
val div = OperatorDivide(Variable(3), Variable(4))
val tree = OperatorAdd(mul, OperatorSubtract(div, Variable(5)))
FormulaParser.parse("\$1 * \$2 + \$3 / \$4 - \$5") shouldBe Right(tree)
}
``````
##### Handling Factors

The final missing element are factors that are delimited by braces, such as `(\$1 + \$2)` and `(\$3 - \$4)` in the formula `(\$1 + \$2) * (\$3 - \$4)`. Following our arithmetic intuition, these factors have to be recognized and evaluated as operands before applying the operator that combines the results of the evaluation.

Before we discuss how to implement this feature, let us discuss how to match braces. We start by observing that braces are not needed in the actual syntax tree, so we only have to recognize them, but we don't need to convert them to tree nodes in the AST:

``````(1 + 2) * (3 - 4)
``````
``````     *
/ \
/   \
+     -
/ \   / \
1   2 3   4
``````

The Scala parser combinator library provides us with special matching operators `~>` and `<~` that are well-suited for exactly this purpose:

``````L ~> e <~ R
``````

If the full expression `L e R` is matched, then tokens `L` and `R` are consumed and discarded by the parser, it only transforms the middle expression `e` to a `Parser[T]` object:

``````def expression: Parser[FormulaAST] = L ~> e <~ R ^^ { e => something }
``````

In our case the factor is denoted by braces and the inside expression still needs to be evaluated, so we have to match it by the lowest-priority operator matcher as follows:

``````private def factor: Parser[FormulaAST] = BRACE_LEFT ~> operator_sum <~ BRACE_RIGHT
``````

The expression that `operator_sum` matches can be one of the following:

• Sum expression of the form left ~ sum operator ~ right
• Product expression of the form left ~ product operator ~ right
• Factor of the form `( ~ expression ~ )`
• Terminal of the form `\$n` or `number`

It will be successively resolved by the matches that `operator_product` performs.

Finally, we need to think where to fit this factor expression. Remember that the highest priority operator matcher looks for terminals on the left side:

``````private def operator_product: Parser[FormulaAST] =
terminal ~ opt((OPERATOR_MULTIPLY | OPERATOR_DIVIDE) ~ operator_product) ^^ ...
``````

The operands can now be either terminals or factors, so we just need to add this rule to the existing matcher:

``````  private def operator_product: Parser[FormulaAST] =
(factor | terminal) ~ opt((OPERATOR_MULTIPLY | OPERATOR_DIVIDE) ~ operator_product) ^^ {
case left ~ None => left
case left ~ Some(operator ~ right) => operator match {
case OPERATOR_MULTIPLY => OperatorMultiply(left, right)
case OPERATOR_DIVIDE => OperatorDivide(left, right)
}
}
``````

This is it! We have now completely defined our arithmetic formula parser! Let's add a few more tests just to be sure:

``````  test("parser should produce the expression '(1)'") {
FormulaParser.parse("(1)") shouldBe Right(Constant(1))
}

test("parser should produce the expression '((1))'") {
FormulaParser.parse("((1))") shouldBe Right(Constant(1))
}

test("parser should produce the expression '(\$1)'") {
FormulaParser.parse("(\$1)") shouldBe Right(Variable(1))
}

test("parser should produce the expression '(\$1 + \$2)'") {
FormulaParser.parse("(\$1 + \$2)") shouldBe Right(tree)
}

test("parser should produce the expression '(\$1 * \$2)'") {
val tree = OperatorMultiply(Variable(1), Variable(2))
FormulaParser.parse("(\$1 * \$2)") shouldBe Right(tree)

}

test("parser should produce the expression '(\$1 + \$2) + (\$3 + \$4)'") {
FormulaParser.parse("(\$1 + \$2) + (\$3 + \$4)") shouldBe Right(tree)
}

test("parser should produce the expression '(\$1 + 10) * (\$2 - 4.25)' ") {
val factor2 = OperatorSubtract(Variable(2), Constant(4.25))
val tree = OperatorMultiply(factor1, factor2)
FormulaParser.parse("(\$1 + 10) * (\$2 - 4.25)") shouldBe Right(tree)
}

test("parser should produce the expression '((\$1 + 2) * (\$3 + 4)) / (\$5 + 6)' ") {