### Introduction

In order to be able to use the formula expressions from the configuration file, there are basically 3 steps that we need to perform:

- Tokenize the formula expression string into syntactic elements -
**lexing**. - Create a tree data structure (AST) that represents the operations in their proper order -
**parsing**. - Use the AST tree to calculate the result for some input data -
**evaluation**.

In this part of the tutorial we are going to concentrate on the formula lexer, so we can convert the string that contains the expression to a list of tokens. In order to do that, we will first think about what syntactic elements we need, then define Scala traits and regular expressions that we will use to extract the tokens from the formula string.

### Formula Syntax

As we discussed in the use case definition, we are going to implement a simplified syntax for the formula expressions before getting fancy. As soon as we have a complete solution and we understand all intricacies of the problem, we can scale up to more functionality such as custom functions, handling dates, etc.

So what exactly is the formula syntax? Let's take a look at an example first:

```
($1 + $2) / ($3 - $4) + 100 * $5
```

If we break the above expression down, we can classify the various elements of the formula as follows:

Token | Type | Description | Symbol |
---|---|---|---|

left brace | brace | start of expression | ( |

right brace | brace | end of expression | ) |

variable | variable | variable | $n where n is an integer |

constant | literal | can be a positive/negative integer/double | integer/double |

add | operator | addition operator | + |

subtract | operator | subtraction operator | - |

multiply | operator | multiplication operator | * |

divide | operator | division operator | \ |

#### Tokens

We can start our analysis by observing that each element of the formula has a specific function and can either be always represented by the same symbol (braces, algebraic operators) or by different symbols (variables and constants like `$1`

, `-0.25`

). We call these elements *tokens* and the job of the lexer is to convert the formula definition string to a sequence of token objects.

How shall we represent the tokens in Scala? Based on the above observation, we can have either:

- Tokens that represent only one kind of object (singletons if you wish)
- Tokens that can take different values

This matches well with Scala's concept of classes and objects, which leads us straight to the following definitions in `parser/tokens/tokens.scala`

:

##### Token Definitions in Scala

```
sealed trait FormulaToken
case object BRACE_LEFT extends FormulaToken
case object BRACE_RIGHT extends FormulaToken
case class VARIABLE(id: Int) extends FormulaToken
case class CONSTANT(value: Double) extends FormulaToken
case object OPERATOR_ADD extends FormulaToken
case object OPERATOR_SUBTRACT extends FormulaToken
case object OPERATOR_MULTIPLY extends FormulaToken
case object OPERATOR_DIVIDE extends FormulaToken
```

Please note that we are going to convert all literals to doubles for simplicity.

#### Lexer Implementation

##### Background on the Scala Combinator Library

Now we are ready to start implementing the lexer component. The Scala parser combinator library provides us with a framework for implementing various types of parsers using composable rules. It provides a wide range of functionality - from simple word tokenizers to complex regular expression-based parsers. In this case we are going to start by defining regular expressions for the individual tokens, then we are going to combine them using standard operators supplied by the library.

Let us define our Lexer object by inheriting from `RegexParser`

:

```
object FormulaLexer extends RegexParser
```

##### Mapping Symbols to Tokens

Before we proceed with the implementation, we are going to make sure that we understand how to map the symbols in the formula expression to tokens of type `FormulaToken`

. As an example, we will take the above formula `($1 + $2) / ($3 - $4) + 100 * $5`

and convert it into a list of type `List[FormulaToken]`

, that is why we also give the list index in the first column below:

List Index | Symbols | Token |
---|---|---|

0 | ( | BRACE_LEFT |

1 | $1 | VARIABLE(1) |

2 | + | OPERATOR_ADD |

3 | $2 | VARIABBLE(2) |

4 | ) | BRACE_RIGHT |

5 | / | OPERATOR_DIVIDE |

6 | ( | BRACE_LEFT |

7 | $3 | VARIABLE(3) |

8 | - | OPERATOR_SUBTRACT |

9 | $4 | VARIABLE(4) |

10 | ) | BRACE_RIGHT |

11 | + | OPERATOR_ADD |

12 | 100 | CONSTANT(100.0 |

13 | * | OPERATOR_MULTIPLY |

14 | $5 | VARIABLE(5) |

##### Test Case

Now that we know how to produce our token list, we can specify a test case that we are going to use to verify our implementation:

```
import de.argodis.tutorial.scalaparser.parser.FormulaLexer
import de.argodis.tutorial.scalaparser.parser.tokens._
import org.scalatest.{FunSuite, Matchers}
class TestFormulaLexer extends FunSuite with Matchers {
val SIMPLE_FORMULA = "($1 + $2) / ($3 - $4) + 100 * $5"
val SIMPLE_FORMULA_TOKENS = List(
BRACE_LEFT, VARIABLE(1), OPERATOR_ADD, VARIABLE(2), BRACE_RIGHT, OPERATOR_DIVIDE,
BRACE_LEFT, VARIABLE(3), OPERATOR_SUBTRACT, VARIABLE(4), BRACE_RIGHT,
OPERATOR_ADD, CONSTANT(100.0), OPERATOR_MULTIPLY, VARIABLE(5)
)
test("formula is tokenized correctly") {
FormulaLexer
.tokenize(SIMPLE_FORMULA)
.map(tokens => tokens shouldBe SIMPLE_FORMULA_TOKENS)
}
}
```

##### Applying the Lexer

In order to use our newly defined `FormulaLexer`

, we need to somehow call the `parse`

function that we inherit from `RegexParser`

. It takes a top-level combinator that we are going to call `tokens`

and the formula expression string as input:

```
parse[T](tokens: Parser[T], formula: String): ParseResult[T]
```

The return type of the `parse`

function is a special `ParseResult[T]`

object that contains either the result of the tokenization or an error and can be matched as follows:

```
parseResult match {
case Success(tokens, input) => {}
case Failure(msg, input) => {}
case Error(msg, input) => {}
}
```

The actual tokenization result type (`T`

in `ParseResult[T]`

) depends on the return type of the top-level combinator `tokens`

. In our case we would like to produce a list `List[FormulaToken]`

of `FormulaToken`

objects, therefore its signature should look like this:

```
def tokens: Parser[List[FormulaToken]]
```

Finally, we are in position to add the function `tokenize`

that will serve as the interface to the lexer functionality that simply takes the input string and returns a list of `FormulaToken`

objects:

```
object FormulaLexer extends RegexParsers {
private def tokens: Parser[List[FormulaToken]] = ???
// Main lexer function to call in other components
def tokenize(formula: String): ParseResult[List[FormulaToken]] = parse(tokens, formula)
}
```

In the following sections we are going to flesh out the individual rules and overall tokenization logic so that we can complete the implementation.

##### Individual Token Matching

###### Structure of the Matching Functions

Now we can start defining rules for matching individual tokens and converting them to Scala objects. Before we proceed, let's take a look at an example token matching function:

```
def word: Parser[String] = """[a-z]+""".r ^^ { strToken => strToken.toString }
```

First, notice that it doesn't take any explicit arguments and returns an object of type `Parser[T]`

. This allows us to combine specialized parsers into more generic ones like this: `word1 | word2`

. The operator `|`

means that the resulting parser can match either `word1`

or `word2`

. There are a few more operators like this that we will need to implement all matching rules for the lexer. Note that we are going to refer to functions like `word`

as 'matchers' in this tutorial, but this is not an official term.

Second, let's take a look at the body of the function `word`

. It contains a regular expression `"""[a-z]+""".r`

that is matched against its input (implicitly by the higher-order `RegexParsrs.parse`

function). If the match is successful, then we can apply the function `strToken => strToken.toString`

to the extracted `strToken`

with the `^^`

operator. This basically converts the string characters that define the token into a Scala object of another type representing the token (in this simple example we just return String). Note that the regular expression can also contain groups, so we don't need to always pass the whole matched string, but also parts of it.

###### Example Matching Functions

Now we can apply what we have just learned to the two main cases that we need to cover:

- Tokens that are always represented by the same symbol(s).
- Tokens that are represented by varying symbols.

Some of the most basic elements of the formula syntax are the round braces `(`

and `)`

. Here is an example matching function for extracting and converting a left brace from a larger expression:

```
def brace_left: Parser[FormulaToken] = "(" ^^ ( _ => BRACE_LEFT )
```

The left brace is always represented by the same symbol `(`

, so we always return a `BRACE_LEFT`

token whenever we encounter this symbol in a formula expression.

As a final example, let's define a function that converts the token `$n`

to a case class instance `VARIABLE(n)`

. If we apply our current ideas directly, we get something like this:

```
def variable: Parser[FormulaToken] = """\$(\d+)""" ^^ {
val pattern = """\$(\d+)""".r
strToken => strToken.toString match {
case pattern(strNumber) => VARIABLE(strNumber.toInt)
}
}
```

However, there should be a better way, this is Scala after all! We need one more idea, namely a way to decompose the token `$n`

into partial tokens `$`

and `n`

. We will discuss more ways to do that in the next section, but for now let us present the `~>`

operator:

```
def compositeToken: Parser[T] = token1 ~> token2
```

It matches a composite token of type `token1token2`

and returns `token2`

. It allows us to define the variable token in a more elegant way:

```
def variable: Parser[FormulaToken] =
"$" ~> """\d+""".r ^^ {id => VARIABLE(id.toInt)}
```

Note that `$`

and `"""\d+"""`

are actually automatically recognized and we don't have to define them as separate functions.

#### Parser Combinators

As we have seen above, we need to define a top-level function that combines our various token matching functions (parsers) to process the full input string. To this end, the Scala parser combinator library provides a slew of convenience functions, with the most important for us being:

`|`

- The 'or' operator can be used to try to match different parsers on the same token`p1 | p2`

if we don't know exactly what kind of token we have.`rep1[T](p: Parser[T]): Parser[List[T]]`

- Applies parser`p`

repeatedly until it fails. It has to match at least once to be successful.`phrase[T](p: Parser[T]): Parser[T]`

- Applies the parser p to all tokens of the given string. It succeeds if there is no more input left unmatched.

Using these combinators, we can define our top-level combinator as follows:

```
def tokens: Parser[List[FormulaToken]] =
phrase(
rep1(
brace_left
| variable
| constant
| brace_right
| operator_add
| operator_subtract
| operator_multiply
| operator_divide
)) ^^ { rawTokens => rawTokens }
```

It consumes the whole *phrase*, i.e. the whole formula string and tries to match the composite parser passed to rep1 at least one time. This means that if we can not match any of the tokens above (unknown syntactic element), the lexer will fail, as it should. Since the job of the lexer is to only produce a list of tokens, it can not guarantee that the expression is valid (something like `($1 + $2(`

will be tokenized successfully). Here is a listing of the complete lexer component:

```
import scala.util.matching.Regex
import scala.util.parsing.combinator.RegexParsers
import de.argodis.tutorial.scalaparser.parser.tokens._
object FormulaLexer extends RegexParsers {
override def skipWhitespace = true
override val whiteSpace: Regex = "[ \t\r\f]+".r
// Braces
def brace_left: Parser[FormulaToken] = "(" ^^ { _ => BRACE_LEFT }
def brace_right: Parser[FormulaToken] = ")" ^^ { _ => BRACE_RIGHT }
// Algebraic operators
def operator_multiply: Parser[FormulaToken] = "*" ^^ { _ => OPERATOR_MULTIPLY }
def operator_divide: Parser[FormulaToken] = "/" ^^ { _ => OPERATOR_DIVIDE }
def operator_add: Parser[FormulaToken] = "+" ^^ { _ => OPERATOR_ADD }
def operator_subtract: Parser[FormulaToken] = "-" ^^ { _ => OPERATOR_SUBTRACT }
// Variable
def variable: Parser[FormulaToken] = "$" ~> """\d+""".r ^^ { id => VARIABLE(id.toInt) }
// Constant
def constant: Parser[FormulaToken] = {
"""-?(\d+(\.\d*)?|\d*\.\d+)([eE][+-]?\d+)?[fFdD]?""".r ^^ { value => CONSTANT(value.toDouble) }
}
def tokens: Parser[List[FormulaToken]] =
phrase(
rep1(
brace_left
| variable
| constant
| brace_right
| operator_add
| operator_subtract
| operator_multiply
| operator_divide
)) ^^ { rawTokens => rawTokens }
def tokenize(formula: String): ParseResult[List[FormulaToken]] = parse(tokens, formula)
}
```

### Summary

In this part we have figured out how to convert the raw formula definition string into a list of formula tokens. We have also implemented our lexer component `FormulaLexer`

using the Scala parser combinator library that provided us with a framework for defining specific token matching rules with regular expressions and combining them together into a single tokenizer solution. In the next section we are going to learn how to convert the list of tokens `List[FormulaToken]`

into an abstract syntax tree (AST).