
Formula Parsing with Scala IV: Lexer

Introduction

In order to be able to use the formula expressions from the configuration file, there are basically 3 steps that we need to perform:

  1. Tokenize the formula expression string into syntactic elements - lexing.
  2. Create a tree data structure (AST) that represents the operations in their proper order - parsing.
  3. Use the AST to calculate the result for some input data - evaluation.

In this part of the tutorial we are going to concentrate on the formula lexer, so we can convert the string that contains the expression to a list of tokens. In order to do that, we will first think about what syntactic elements we need, then define Scala traits and regular expressions that we will use to extract the tokens from the formula string.

Formula Syntax

As we discussed in the use case definition, we are going to implement a simplified syntax for the formula expressions before getting fancy. As soon as we have a complete solution and we understand all intricacies of the problem, we can scale up to more functionality such as custom functions, handling dates, etc.

So what exactly is the formula syntax? Let's take a look at an example first:

($1 + $2) / ($3 - $4) + 100 * $5

If we break the above expression down, we can classify the various elements of the formula as follows:

Token        Type      Description                          Symbol
left brace   brace     start of expression                  (
right brace  brace     end of expression                    )
variable     variable  variable reference                   $n, where n is an integer
constant     literal   positive/negative integer or double  e.g. 100, -0.25
add          operator  addition operator                    +
subtract     operator  subtraction operator                 -
multiply     operator  multiplication operator              *
divide       operator  division operator                    /

Tokens

We can start our analysis by observing that each element of the formula has a specific function and can either be always represented by the same symbol (braces, algebraic operators) or by different symbols (variables and constants like $1, -0.25). We call these elements tokens and the job of the lexer is to convert the formula definition string to a sequence of token objects.

How shall we represent the tokens in Scala? Based on the above observation, we can have either:

  1. Tokens that represent only one kind of object (singletons if you wish)
  2. Tokens that can take different values

This matches well with Scala's concept of classes and objects, which leads us straight to the following definitions in parser/tokens/tokens.scala:

Token Definitions in Scala
sealed trait FormulaToken
case object BRACE_LEFT extends FormulaToken
case object BRACE_RIGHT extends FormulaToken
case class VARIABLE(id: Int) extends FormulaToken
case class CONSTANT(value: Double) extends FormulaToken
case object OPERATOR_ADD extends FormulaToken
case object OPERATOR_SUBTRACT extends FormulaToken
case object OPERATOR_MULTIPLY extends FormulaToken
case object OPERATOR_DIVIDE extends FormulaToken

Please note that we are going to convert all literals to doubles for simplicity.
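
Since FormulaToken is a sealed trait, the compiler can verify that pattern matches over tokens are exhaustive, which will come in handy later in the parser and evaluator. Here is a small illustrative helper (our own, not part of the tutorial code) that turns a token back into its surface symbol:

def describe(token: FormulaToken): String = token match {
  case BRACE_LEFT        => "("
  case BRACE_RIGHT       => ")"
  case VARIABLE(id)      => s"$$$id"        // e.g. VARIABLE(1) => "$1"
  case CONSTANT(value)   => value.toString  // e.g. CONSTANT(100.0) => "100.0"
  case OPERATOR_ADD      => "+"
  case OPERATOR_SUBTRACT => "-"
  case OPERATOR_MULTIPLY => "*"
  case OPERATOR_DIVIDE   => "/"
}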

Lexer Implementation

Background on the Scala Combinator Library

Now we are ready to start implementing the lexer component. The Scala parser combinator library provides us with a framework for implementing various types of parsers using composable rules. It provides a wide range of functionality - from simple word tokenizers to complex regular expression-based parsers. In this case we are going to start by defining regular expressions for the individual tokens, then we are going to combine them using standard operators supplied by the library.

Let us define our Lexer object by inheriting from RegexParsers:

object FormulaLexer extends RegexParsers
Mapping Symbols to Tokens

Before we proceed with the implementation, we are going to make sure that we understand how to map the symbols in the formula expression to tokens of type FormulaToken. As an example, we will take the above formula ($1 + $2) / ($3 - $4) + 100 * $5 and convert it into a list of type List[FormulaToken]; that is why we also give the list index in the first column below:

List Index Symbols Token
0 ( BRACE_LEFT
1 $1 VARIABLE(1)
2 + OPERATOR_ADD
3 $2 VARIABLE(2)
4 ) BRACE_RIGHT
5 / OPERATOR_DIVIDE
6 ( BRACE_LEFT
7 $3 VARIABLE(3)
8 - OPERATOR_SUBTRACT
9 $4 VARIABLE(4)
10 ) BRACE_RIGHT
11 + OPERATOR_ADD
12 100 CONSTANT(100.0)
13 * OPERATOR_MULTIPLY
14 $5 VARIABLE(5)
Test Case

Now that we know how to produce our token list, we can specify a test case that we are going to use to verify our implementation:

import de.argodis.tutorial.scalaparser.parser.FormulaLexer
import de.argodis.tutorial.scalaparser.parser.tokens._
import org.scalatest.{FunSuite, Matchers}

class TestFormulaLexer extends FunSuite with Matchers {
  val SIMPLE_FORMULA = "($1 + $2) / ($3 - $4) + 100 * $5"
  val SIMPLE_FORMULA_TOKENS = List(
    BRACE_LEFT, VARIABLE(1), OPERATOR_ADD, VARIABLE(2), BRACE_RIGHT, OPERATOR_DIVIDE,
    BRACE_LEFT, VARIABLE(3), OPERATOR_SUBTRACT, VARIABLE(4), BRACE_RIGHT,
    OPERATOR_ADD, CONSTANT(100.0), OPERATOR_MULTIPLY, VARIABLE(5)
  )

  test("formula is tokenized correctly") {
    FormulaLexer
      .tokenize(SIMPLE_FORMULA)
      .map(tokens => tokens shouldBe SIMPLE_FORMULA_TOKENS)
  }
}
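
Since the lexer is expected to fail on unknown syntactic elements, we could also add a negative test to the same suite (our own addition, not part of the tutorial), using the successful flag of ParseResult:

test("formula with an unsupported symbol is not tokenized") {
  FormulaLexer.tokenize("$1 ? $2").successful shouldBe false
}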
Applying the Lexer

In order to use our newly defined FormulaLexer, we need to somehow call the parse function that we inherit from RegexParsers. It takes a top-level combinator that we are going to call tokens and the formula expression string as input:

parse[T](tokens: Parser[T], formula: String): ParseResult[T]

The return type of the parse function is a special ParseResult[T] object that contains either the result of the tokenization or an error and can be matched as follows:

parseResult match {
  case Success(tokens, input) => {}
  case Failure(msg, input) => {}
  case Error(msg, input) => {}
}

The actual tokenization result type (T in ParseResult[T]) depends on the return type of the top-level combinator tokens. In our case we would like to produce a List[FormulaToken] of token objects, therefore its signature should look like this:

def tokens: Parser[List[FormulaToken]]

Finally, we are in a position to add the function tokenize that will serve as the interface to the lexer: it simply takes the input string and returns the list of FormulaToken objects wrapped in a ParseResult:

object FormulaLexer extends RegexParsers {
  private def tokens: Parser[List[FormulaToken]] = ???
  // Main lexer function to call in other components
  def tokenize(formula: String): ParseResult[List[FormulaToken]] = parse(tokens, formula)
}

In the following sections we are going to flesh out the individual rules and overall tokenization logic so that we can complete the implementation.

Individual Token Matching
Structure of the Matching Functions

Now we can start defining rules for matching individual tokens and converting them to Scala objects. Before we proceed, let's take a look at an example token matching function:

def word: Parser[String] = """[a-z]+""".r ^^ { strToken => strToken.toString }

First, notice that it doesn't take any explicit arguments and returns an object of type Parser[T]. This allows us to combine specialized parsers into more generic ones like this: word1 | word2. The operator | means that the resulting parser can match either word1 or word2. There are a few more operators like this that we will need to implement all matching rules for the lexer. Note that we are going to refer to functions like word as 'matchers' in this tutorial, but this is not an official term.

Second, let's take a look at the body of the function word. It contains a regular expression """[a-z]+""".r that is matched against its input (implicitly, by the higher-order RegexParsers.parse function). If the match is successful, then we apply the function strToken => strToken.toString to the extracted strToken with the ^^ operator. This basically converts the string characters that define the token into a Scala object of another type representing the token (in this simple example we just return a String). Note that the regular expression can also contain groups, so we don't always have to work with the whole matched string, but can also extract parts of it.
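
To make this concrete, here is a minimal usage sketch (our own, with an assumed wrapper object) that runs the word matcher via parseAll, which RegexParsers provides alongside parse:

import scala.util.parsing.combinator.RegexParsers

object WordLexer extends RegexParsers {
  def word: Parser[String] = """[a-z]+""".r ^^ { strToken => strToken.toString }
}

// the whole input matches the regular expression
println(WordLexer.parseAll(WordLexer.word, "hello"))              // prints a Success wrapping "hello"
// the trailing digits are not covered by the regular expression
println(WordLexer.parseAll(WordLexer.word, "hello123").successful) // false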

Example Matching Functions

Now we can apply what we have just learned to the two main cases that we need to cover:

  1. Tokens that are always represented by the same symbol(s).
  2. Tokens that are represented by varying symbols.

Some of the most basic elements of the formula syntax are the round braces ( and ). Here is an example matching function for extracting and converting a left brace from a larger expression:

def brace_left: Parser[FormulaToken] = "(" ^^ ( _ => BRACE_LEFT )

The left brace is always represented by the same symbol (, so we always return a BRACE_LEFT token whenever we encounter this symbol in a formula expression.

As a final example, let's define a function that converts the token $n to a case class instance VARIABLE(n). If we apply our current ideas directly, we get something like this:

def variable: Parser[FormulaToken] = """\$(\d+)""".r ^^ {
  // re-match the token against the same pattern to extract the digits group
  val pattern = """\$(\d+)""".r
  strToken => strToken.toString match {
    case pattern(strNumber) => VARIABLE(strNumber.toInt)
  }
}

However, there should be a better way; this is Scala after all! We need one more idea, namely a way to decompose the token $n into the partial tokens $ and n. We will discuss more ways to do that in the next section, but for now let us present the ~> operator:

def compositeToken: Parser[T] = token1 ~> token2

It matches token1 followed by token2 and keeps only the result of token2. This allows us to define the variable token in a more elegant way:

def variable: Parser[FormulaToken] = 
  "$" ~> """\d+""".r ^^ {id => VARIABLE(id.toInt)}

Note that the string literal "$" and the regular expression """\d+""".r are implicitly converted into parsers, so we don't have to define them as separate matcher functions.
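
For completeness, the library also offers <~, which keeps the result on the left, and ~, which keeps both results and lets us destructure them with a case a ~ b pattern. Here is a small illustrative sketch (our own, assumed to live inside a RegexParsers object, and not part of the formula syntax):

def labelledNumber: Parser[(String, Double)] =
  ("""[a-z]+""".r <~ ":") ~ """\d+(\.\d+)?""".r ^^ {
    case label ~ value => (label, value.toDouble)
  }

// matching the input "rate:0.25" would yield ("rate", 0.25)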

Parser Combinators

As we have seen above, we need to define a top-level function that combines our various token matching functions (parsers) to process the full input string. To this end, the Scala parser combinator library provides a slew of convenience functions, with the most important for us being:

  • | - The 'or' operator can be used to try to match different parsers on the same token p1 | p2 if we don't know exactly what kind of token we have.
  • rep1[T](p: Parser[T]): Parser[List[T]] - Applies parser p repeatedly until it fails. It has to match at least once to be successful.
  • phrase[T](p: Parser[T]): Parser[T] - Applies the parser p to all tokens of the given string. It succeeds if there is no more input left unmatched.
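
Before wiring these into the lexer, here is a tiny isolated sketch (our own, assumed to sit inside a RegexParsers object) of how rep1 and phrase compose: rep1(digit) collects one or more single digits into a List[Int], and phrase additionally requires that the whole input is consumed.

def digit: Parser[Int] = """\d""".r ^^ { _.toInt }
def digits: Parser[List[Int]] = phrase(rep1(digit))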

Using these combinators, we can define our top-level combinator as follows:

def tokens: Parser[List[FormulaToken]] =
  phrase(
    rep1(
      brace_left
        | variable
        | constant
        | brace_right
        | operator_add
        | operator_subtract
        | operator_multiply
        | operator_divide
  )) ^^ { rawTokens => rawTokens }

It consumes the whole phrase, i.e. the whole formula string, and tries to match the composite parser passed to rep1 at least once. This means that if we cannot match any of the tokens above (an unknown syntactic element), the lexer will fail, as it should. Since the job of the lexer is only to produce a list of tokens, it cannot guarantee that the expression is valid (something like ($1 + $2( will be tokenized successfully). Here is a listing of the complete lexer component:

import scala.util.matching.Regex
import scala.util.parsing.combinator.RegexParsers
import de.argodis.tutorial.scalaparser.parser.tokens._

object FormulaLexer extends RegexParsers {
  override def skipWhitespace = true
  override val whiteSpace: Regex = "[ \t\r\f]+".r

  // Braces
  def brace_left: Parser[FormulaToken] = "(" ^^ { _ => BRACE_LEFT }
  def brace_right: Parser[FormulaToken] = ")" ^^ { _ => BRACE_RIGHT }
  // Algebraic operators
  def operator_multiply: Parser[FormulaToken] = "*" ^^ { _ => OPERATOR_MULTIPLY }
  def operator_divide: Parser[FormulaToken] = "/" ^^ { _ => OPERATOR_DIVIDE }
  def operator_add: Parser[FormulaToken] = "+" ^^ { _ => OPERATOR_ADD }
  def operator_subtract: Parser[FormulaToken] = "-" ^^ { _ => OPERATOR_SUBTRACT }
  // Variable
  def variable: Parser[FormulaToken] = "$" ~> """\d+""".r ^^ { id => VARIABLE(id.toInt) }
  // Constant
  def constant: Parser[FormulaToken] = {
    """-?(\d+(\.\d*)?|\d*\.\d+)([eE][+-]?\d+)?[fFdD]?""".r ^^ { value => CONSTANT(value.toDouble) }
  }

  def tokens: Parser[List[FormulaToken]] =
    phrase(
      rep1(
        brace_left
          | variable
          | constant
          | brace_right
          | operator_add
          | operator_subtract
          | operator_multiply
          | operator_divide
    )) ^^ { rawTokens => rawTokens }

  def tokenize(formula: String): ParseResult[List[FormulaToken]] = parse(tokens, formula)
}
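
To see the lexer in action, here is a short usage sketch (our own, not part of the original listing) that runs tokenize on the example formula and unwraps the ParseResult; note that Success, Failure and Error are inner types of the parser, so they are accessed via FormulaLexer:

val result = FormulaLexer.tokenize("($1 + $2) / ($3 - $4) + 100 * $5")

result match {
  case FormulaLexer.Success(tokenList, _) => println(tokenList)        // List(BRACE_LEFT, VARIABLE(1), OPERATOR_ADD, ...)
  case FormulaLexer.Failure(msg, _)       => println(s"failure: $msg")
  case FormulaLexer.Error(msg, _)         => println(s"error: $msg")
}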

Summary

In this part we have figured out how to convert the raw formula definition string into a list of formula tokens. We have also implemented our lexer component FormulaLexer using the Scala parser combinator library that provided us with a framework for defining specific token matching rules with regular expressions and combining them together into a single tokenizer solution. In the next section we are going to learn how to convert the list of tokens List[FormulaToken] into an abstract syntax tree (AST).

Continue on to the Parser Component.
