Blog

Formula Parsing with Scala VII: Tying Everything Together

Formula Parsing with Scala VII: Tying Everything Together

Introduction

We have now completed the core of the application, but in order to use it, we still have to stich the different parts together. We will also take this opportunity to smooth out some rough edges in the current implementation. Although we are not going to go deep into advanced functional programming, we would like to sketch basic ideas that will allow for a more elegant solution. You shouldn't need prior knowledge of the underlying abstract concepts to follow the rest of the tutorial.

Computational Model

While it is quite clear what to do, there are many different ways to implement the processing steps. In our case we will try to follow the monadic model of computation that we can represent as follows:

read input data ----> error
       |
       |
       v
read formulas  -----> error
       |
       |
       v
parse formulas -----> error
       |
       |
       v
evaluate formulas --> error
       |
       |
       v
save output --------> error
       |
       |
       v
     success

What this means is that we define the sequence of computations in such a way that it will break automatically if an error occurs. In this way we don't need to write error handling code for each step, we will "collect" the errors and handle them in once place.

Side Note

If you are not familiar with for expressions in Scala, this is just some syntactic sugar that converts a series of nested flatMap and map calls to a sequence in a for block. The following example should clarify this idea:

val A: Option[Int] = Some(2)
val B: Option[Int] = Some(3)
val result = A.flatMap(a => B.map(b => a * b))
result == Some(6)

It looks much simpler if we rewrite it like this:

val result = for {
 a <- A
 b <- B
} yield a * b

Implementation Idea

From the above example it is clear that we need a container type that can propagate possible errors occuring at any step in the computation sequence. Which standard Scala type is best suitable for the job? Option is not suitable, because it can't hold additional information like an error message. Try also doesn't quite fit, because we have built our parsing logic without throwing exceptions. We are left with Either, so we can decide to specialize it like Either[String, T], in order to be able to output error messages to the user.

An initial sketch of the program then is going to look somewhat like this:

val result = for {
  input <- Data.loadInput(arguments.input)
  formulas <- Data.loadFormula(arguments.formula)
  trees <- parseFormulas(formulas)
  output <- process(input, trees)
  msg <- Data.writeOutput(arguments.output, output)
} yield msg

This compositional style has the added advantage that we can build up the solution step-by-step, without needing the full sequence of computations as shown above. Before we define the missing functions above, we have to make sure that everything we have developed so far is compatible with Either[String, T], as we show next.

House Cleaning

We can start updating the main function by considering the csv loading functions only:

def main(args: Array[String]): Unit = {
  val arguments = parseCliArguments(args)
  val result = for {
    input <- Data.loadInput(arguments.input)
    formulas <- Data.loadFormula(arguments.formula)
  } yield formulas
}

It is still possible to compose them, because both return Try, as we shall see next.

I/O Functions

When we started, we created functions that load csv files and convert the data to the corresponding row format of the input data / formula definition rows:

def loadInput(path: String): Try[List[InputDataRow]] = loadCsv[InputDataRow](path)
def loadFormula(path: String): Try[List[FormulaRow]] = loadCsv[FormulaRow](path)

Let us first observe that if the functions fail due to a wrong file path, permission problems, corrupted files, etc., then the only thing we need to do is output the corresponding error message. We don't need to handle the full error object. This means that we can freely convert Try to Either[String, T] as follows:

def loadInput(path: String): Either[String, List[InputDataRow]] = loadCsv[InputDataRow](path)
def loadFormula(path: String): Either[String, List[FormulaRow]] = loadCsv[FormulaRow](path)

Looks like we need to update the loadCsv function as well. Luckily for us, Try has a convenience method toEither that we are going to use. Unfortunately, we note that it returns Either[Throwable, T]. This means that we have to convert it to Either[String, T] explicitly:

.toEither match {
  case Left(error) => Left(error.getMessage)
  case Right(msg) => Right(msg)
}

There could be a more elegant way to solve this, but we are willing to make a compromise for comprehisibility's sake. The full listing of loadCsv looks like this:

def loadCsv[T](path: String)(implicit ct: ClassTag[T]): Either[String, List[T]] = {
  val mapper = new CsvMapper with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)
  loadFileContent(path).map{content =>
    mapper
      .readerFor(ct.runtimeClass)
      .`with`(headerSchema)
      .readValues[T](content)
      .asInstanceOf[java.util.Iterator[T]]
      .asScala
      .toList
  }
}.toEither match {
  case Left(error) => Left(error.getMessage)
  case Right(msg) => Right(msg)
}

In order to proceed, we will define a stub for the FormulaParserApp.process function that for now will serve as a 'glue' between the input and output only:

def process(input: List[InputDataRow], trees: List[FormulaRow]): Either[String, List[String]] = Right(List())

With this change, we can update the main function as follows:

def main(args: Array[String]): Unit = {
  val arguments = parseCliArguments(args)
  val result = for {
    input <- Data.loadInput(arguments.input)
    formulas <- Data.loadFormula(arguments.formula)
  } yield process(input, formulas)

  result match {
    case Left(errorMessage) => println(s"Error: $errorMessage")
    case Right(msg) => println(s"Success: $msg")
  }
}

We are still missing a write function for the output data, so let's add one:

def writeOutput(path: String, data: List[String]): Either[String, String] = Try {
  val bw = new BufferedWriter(new FileWriter(new File(path)))
  bw.write(data.mkString("\n"))
  bw.close()
  s"Wrote ${data.size} lines"
}.toEither match {
  case Left(error) => Left(error.getMessage )
  case Right(msg) => Right(msg)
}

Since this represents the final stage of the computation, we may as well update the main function:

val result = for {
  input <- Data.loadInput(arguments.input)
  formulas <- Data.loadFormula(arguments.formula)
  output <- process(input, formulas)
  msg <- Data.writeOutput(arguments.output, output)
} yield msg

Now that we have defined the beginning and the end of this simple computational pipeline, let us focus on tidying up some earlier code.

Parsing All Formulas

The critical step in this tutorial is obviously the parsing process. We will start by defining a stub for the parsing function based on a compatible signature:

def parse(formulas: List[FormulaRow]): Either[String, List[(Long, FormulaAST)]] = Right(List())

In order to feed the resulting formula ASTs to the processing function, it should accept this type as well:

def process(input: List[InputDataRow], trees: List[(Long, FormulaAST)]): Either[String, List[String]] = Right(List())

With this, we are now ready to declare the full computational pipeline:

val result = for {
  input <- Data.loadInput(arguments.input)
  formulas <- Data.loadFormula(arguments.formula)
  trees <- parse(formulas)
  output <- process(input, trees)
  msg <- Data.writeOutput(arguments.output, output)
} yield msg
Parse Step

What we need to do next is obviously iterate over the elements of List[FormulaRow] and parse each formula:

val trees = formulas.map(row => FormulaParser.parse(row.formula))

In addition, we also need the formula ids for the evaluation step, so we can extend the above line to:

val trees = formulas.map(row => FormulaParser.parse(row.formula).map(ast => (row.id, ast)))

Let's stop for a moment and look back at the FormulaParser.parse function, did we forget something?

def parse(formulaExpression: String): Either[String, FormulaAST] =
  // Unfortunately ParseResult does not provide flatMap
  FormulaLexer.tokenize(formulaExpression) match {
    case FormulaLexer.Success(tokens, _) =>
      formula(FormulaTokenReader(tokens.asInstanceOf[List[FormulaToken]])) match {
        case Success(formulaAST, _) => Right(formulaAST)
        case Failure(msg, _) => Left(msg)
        case Error(msg, _) => Left(msg)
      }
    case FormulaLexer.Failure(msg, _) => Left(msg)
    case FormulaLexer.Error(msg, _) => Left(msg)
  }

This doesn't look like the most elegant code ever written. The reason is that the Scala parser combinator library returns a ParseResult[T] object that doesn't have a flatMap function. We did not consider composition at the time, but now we could improve on this.

First, take a look at tokenize function of FormulaLexer:

def tokenize(formula: String): ParseResult[List[FormulaToken]] = parse(tokens, formula)

We have to convert it to def tokenize(formula: String): Either[String, List[FormulaToken]], so let's match the different types that ParseResult can return:

  def tokenize(formula: String): Either[String, List[FormulaToken]] = parse(tokens, formula) match {
    case Success(tokens, _ ) => Right(tokens)
    case Failure(msg, _) => Left(msg)
    case Error(msg, _) => Left(msg)
  }

With this we can also update the FormulaParser.parse function that now looks like this:

def parse(formulaExpression: String): Either[String, FormulaAST] =
  FormulaLexer
    .tokenize(formulaExpression)
    .flatMap(tokens => formula(FormulaTokenReader(tokens)) match {
      case Success(ast, _) => Right(ast)
      case Failure(msg, _) => Left(msg)
      case Error(msg, _) => Left(msg)
    })

Now that we can produce a list of parsed formulas in the form List[Either[(Long, FormulaAST)]], we need to make a design decision. How do we handle formulas that fail to parse? If a formula is invalid (syntax error), it will produce Left[String], but we can still retain a mix of errors and correctly parsed formulas even when multiple formulas in the configuration are invalid. Do we ignore the failed formulas and how do we notify the user?

Since the definition of the formulas is not the job of our application, we can now decide not to process the data if we encounter faulty formulas. This means that the application will have to terminate with an error message about the syntax error when the first such case is encountered. If all formulas are correct, then we proceed to compute the results.

Up until now we have automatically handled the error case by supplying Either[String, T] as the return type of the individual computational steps. But the parsing of the formulas occurs inside the FormulaParserApp.parse function, where we have defined the perfectly valid list of results val trees: List[Either[String]]'. We need to do some extra work to look for errors and pass the first one as the result of theFormulaParserApp.parse` function.

Scala collections provide a very convenient function to get the first element of the type we are looking for in a List:

val errorOption: Option[String] = trees collectFirst { case Left(error) => error }

We can also collect all correctly parsed formulas as follows:

val correctASTs = trees.collect { case Right(r) => r}

The only thing that remains to do is to convert errorOption into an Either[String, List[]] that the FormulaParserApp.parse function can return:

val result: Either[String, List[(Long, FormulaAST)]] = errorOption.toLeft(correctASTs)

If errorOption is filled, toLeft will return Left(errorMessage), on the other hand it will return toRight(List(...)) if errorOption is None.

We can now finalize the parse function by sorting the list of formula syntax trees as follows:

  def parse(formulas: List[FormulaRow]): Either[String, List[(Long, FormulaAST)]] = {
    val trees = formulas.map(row => FormulaParser.parse(row.formula).map(ast => (row.id, ast)))
    trees
      .collectFirst { case Left(error) => error }
      .toLeft(trees.collect {case Right(r) => r}.sortBy(_._1))
  }

Now if there is a bad formula definition anywhere in the formula configuration file, the program is going to output an error message and halt.

As a final touch, we might want to clarify the error messages a bit in the FormulaLexer.tokenizer

case Failure(msg, input) => Left(s"Lexer Error: '$msg'. Input: $input")
case Error(msg, input) => Left(s"Lexer Error: '$msg'. Input $input")

and FormulaParser.parse functions:

case Failure(msg, input) => Left(s"Parser Error: '$msg'. Input: $input")
case Error(msg, input) => Left(s"Parser Error: '$msg'. Input: $input")

Formula Evaluation

We have come a long away, but we still need to complete the final computational step: evaluating the parsed formulas for every row of input data. We have already stubbed the FormulaParserApp.process function, now let's think about the design.

First, we have to convert the InputDataRow object to a compatible Map[Int, Double] that the FormulaEvaluator.evaluate function accepts. The most obvious way to do this is to define a member function InputDataRow.toMap of the InputDataRow class. If we take a look at the data, we will notice that not all fields have numerical values:

vin readout_date brand model mileage coolant_temperature oil_pressure ignition_cycles tyre_pressure
1GKS1AECXFR441593 24.04.2018 Volkswagen Eurovan 173888 29.91 45.04 813 3.78
SCBBR93W678635158 17.07.2010 Suzuki X-90 29902 92.99 12.18 665 2.35
WAUSG78E26A094153 04.01.2001 Hyundai Equus 150406 107.78 22.49 2550 1.4

Since our numbering scheme for the fields starts from the beginning:

field number
vin 1
readout_date 2
brand 3
model 4
mileage 5
coolant_temperature 6
oil_pressure 7
ignition_cycles 8
tyre_pressure 9

we can see that we should convert only the numerical fields:

  def inputMap: Map[Int, Double] = Map(
    5 -> mileage.toDouble,
    6 -> coolant_temperature,
    7 -> oil_pressure,
    8 -> ignition_cycles.toDouble,
    9 -> tyre_pressure
  )

With this detail out of the way, we can now think about the format of the output row. As we have explained in the beginning, we are going to use the formula ids to define the columns of our output table. Each row in the output table will contain formula results calculated from the corresponding row in the input table:

1 2 3 ... n
row 1 formula 1 row 1 formula 2 row 1 formula 3 ... row 1 formula n
... ... ... ... ...
row n formula 1 row n formula 2 row n formula 3 ... row n formula n

Let us start by creating the csv header. This is just a matter of joining the formula ids that we already have in the formula list:

val header = formulas.map(_._1.toString).mkString(",")

Next, we can evaluate each formula for the given row input:

val output = input.map(row =>
  formulas.map(formulaPair => FormulaEvaluator.evaluate(formulaPair._2, row.inputMap).mkString).mkString(",")
)

Note that we convert the Option[Double] result of FormulaEvaluator.evaluate to a String in such a way, that we are going to produce an empty string if the option is empty.

Finally, we can stitch together the process function as follows:

def process(input: List[InputDataRow], formulas: List[(Long, FormulaAST)]): Either[String, List[String]] = {
  val header = formulas.map(_._1.toString).mkString(",")
  val output = input.map(row =>
    formulas.map(formulaPair => FormulaEvaluator.evaluate(formulaPair._2, row.inputMap).mkString).mkString(",")
  )
  Right(header :: output)
}

Summary

Congratulations! We have now a stand-alone command line application that can evaluate arbitrary algebraic expressions for arbitrary tabular input. The knowledge that we have gained in the past chapters culminated in designing a functional computational pipeline that is defined in a succint manner and handles failure automatically. This, however, just defines a basic example for what is possible to achieve with a custom parser. It can get really interesting when thinking about designing more programming constructs such as logical expressions, functions and more! These fancy features are discussed in the final chapter of this tutorial.

Continue on to Getting Fancy.

0 Comments 0 Comments
0 Comments 0 Comments