Formula Parsing with Scala II: Use Cases and Problem Definition

Introduction

In this chapter we look briefly at a few cases where custom text parsing is required, and we define a specific problem that we would like to solve with this technique. This puts us on course for developing a self-contained command-line application that implements the business case described here.

Use Cases for Parsing

Any non-trivial software project will sooner or later need configuration options, especially if it is user-facing and has to be flexible. There are usually three ways to provide configurability, and they are often arranged in a hierarchy (e.g. CLI arguments override configuration file options):

  1. Configuration Files
  2. Environment Variables
  3. Command-Line Arguments
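
As a rough illustration of such a hierarchy, the sketch below merges three option maps so that later sources override earlier ones. The names (`LayeredConfig`, `resolve`) are made up for this example, and the precedence order (file, then environment, then command line) is only an assumption; real applications may order the sources differently.

```scala
object LayeredConfig {
  // Merge the three sources. With ++, keys from the right-hand map
  // replace keys from the left-hand map, so the assumed precedence is
  // file < environment < command line.
  def resolve(
      fileOptions: Map[String, String],
      envVars: Map[String, String],
      cliArgs: Map[String, String]
  ): Map[String, String] =
    fileOptions ++ envVars ++ cliArgs

  def main(args: Array[String]): Unit = {
    // Hypothetical option values, for illustration only.
    val file = Map("port" -> "8080", "logLevel" -> "info")
    val env  = Map("logLevel" -> "warn")
    val cli  = Map("port" -> "9090")
    println(resolve(file, env, cli))
  }
}
```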

In addition, there are several standard formats for organizing configuration files, such as XML, JSON, or YAML. For most applications these are sufficient and express tree-based configuration logic well. There are times, however, when standard solutions just don't work.

As an example, take a look at the configuration of the Nginx Web Server:

user  nginx;
worker_processes  1;

error_log  /var/log/nginx/error.log warn;
pid        /var/run/nginx.pid;

events {
  ...
}

http {
  ...
}

Here we have a non-standard format that allows for custom configuration syntax, expression evaluation, variable substitution, etc. A web server is complex enough that the standard formats are a poor fit for its configuration.

A similar situation can occur even in simpler cases where we only need to evaluate configuration expressions such as (car OR pickup) AND (truck OR bus) or (bicycle OR motorcycle) AND driver.

In this case we are given some (grammar) rules for combining different entities and expressions, but there is no fixed schema for how the configuration should be written. There may be an infinite number of expressions that are all valid according to these rules. The only way to process them is to create a program that can understand and evaluate them according to the grammar.
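
To make this more concrete, here is a minimal sketch of how such an expression could be represented and evaluated once it has been parsed. The types (`BoolExpr`, `Entity`, `And`, `Or`) and the idea of evaluating against a set of "present" entities are assumptions made for this illustration; the parsing step itself is left out.

```scala
object BoolExprSketch {
  // A tiny expression tree for the boolean grammar (hypothetical types).
  sealed trait BoolExpr
  final case class Entity(name: String) extends BoolExpr
  final case class And(left: BoolExpr, right: BoolExpr) extends BoolExpr
  final case class Or(left: BoolExpr, right: BoolExpr) extends BoolExpr

  // An entity is true when its name is contained in the given set.
  def eval(expr: BoolExpr, present: Set[String]): Boolean = expr match {
    case Entity(name) => present.contains(name)
    case And(l, r)    => eval(l, present) && eval(r, present)
    case Or(l, r)     => eval(l, present) || eval(r, present)
  }

  // (car OR pickup) AND (truck OR bus), written out by hand as a tree.
  val vehicles: BoolExpr =
    And(Or(Entity("car"), Entity("pickup")), Or(Entity("truck"), Entity("bus")))

  def main(args: Array[String]): Unit = {
    println(eval(vehicles, Set("car", "bus")))
    println(eval(vehicles, Set("car", "pickup")))
  }
}
```

Because the tree can nest to any depth, the same few lines of evaluation code cover every expression that the grammar allows.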

In general, you might encounter a situation where describing the configuration entities and rules in a custom syntax makes the whole process simpler, easier to understand, and less error-prone. If you take that step, you will have to build a parser for evaluating the configuration. This tutorial is meant to give you some directions on how to do this with the Scala parser combinator library.

Problem Definition

Before we start coding, we need to define the problem that we would like to solve in the first place. We will approach the solution step by step where individual milestones are described in separate chapters.

In a nutshell, we are given some input data in a tabular format. We also have a configuration file that contains a list of formula expressions that we would like to calculate using the input data. All formulas are applied to each row of the input data, thus giving us tabular output that consists of the same number of rows as the input data and where each column contains the formula result for that particular row.
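
The row-by-row computation can be sketched as follows. This is only an outline under simplifying assumptions: every cell is a `Double`, and a formula is modelled as a plain Scala function instead of a parsed expression. The names `Row`, `Formula`, and `applyAll` are made up for this example.

```scala
object RowWiseSketch {
  type Row     = Vector[Double]   // one line of input data (assumed numeric)
  type Formula = Row => Double    // stand-in for a parsed formula

  // Apply every formula to every row: the result has one row per input
  // row and one column per formula.
  def applyAll(rows: Vector[Row], formulas: Vector[Formula]): Vector[Vector[Double]] =
    rows.map(row => formulas.map(formula => formula(row)))

  def main(args: Array[String]): Unit = {
    // Hypothetical data and formulas, for illustration only.
    val rows = Vector(Vector(1.0, 2.0), Vector(3.0, 4.0))
    val formulas = Vector[Formula](r => r(0) + r(1), r => r(0) * r(1))
    applyAll(rows, formulas).foreach(out => println(out.mkString(" ")))
  }
}
```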

Input Data

Let us imagine that we are working in the automobile industry and have just received an input data set that contains diagnostic information from multiple vehicles, collected over several years. The data is stored in a tabular format where each row holds the information read from a vehicle when it visited a service center for regular maintenance or repair, such as:

Short VIN  Readout Date  Brand          Model     Build Year  Mileage  Brake Pads  Oil Pressure
A028050    02.05.2010    BMW            3 Series  2018        150043   60          11.34
F82456     11.08.2017    Audi           A4        2010        89232    30          22.82
2462797    28.12.2013    Volkswagen     Polo      2011        65923    74          34.32
F162873    07.03.2014    Mercedes-Benz  C Class   2017        12984    40          38.42
J180146    08.09.2018    Opel           Astra     2013        81234    13          15.93

After (somehow) calculating the results, we store them in an output format that we discuss next.

Output Data

Our clients need KPIs that are derived from the raw diagnostic data, so we have to compute them using the input table. The results of each formula are stored in the column "KPI #n" of the output table, which has the same number of rows as the input data:

KPI 1 KPI 2 KPI 3 KPI 4
23.42 94.23 23 -12.35
923.91 10.53 120 59.45
-523.43 0.56 34.54
54.10 37.43 89 100.43
0.85 84.32 78 58.32

Formula Configuration

Let's assume that there is a large number of KPIs and that the formulas defining them change frequently. In this scenario it is infeasible to hard-code the computations, since we would have to change the source code of our application and redeploy it each time such a change occurs. This would be impractical because of the development cost and the potential for errors.

The interesting part in this problem is how to design a flexible configuration format for the formulas (KPIs) that can be updated without changing the source code of the application. In summary we need to be able to:

  • assign a unique id to each formula expression
  • add new formula expressions or remove existing ones
  • update existing formula expressions

In addition, the output has to be generated automatically, based solely on the provided data and formula configuration, without any additional user input.

The obvious approach (if we stick to a tabular format) is to simply define two columns in our configuration file - one for the formula id and one for the formula definition as follows:

Formula Id  Formula Expression
1           $6 * ($7 + $8) / 1000
2           0.5 * $8
3           10 * exp(- $7) / ($6 + $8)
4           date($1) + 10
5           age($5 - today)
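
A minimal sketch of reading such a configuration could look like the code below. It assumes that each line consists of an integer id, some whitespace, and the rest of the line as the formula expression; the actual file format is not fixed at this point, and `FormulaDef`, `parseLine`, and `parseConfig` are hypothetical names. The expression is kept as a plain string, since parsing it is the subject of the later chapters.

```scala
import scala.util.Try

object FormulaConfigSketch {
  final case class FormulaDef(id: Int, expression: String)

  // Split a line into "<id> <expression>". Lines that do not start with
  // an integer id (such as a header line) yield None.
  def parseLine(line: String): Option[FormulaDef] =
    line.trim.split("\\s+", 2) match {
      case Array(id, expr) => Try(id.toInt).toOption.map(n => FormulaDef(n, expr.trim))
      case _               => None
    }

  // Collect all definitions and reject the configuration if an id occurs
  // more than once, because formula ids have to be unique.
  def parseConfig(lines: Seq[String]): Either[String, Map[Int, String]] = {
    val defs = lines.flatMap(parseLine)
    val duplicates = defs.groupBy(_.id).collect { case (id, ds) if ds.size > 1 => id }
    if (duplicates.nonEmpty)
      Left(s"duplicate formula ids: ${duplicates.toSeq.sorted.mkString(", ")}")
    else
      Right(defs.map(d => d.id -> d.expression).toMap)
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq("Formula Id Formula Expression", "1 $6 * ($7 + $8) / 1000", "2 0.5 * $8")
    println(parseConfig(lines))
  }
}
```

Silently skipping lines without a leading integer keeps the sketch short; a real application would probably want to report them.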

To keep things simple, we use integer formula ids and name the columns of the output data set after them:

1 2 3 4
23.42 94.23 23 -12.35
923.91 10.53 120 59.45
-523.43 0.56 34.54
54.10 37.43 89 100.43
0.85 84.32 78 58.32

Once we understand the formula syntax better, we can apply arbitrary changes to the configuration file, since the information here is sufficient to automatically calculate the results.

Formula Syntax

Now let us focus on how the formulas are defined. Take for example the expression ($6 + $7) * (($8 - $6) / ($7 + $8)).

First, notice that the variables ($n) in the expression refer to columns in the input data. We have defined a particular format for these variables in order to distinguish them from other constituents of the expressions, such as constants, function names, etc. For example, $1 means that we read the value from column 1 of the input data. Remember that we process one row of input data at a time, which means that we can only use values from the columns in that particular row.
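
The link between a variable and a column can be sketched with a regular expression. The code below is not the parser we are going to build; it only illustrates the idea. It assumes that columns are counted from 1, so that $n maps to index n - 1 of the row, and the names `referencedColumns` and `lookup` are made up for this example.

```scala
import scala.util.Try

object VariableLookupSketch {
  // Matches a dollar sign followed by one or more digits, e.g. $6.
  private val Variable = """\$(\d+)""".r

  // Column numbers referenced by an expression, in order of first appearance.
  def referencedColumns(expression: String): Vector[Int] =
    Variable
      .findAllMatchIn(expression)
      .flatMap(m => Try(m.group(1).toInt).toOption)
      .toVector
      .distinct

  // Assumption: $n refers to the n-th column counted from 1.
  // Returns None if the row has no such column.
  def lookup(row: Vector[String], n: Int): Option[String] =
    row.lift(n - 1)

  def main(args: Array[String]): Unit = {
    val row = Vector("A028050", "02.05.2010", "BMW", "3 Series", "2018", "150043", "60", "11.34")
    val columns = referencedColumns("($6 + $7) * (($8 - $6) / ($7 + $8))")
    columns.foreach(n => println(s"$$$n = ${lookup(row, n).getOrElse("<missing>")}"))
  }
}
```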

What other building blocks do we need to completely define the formula syntax? Before answering that, we have to decide which features we actually need. For simplicity's sake we are going to allow arithmetic operations only (we will add more features in the last chapter of this tutorial). Besides variables, we also allow constants, written as integer or double literals. A summary of this syntax is presented below:

  • Variables ($n)
  • Parentheses
  • Arithmetic operators such as +, -, *, /
  • Constants (Literals)
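
One possible way to model these building blocks in Scala is a small expression tree with an evaluator, sketched below. The type names (`Expr`, `Const`, `Var`, `Add`, `Sub`, `Mul`, `Div`) are assumptions made for this illustration and not necessarily the ones used in the following chapters. The sketch also assumes a row that contains only numbers and columns that are counted from 1. Parentheses need no node of their own, because the nesting of the tree already captures the grouping.

```scala
object FormulaAstSketch {
  sealed trait Expr
  final case class Const(value: Double) extends Expr       // literal
  final case class Var(column: Int) extends Expr           // $n
  final case class Add(left: Expr, right: Expr) extends Expr
  final case class Sub(left: Expr, right: Expr) extends Expr
  final case class Mul(left: Expr, right: Expr) extends Expr
  final case class Div(left: Expr, right: Expr) extends Expr

  // Evaluate an expression against a single row of input data.
  // Var(n) reads index n - 1 and throws if the row is too short.
  def eval(expr: Expr, row: Vector[Double]): Double = expr match {
    case Const(v)  => v
    case Var(n)    => row(n - 1)
    case Add(l, r) => eval(l, row) + eval(r, row)
    case Sub(l, r) => eval(l, row) - eval(r, row)
    case Mul(l, r) => eval(l, row) * eval(r, row)
    case Div(l, r) => eval(l, row) / eval(r, row)
  }

  // ($6 + $7) * (($8 - $6) / ($7 + $8)), written out by hand as a tree.
  val example: Expr =
    Mul(Add(Var(6), Var(7)), Div(Sub(Var(8), Var(6)), Add(Var(7), Var(8))))

  def main(args: Array[String]): Unit = {
    // Hypothetical numeric row: only columns 6, 7, and 8 matter here.
    val row = Vector(0.0, 0.0, 0.0, 0.0, 0.0, 4.0, 2.0, 6.0)
    println(eval(example, row)) // (4 + 2) * ((6 - 4) / (2 + 6)) = 1.5
  }
}
```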

In the next chapter we are going to prepare our Scala project before starting with the development.

Continue to Project Set-Up.
