Formula Parsing with Scala III: Project Set-Up

Project Repository

You can find the repository for the complete tutorial on GitHub:
https://github.com/argodis/scala-parser/.

For each major milestone in this tutorial we have defined a separate branch. The master branch corresponds to the final state of the project.

The code for this chapter lives in the setup branch of the repository.

Scala Project Set-Up

Tooling

This tutorial was developed using IntelliJ IDEA with its Scala plugin. The build tool is SBT, so we first need to set up the plugins and library dependencies required for development.

SBT Plugin Configuration

We need two SBT plugins: sbt-assembly to package the application into a single fat JAR, and WartRemover to enforce stricter compile-time checks. They are defined in project/plugins.sbt as follows:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")
addSbtPlugin("org.wartremover" % "sbt-wartremover" % "2.2.1")

Feel free to update the release versions if necessary.
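
WartRemover additionally has to be switched on in build.sbt. A minimal configuration, assuming you want the built-in unsafe warts reported as compile errors (the selection of warts here is just an illustration), could look like this:

// build.sbt -- report WartRemover's "unsafe" warts as compile errors
wartremoverErrors ++= Warts.unsafe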

Library Dependencies

The library versions are defined centrally in project/Version.scala to avoid duplication, in particular for dependency families that share a single version (such as the Jackson packages we add later). We start with just two runtime libraries, plus ScalaTest for testing, and will add more when required:

libraryDependencies ++= Seq(
  "com.beust" % "jcommander" % Version.jcommander,
  "org.scala-lang.modules" %% "scala-parser-combinators" % Version.parsercombinators,
  "org.scalatest" %% "scalatest" % Version.scalatest % "test"
)
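
For completeness, here is a sketch of what project/Version.scala might contain. The version numbers below are assumptions, so pin whatever releases you actually use:

// project/Version.scala -- central version definitions (numbers are illustrative)
object Version {
  val jcommander = "1.72"
  val parsercombinators = "1.0.6"
  val scalatest = "3.0.4"
}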

Initial Application Code

Adding the Application Object

Before we start working on the core functionality, we need to be able to handle the three main files that our clients are going to use. To this end, we add the following command-line options:

Name       Description                      Default Location
--input    Location of the input file       ./src/test/resources/data.csv
--formula  Location of the formula file     ./src/main/resources/formula.csv
--output   Where to write the output file   ./output.csv

If the options are not specified on the command line, the application falls back to the defaults defined in Arguments.scala:

import com.beust.jcommander.Parameter

case class Arguments (
  @Parameter(names = Array("--input"), required = false, description = "Location of input data file")
  var input: String = "src/test/resources/data.csv",
  @Parameter(names = Array("--formula"), required = false, description = "Formula configuration file")
  var formula: String = "src/main/resources/formula.csv",
  @Parameter(names = Array("--output"), required = false, description = "Output data file")
  var output: String = "output.csv"
)

Note that the fields are declared as vars because JCommander populates them by reflection after parsing. Having set up the argument handling, we are now ready to add the application object:

import com.beust.jcommander.JCommander

object FormulaParserApp {

  def parseCliArguments(args: Array[String]): Arguments = {
    val arguments = Arguments()
    JCommander
      .newBuilder()
      .addObject(arguments)
      .build()
      .parse(args: _*)
    // Return the parsed arguments
    arguments
  }

  def main(args: Array[String]): Unit = {
    val arguments = parseCliArguments(args)
    println(s"input: ${arguments.input}")
    println(s"formula: ${arguments.formula}")
    println(s"output: ${arguments.output}")
  }
}

Next, we are going to provide a few functions for loading the data.

Loading Input Data and Formula Configuration

As we have seen above, our input data file is located at src/test/resources/data.csv by default. It was generated randomly using https://mockaroo.com/, so it is not necessarily logically consistent, but that does not matter for this tutorial. In principle you can load the data with your favorite CSV library; here we provide a solution based on the Jackson CSV library. This means that we need to add a few more library dependencies:

  "com.fasterxml.jackson.core" % "jackson-databind" % Version.jackson,
  "com.fasterxml.jackson.dataformat" % "jackson-dataformat-csv" % Version.jackson,
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % Version.jackson,

Next we need a case class that represents a row of the CSV file, with the correct data type for each field. We use Double for the floating-point values; note that the parser will later convert the integer values to Double as well.

case class InputDataRow (
  vin: String,
  readout_date: String,
  brand: String,
  model: String,
  mileage: Int,
  coolant_temperature: Double,
  oil_pressure: Double,
  ignition_cycles: Int,
  tyre_pressure: Double
)
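
Since the CSV schema we define below reads the column names from the file header, the first line of data.csv is expected to match these field names exactly:

vin,readout_date,brand,model,mileage,coolant_temperature,oil_pressure,ignition_cycles,tyre_pressure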

Since we are interested in loading the formula configuration from a CSV, we might as well define the row format for that data type:

case class FormulaRow (
  id: Long,
  formula: String
)

Let's encapsulate the data read/write operations in a separate module called Data:

import java.io.{File, FileInputStream}

import com.fasterxml.jackson.dataformat.csv.{CsvMapper, CsvSchema}
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import de.argodis.tutorial.scalaparser.schema.{FormulaRow, InputDataRow}

import scala.io.Source
import scala.reflect.ClassTag
import scala.util.Try
import scala.collection.JavaConverters._

object Data {

  // This is a generic schema that reads the column names from the CSV header
  private val headerSchema: CsvSchema = CsvSchema
    .emptySchema
    .withHeader
    .withColumnSeparator(',')
    .withoutQuoteChar()

  // Read the whole file into a single string, wrapping any IO error in a Failure
  def loadFileContent(filePath: String): Try[String] = {
    Try(Source
      .fromInputStream(new FileInputStream(new File(filePath)), "UTF-8")
      .getLines()
      .mkString("\n"))
  }

  def loadCsv[T](path: String)(implicit ct: ClassTag[T]): Try[List[T]] = {
    val mapper = new CsvMapper with ScalaObjectMapper
    mapper.registerModule(DefaultScalaModule)
    loadFileContent(path).map{content =>
      mapper
        .readerFor(ct.runtimeClass)
        .`with`(headerSchema)
        .readValues[T](content)
        .asInstanceOf[java.util.Iterator[T]]
        .asScala
        .toList
    }
  }

  def loadInput(path: String): Try[List[InputDataRow]] = loadCsv[InputDataRow](path)
  def loadFormula(path: String): Try[List[FormulaRow]] = loadCsv[FormulaRow](path)
}
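
As a quick sanity check, the two loaders could be exercised like this. The snippet is hypothetical and the fallback to an empty list is only for illustration:

// Hypothetical usage: load the rows and fall back to an empty list on failure
val inputRows: List[InputDataRow] = Data.loadInput("src/test/resources/data.csv").getOrElse(List.empty)
val formulas: List[FormulaRow] = Data.loadFormula("src/main/resources/formula.csv").getOrElse(List.empty)
println(s"Loaded ${inputRows.size} data rows and ${formulas.size} formulas")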

As you can see, we provide two functions, Data.loadInput and Data.loadFormula, that load CSV files and convert the data to a list of the appropriate row type. In addition, we have added a few basic tests in src/test/scala/TestData.scala:

import org.scalatest.{FunSuite, Matchers}

import scala.util.Failure

class TestData extends FunSuite with Matchers {

  private val INPUT_DATA_FILE: String = "src/test/resources/data.csv"
  private val FORMULA_FILE: String = "src/main/resources/formula.csv"

  // We are not going to focus too much on this
  test("an attempt to load a file from a wrong path should result in a failure") {
    Data.loadFileContent("./non-existing-file") shouldBe a [Failure[_]]
  }

  test("loading a non-empty file successfully should result in a non-empty string") {
    Data.loadFileContent(INPUT_DATA_FILE).map(content => content should not be "")
  }

  test("attempting to load CSV data from a non-existing file should result in a failure") {
    Data.loadCsv[String]("./non-existing-file") shouldBe a [Failure[_]]
  }

  test("loading a non-empty file successfully should result in a non-empty list of rows") {
    Data.loadCsv[String](INPUT_DATA_FILE).map(list => list.size should not be 0)
  }

  test("can load the input data from the resource folder") {
    Data.loadCsv[InputDataRow](INPUT_DATA_FILE).map(list => list.size shouldBe 1000)
  }

  test("can load the formula configuration file from the resource folder") {
    Data.loadCsv[FormulaRow](FORMULA_FILE).map(list => list.size shouldBe 3)
  }
}
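
Running sbt test should execute these checks against the resource files bundled with the repository.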

Summary

This concludes the preparation of the project. We have created and configured our Scala/SBT project, implemented some initial functionality for loading data, and added a few tests. We are going to start implementing the actual parsing logic in the next chapter.

Continue on to the Lexer component.
