
Writing a faster, memory-efficient lookup table in Scala

TL;DR

This blog post explains how to build a fast, memory-efficient data lookup from a CSV file in Scala.

When you build a realtime application that needs to look up reference data before doing further processing, you will usually want to cache that data inside the application. Memory becomes a bottleneck if you don't cache the data efficiently, and lookup performance matters just as much.

Let’s say the data is of the following csv format:

Surname,FirstName,Suburb,SurveyorType,Endorsement 
Akers,John,STRATHPINE,Surveyor,"Cadastral, Consulting"
Allen,Mark,KIRWAN,Surveyor,Engineering

This is just dummy data I found online for the write-up.

Now, let's see how we can efficiently keep this data in memory and look it up quickly. By lookup, I mean: given the surname, first name, suburb and surveyor type, return the Endorsement. An SQL translation of that query is given below:

SELECT Endorsement FROM my_file WHERE Surname='Allen' AND FirstName='Mark' AND Suburb='KIRWAN' AND SurveyorType='Surveyor'

It should return Engineering.
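To make the goal concrete, here is a naive linear-scan equivalent of that SQL in plain Scala (a sketch with the two sample rows inlined; the map-based versions below exist precisely to avoid this O(n) scan):

```scala
// Each row: (Surname, FirstName, Suburb, SurveyorType, Endorsement)
val rows = Seq(
  ("Akers", "John", "STRATHPINE", "Surveyor", "Cadastral, Consulting"),
  ("Allen", "Mark", "KIRWAN", "Surveyor", "Engineering")
)

// Linear scan: O(n) per query; fine for 2 rows, too slow for a realtime app
def query(surname: String, fname: String, suburb: String, surveyorType: String): Option[String] =
  rows.collectFirst {
    case (s, f, sub, t, e) if s == surname && f == fname && sub == suburb && t == surveyorType => e
  }
```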

There are many ways to implement this in Scala; I will talk about the following three:

  • Using standard scala Map
  • Optimizing the Scala Map
  • Using fastutils

We can start off by writing helper functions to measure memory usage and the time a lookup takes.

def memInfo(): Unit = {
  System.gc()
  val mem = Runtime.getRuntime.totalMemory() - Runtime.getRuntime.freeMemory()
  println(mem / 1024.0 / 1024.0 + " MB")
}

def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block // call-by-name
  val t1 = System.nanoTime()
  println("Elapsed time: " + (t1 - t0) + "ns")
  result
}
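Because the block is passed by name, time both measures and returns the wrapped expression's value. A minimal usage example (the helper is repeated so the snippet runs standalone; the summation stands in for a real lookup):

```scala
def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block // call-by-name: evaluated here, between the two timestamps
  val t1 = System.nanoTime()
  println("Elapsed time: " + (t1 - t0) + "ns")
  result
}

// Prints the elapsed nanoseconds, and answer still holds the block's result
val answer = time { (1 to 1000).sum }
```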

Now, let's create a case class named Person to hold the lookup keys:

case class Person(surname: String, fname: String, suburb: String, surveyorType: String)

 

A simple standard Scala Map-based implementation

The simplest way to implement a solution for this type of problem is a standard Scala nested Map of Strings, i.e.

Map[String, Map[String, Map[String, Map[String, String]]]]

Let's see how this can be implemented:

import scala.io.Source

class SimpleScalaMap(filePath: String) {

  // Read the file line by line into a Map[String, Map[String, Map[String, Map[String, String]]]],
  // merging as we go so that people who share a surname don't overwrite each other
  private val holder =
    Source.fromFile(filePath, "ISO-8859-1").getLines()
      .foldLeft(Map.empty[String, Map[String, Map[String, Map[String, String]]]]) { (acc, line) =>

        // a little hack since the data I'm using isn't very clean: a naive split
        // truncates quoted fields that contain commas, e.g. "Cadastral, Consulting"
        val list = line.split(",").toList
        val endorsement = if (list.size > 4) list(4) else "Associate"

        val bySurname = acc.getOrElse(list(0), Map.empty[String, Map[String, Map[String, String]]])
        val byFname = bySurname.getOrElse(list(1), Map.empty[String, Map[String, String]])
        val bySuburb = byFname.getOrElse(list(2), Map.empty[String, String])

        acc.updated(list(0),
          bySurname.updated(list(1),
            byFname.updated(list(2),
              bySuburb.updated(list(3), endorsement))))
      }

  def lookup(data: Person): String =
    holder(data.surname)(data.fname)(data.suburb)(data.surveyorType)

}

Now that we have the class, let's look up the following Person record; it should return Engineering, since that's what is stored in the CSV file for that person:

val person = Person(surname = "Zannes", fname = "Dave", suburb = "CORNUBIA", surveyorType = "Surveyor")

val simpleMap = new SimpleScalaMap(filePath)

memInfo()

println(time(simpleMap.lookup(person)))

(screenshot: memory usage and lookup time for simpleMap)

This implementation took almost 3 MB of memory, and a lookup took around a hundred thousand nanoseconds (0.1 ms).

 

Optimising the Map-based implementation

We can definitely optimise the above implementation. Since we only ever query the Endorsement column, the remaining columns can be hashed together, letting us replace the nested Map[String, Map[String, Map[String, Map[String, String]]]] with a single-layer Map[Int, String].

Let's see how this is done; the code only needs a small tweak. Instead of constructing a nested Map, we hash the keys together and build a single-layer map:

//Compute a hash of the keys
val hash = s"${list(0)}${list(1)}${list(2)}${list(3)}".toLowerCase.hashCode
Map(hash -> endorsement)

And in the lookup call, we compute the same hash and look it up in the map:

def lookup(data: Person): String = {
  val hash = s"${data.surname}${data.fname}${data.suburb}${data.surveyorType}".toLowerCase.hashCode
  holder(hash)
}
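One caveat with this trick: hashCode is only 32 bits, so two different key combinations can collide (and plain concatenation is itself ambiguous: "ab" + "c" and "a" + "bc" produce the same string), in which case one entry silently answers for the other. If that risk is unacceptable, a tuple key gives a single-layer map without a hand-rolled hash. A minimal sketch, with one row inlined for illustration:

```scala
// Single-layer map keyed by a tuple instead of a hand-rolled hash
val holder: Map[(String, String, String, String), String] = Map(
  ("allen", "mark", "kirwan", "surveyor") -> "Engineering"
)

def lookup(surname: String, fname: String, suburb: String, surveyorType: String): String =
  holder((surname.toLowerCase, fname.toLowerCase, suburb.toLowerCase, surveyorType.toLowerCase))
```

A tuple key still hashes under the hood, but the map compares full keys on collision, so lookups stay correct; the tradeoff is keeping the key strings in memory.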

Let's see how well this code performs:

(screenshot: memory usage and lookup time for optimizedMap)

The lookup time came down to less than half of what it took last time, and we also saved a little memory.

 

Fastutil based implementation

fastutil extends the Java™ Collections Framework by providing type-specific maps, sets, lists and queues with a small memory footprint and fast access and insertion; provides also big (64-bit) arrays, sets and lists, and fast, practical I/O classes for binary and text files. It is free software distributed under the Apache License 2.0. It requires Java 7 or newer.

We will be using Int2ObjectOpenHashMap as a replacement for the Map[Int, String] of the previous implementation. Let's see how this is done:

import it.unimi.dsi.fastutil.ints.Int2ObjectOpenHashMap

private val holder = new Int2ObjectOpenHashMap[String](map.unzip._1.toArray, map.unzip._2.toArray)

where map is the plain Scala Map[Int, String] from the previous version.
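fastutil isn't on the classpath by default; if you're building with sbt, the dependency looks like this (the version number is an assumption, pick the latest release):

```scala
// build.sbt
libraryDependencies += "it.unimi.dsi" % "fastutil" % "8.5.12"
```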

And in terms of performance:

(screenshot: memory usage and lookup time for fastutil)

As you can see, the lookup time is again cut down to almost half of the previous version.

This can be further improved using Unsafe, which I will write about soon. Meanwhile, feel free to check out the code yourself and leave any feedback you have. 🙂

The complete code can be found here:

https://github.com/akhld/scala-csv-lookup/
