

Introduction to Functional Programming in Scala

Learn more about Scala, a programming language that supports both object-oriented and functional paradigms.

Scala is a multi-paradigm programming language in the sense that it supports both object-oriented and functional paradigms. It runs on the JVM and can be installed using the instructions found here: http://scala-lang.org/download/install.html.

Let’s explore some of its functional features.

Hello, world!

Who am I not to respect the classic “Hello, world!” program presented when introducing a programming language? So, with my utmost respect for Brian Kernighan, who created this tradition, here’s “Hello, world!” in Scala:


Listing 1: Hello, world!

object HelloWorld {
 def main(args: Array[String]): Unit = {
  println("Hello, world!")
 }
}

The structure of this program consists of a singleton object, HelloWorld, which contains only one method called main. It takes the command line arguments and calls the predefined method println, passing in the “Hello, world!” string.

From this simple program, you may have already noticed that types in Scala follow the variable name (or the parameter name in the case of a function). Indeed, the args parameter is of type Array[String] and the main function returns a value of type Unit. For simplicity’s sake, think of Unit as the typical void of other languages, even if it’s not exactly the same.
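To make the difference between Unit and void a little more concrete, here is a minimal sketch. The names UnitExample, greet and results are made up for illustration and are not part of the listing above. It shows that Unit is a real type with a single value, written (), which can be stored like any other value.

```scala
object UnitExample {
  // A method called only for its side effect returns Unit.
  def greet(name: String): Unit = println(s"Hello, $name!")

  // Unlike void, Unit is a real type with exactly one value, written ().
  // That value can be stored in a collection like any other value.
  val results: List[Unit] = List(greet("Bob"), greet("Alice"))
}
```

Accessing UnitExample.results prints both greetings once and yields a list holding two copies of the () value.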


In Functional Programming you tend not to have classes whose state can be changed — the so-called mutable classes. Rather, your model is represented by immutable classes. Scala offers a nice syntax to create them that falls under the name of case classes. Here is a simple example:

Listing 2. Case classes

case class Person(firstname: String, lastname: String, age: Int)

// create instances of the Person class
val bob: Person = Person("Bob", "Smith", 39)
val alice = Person("Alice", "Brown", 31)

// access fields
val aliceAge = alice.age

If you come from a Java background, you’ll notice much less boilerplate code. Note also the use of the val keyword. In Scala it is used to create immutable variables, such that once you assign a value to a val reference you can no longer change it. For example something like the following is disallowed:

val a = 42
a = 3 // error: reassignment to val

There’s another peculiarity to consider in Listing 2. I didn’t forget to declare the type of the alice variable, I left it out on purpose to demonstrate another nice feature of Scala: type inference. In fact, I could have omitted it for bob as well since Scala’s type inferer is smart enough to understand its type as it did for alice.

Case classes provide many other goodies along with pattern matching, another fundamental pillar of functional programming languages.
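As a rough illustration of a few of those goodies, the sketch below reuses the Person class from Listing 2 and shows the generated toString, structural equality, and the copy method, which builds a modified instance without touching the original. The wrapper object CaseClassExample and the values olderAlice and sameAsAlice are names invented for this example.

```scala
object CaseClassExample {
  case class Person(firstname: String, lastname: String, age: Int)

  val alice = Person("Alice", "Brown", 31)

  // copy returns a new instance; alice itself is never modified.
  val olderAlice = alice.copy(age = 32)

  // Equality is structural: instances with the same field values are equal.
  val sameAsAlice = Person("Alice", "Brown", 31)

  def main(args: Array[String]): Unit = {
    println(alice)                // Person(Alice,Brown,31)
    println(olderAlice.age)       // 32
    println(alice.age)            // 31
    println(alice == sameAsAlice) // true
  }
}
```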

Pattern Matching

If this is your first encounter with pattern matching, you could think of it, as an oversimplification, as an enhanced switch statement. Actually, it’s much more than that. For example, examine the following code:

Listing 3. Pattern Matching

import Shape._

trait Shape

case class Rectangle(base: Double, height: Double) extends Shape
case class Circle(radius: Double) extends Shape

object Shape {

  def area(shape: Shape): Double = shape match {
    case Rectangle(b, h) => b * h
    case Circle(r) => r * r * Math.PI
  }
}

val rectangle: Shape = Rectangle(4, 5)
val circle: Shape = Circle(4)

val rectangleArea = area(rectangle)
val circleArea = area(circle)

The match expression in the area function shows pattern matching. It matches against the shape object passed as a parameter. If it’s of type Rectangle, it extracts its base and height, whereas if it’s a Circle, it extracts its radius. In both cases it computes the area of the shape. The extraction part is called deconstruction.
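To hint at why pattern matching is more than a switch statement, here is a small sketch that combines deconstruction with a guard (the if part of a case). It assumes a sealed variant of the Shape trait, which lets the compiler check that every case is covered. The object MatchExample and the describe function are illustrative names, not part of Listing 3.

```scala
object MatchExample {
  // sealed: all subtypes live in this file, so the compiler can warn
  // about a match that forgets one of them.
  sealed trait Shape
  case class Rectangle(base: Double, height: Double) extends Shape
  case class Circle(radius: Double) extends Shape

  def describe(shape: Shape): String = shape match {
    // Guard: this case applies only when both extracted values are equal.
    case Rectangle(b, h) if b == h => s"square with side $b"
    case Rectangle(b, h)           => s"rectangle $b by $h"
    case Circle(r)                 => s"circle with radius $r"
  }
}
```

On the JVM, describe(Rectangle(4, 4)) gives "square with side 4.0", while describe(Rectangle(4, 5)) falls through to the second case.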

Note also the trait keyword. It’s used to define “interfaces” in Scala. I used quotes for the word since they are actually different from, say, Java interfaces but for now this similarity will do.

The import Shape._ part lets you use Shape’s area function without fully referencing it as Shape.area.

One more thing. You may have noticed that the name Shape is used both for the trait and object definition. In Scala that’s possible and frequently used. In that case the object is called a companion object of the trait and, practically, it has some implications I won’t cover here for the sake of brevity.

Pattern matching sounds good, but could a functional language be defined as such if it didn’t provide functions as first-class citizens? No, of course it couldn’t!

Functions as First Class Citizens

In FP you tend to write pure functions, functions that, given the same input, always return the same output, without producing any side effect. They are functions in the mathematical sense of the term.
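The contrast between pure and impure functions can be sketched as follows. Both functions and the PurityExample object are hypothetical names chosen for this illustration.

```scala
object PurityExample {
  // Pure: the result depends only on the arguments, and nothing else changes.
  def add(a: Int, b: Int): Int = a + b

  // Impure: reads and modifies state that lives outside the function,
  // so two calls with the same (empty) input return different results.
  private var counter = 0
  def nextId(): Int = {
    counter += 1
    counter
  }
}
```

Calling add(2, 3) always yields 5, whereas two consecutive calls to nextId() never return the same value.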

A function in a language is first-class if it can be used like any other type, such as Int, String, Double, and so on. This means that it can be assigned to a variable, passed as a parameter to another function or returned by a function. Consider the following code snippet:

val double: Int => Int = x => x * 2

val increment: Int => Int = x => x + 1

def applyFunc(x: Int, f: Int => Int): Int = f(x)

In this case, double is a function from Int to Int. It takes an integer and doubles it. Notice the type definition: Int => Int. The function body — what follows the equals sign — means: “take the integer x provided by the client code and return its value multiplied by 2”. Similarly the increment function takes an integer and adds 1 to it.

The applyFunc function takes an integer and a function that takes an integer and returns an integer. The return type of applyFunc is still an integer. This function just applies the function f, passed as a parameter, to the value x, also passed as a parameter. So if you want to apply the double function to your integer you can use this function as follows:

applyFunc(21, double) // the result will be 42

Now you need to apply the increment function instead. Easy:

applyFunc(21, increment) // the result will be 22

In the FP world functions such as applyFunc are called higher-order functions (HOFs). A HOF is a function that takes another function as a parameter and/or returns a function as its result.
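The applyFunc example covers the "takes a function" half of that definition. As a sketch of the other half, the snippet below adds a function that returns a function, composes two functions with the standard andThen method, and passes a function to the standard library's map. The names HofExample, multiplyBy, triple, doubleThenIncrement and doubled are invented for this example.

```scala
object HofExample {
  val double: Int => Int = x => x * 2
  val increment: Int => Int = x => x + 1

  // A HOF that returns a function: the result multiplies its input by factor.
  def multiplyBy(factor: Int): Int => Int = x => x * factor

  val triple: Int => Int = multiplyBy(3)

  // andThen is defined on every one-argument function: apply double first,
  // then feed the result to increment.
  val doubleThenIncrement: Int => Int = double.andThen(increment)

  // map on List is itself a HOF: it applies double to every element.
  val doubled: List[Int] = List(1, 2, 3).map(double)

  def main(args: Array[String]): Unit = {
    println(triple(14))              // 42
    println(doubleThenIncrement(20)) // 41
    println(doubled)                 // List(2, 4, 6)
  }
}
```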


Of course, the space allotted to an article is not enough to cover functional programming in its entirety, or all the features of a language about which you could easily write a 1,000-page book. This article didn’t even scratch the surface of Scala but, hopefully, it tickled your brain a little bit.

Big companies such as Twitter, LinkedIn, Netflix, and many others, are already using Scala successfully in production. Now might be a good time to jump on board before you get left out of the boat.

Final Note: In case you’re worrying about your beloved Java library, no problem, you can use any Java API from Scala seamlessly.
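As a small sketch of that interoperability, the snippet below calls two classes from the Java standard library directly from Scala. It assumes Java 8 or later for java.time, and the object and value names are made up for illustration.

```scala
import java.time.LocalDate
import java.util.{ArrayList => JArrayList, Collections}

object JavaInteropExample {
  // A plain java.util.ArrayList, created and mutated exactly as in Java.
  val names = new JArrayList[String]()
  names.add("Bob")
  names.add("Alice")
  Collections.sort(names) // static Java method, called like any Scala method

  // java.time works the same way: 1 July 2017 plus 30 days.
  val date: LocalDate = LocalDate.of(2017, 7, 1).plusDays(30)

  def main(args: Array[String]): Unit = {
    println(names.get(0)) // Alice
    println(date)         // 2017-07-31
  }
}
```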

Alessandro Lacava is a software designer and developer. At the moment, he is mainly interested in functional programming and languages such as Scala, Haskell and the like. He also has fun playing with Domain-Specific Languages (DSLs).


Interesting Data Science Articles and News

IBM Watson Analytics for social media analysis – Social media analysis is the new add-on to Watson Analytics. Explore the five components of Watson Analytics: Explore, Predict, Assemble, Social Media, and Refine.

How do I write an effective elevator pitch?

Elliot Loh

I think of an elevator pitch as more of a state of mind than an actual script. But if I had to propose a formula, it would look something like this:
We solve [problem] by providing [advantage], to help [target] accomplish [target’s goal].
Depending on the stage you’ve reached, you might follow it up with a second sentence about your business model:
We make money by charging [customers] to get [benefit].
So for Geni, I used to say, “We solve the problem of genealogy by matching possible relatives, to help genealogists create one accurate world family tree. We make money by charging enthusiasts for enhanced search and other premium features.” For Yammer, the pitch might be, “We make companies more efficient by providing a live feed of comments and questions, so employees can find answers more quickly. We make money by charging companies that want administrative control on their employee networks.” (Note that I no longer work for either company, so my examples shouldn’t be taken to represent them.)

My approach tends toward simplicity. In an elevator or conference environment, you only have a moment to deliver your pitch. Best to craft a clear statement that can lead to a question by the receiver. And for that question, you should have plenty of material lined up to demonstrate value:

• You’re the first to do this, or
• Why you’re different and therefore better than similar products
• Why your market is worth pursuing
• How much traction you have
• Why this is difficult for others but easy for your team

As others have said, this is something that should be expressed rather than memorized. If you are passionate about your product, this will probably be a matter of whittling your message down, rather than building it up.

The data-driven businesses solving problems you didn’t know existed

Just two years ago, data scientist Alex Pentland proclaimed that we’re just beginning a ‘decade of data’. He wasn’t wrong: the world now produces 2.5 quintillion bytes of data every day.

It’s made up of everything from the photos we take on our smartphones to the sensors that track our weather. And it’s given rise to a whole new sector – businesses that use data in innovative ways, whether it’s working out which route we should drive or helping companies understand why a customer is browsing but not buying.

“In 2000, it cost ten dollars per gigabyte to store data,” says Mike Upchurch, founder of analytics software company Fuzzy Logix. “By 2005 it was 50 cents a gigabyte. Now it’s three cents a gigabyte. So people are just storing astronomical amounts of data.


“But it’s all completely worthless unless you can do two things: ask it a question and take actions on the result. We worked with a major UK supermarket, for example, using weather patterns to help determine what should be on their shelves. This resulted in millions of dollars’ worth of savings in perishable food.”

Business benefits

Conversion optimisation agency PRWD uses data to help companies improve the percentage of website customers who complete the desired goal – whether that’s filling in a survey or buying something.

“There is a big emphasis on investing in data tools. But a lot of companies aren’t investing in the people who will actually interrogate that data and provide the stories for the business to drive an actionable output,” says PRWD’s optimisation strategist Chris McCormick.

“For example, we might look at ‘success data’ – what’s going well for an ecommerce website. Perhaps the data shows that presenting reviews to customers makes them more likely to buy. So let’s make more of those reviews, and bring them to the forefront.”


Enabling app creation

Data’s also behind the innovation success story of the last ten years – apps. Transport apps have seen massive growth – Mapway’s London Tube Map, for example, has been downloaded more than 15 million times. But that data needs work before app developers can use it, and that’s where companies like Transport API come in.

In 2010, the UK government began releasing transport data. But this isn’t always easy to use. It’s around 60 different data feeds, including everything from bus times to tube delays.


“We saw a business opportunity there,” says Emer Coleman, TransportAPI’s business development director. “If you’re a developer and you want to make an app, you have to do a lot of work to integrate those feeds together and clean up any anomalies. We do that for them, and put it out as a single source.”

And it’s not just big business which benefits from Transport API’s source: the company’s three levels of pricing include a freemium model. “You can access 30,000 data hits a month for free,” says Emer. “That means people can experiment, do some research and development and some crucial concept testing – and they can do it for nothing.”

Making things better

Big data specialists Mastodon C’s Witan project is also aiming to use data to drive innovation. It’s building an open source platform that enables cities to use their data better, to help them plan for their future education, employment, housing and energy needs.


The UN estimates that three billion people will live in cities by 2050. So planning cities for the future is a big challenge.

“Many cities don’t have data by city: they have it by ward, or by national level,” says delivery manager Elisabeth Weise. “Another problem is that they might have the data but work in silos: population projections are done by one team, then employment projections are done by another.”

The first tool, allowing boroughs to run their own population projections, has just been launched across the 33 London boroughs. “It’s really freeing up resources and giving much more analytic power to the boroughs,” says Weise.

“There’s been a lot of enthusiasm. This has really been missing: there are a lot of niche products out there but they are very expensive. There’s no big buy-in: you can start with demography, as this underlies so much city planning. Cities are such drivers of growth – but they need to have the proper tools to be able to do it.”

This is a guest blog and may not represent the views of Virgin.com. Please see virgin.com/terms for more details.

Introducing GraphFrames

Ankur Dave
Xiangrui Meng

We would like to thank Ankur Dave from UC Berkeley AMPLab for his contribution to this blog post.

Databricks is excited to announce the release of GraphFrames, a graph processing library for Apache Spark. Collaborating with UC Berkeley and MIT, we have built a graph library based on DataFrames. GraphFrames benefit from the scalability and high performance of DataFrames, and they provide a uniform API for graph processing available from Scala, Java, and Python.

What are GraphFrames?

GraphFrames support general graph processing, similar to Apache Spark’s GraphX library. However, GraphFrames are built on top of Spark DataFrames, resulting in some key advantages:

  • Python, Java & Scala APIs: GraphFrames provide uniform APIs for all 3 languages. For the first time, all algorithms in GraphX are available from Python & Java.
  • Powerful queries: GraphFrames allow users to phrase queries in the familiar, powerful APIs of Spark SQL and DataFrames.
  • Saving & loading graphs: GraphFrames fully support DataFrame data sources, allowing writing and reading graphs using many formats like Parquet, JSON, and CSV.

In GraphFrames, vertices and edges are represented as DataFrames, allowing us to store arbitrary data with each vertex and edge.

An example social network

Say we have a social network with users connected by relationships. We can represent the network as a graph, which is a set of vertices (users) and edges (connections between users). A toy example is shown below.

[Figure: social network graph diagram]

We might then ask questions such as “Which users are most influential?” or “Users A and B do not know each other, but should they be introduced?” These types of questions can be answered using graph queries and algorithms.

GraphFrames can store data with each vertex and edge. In a social network, each user might have an age and name, and each connection might have a relationship type.

[Table: social graph vertices]

[Table: social graph edges]

Simple queries are simple

GraphFrames make it easy to express queries over graphs. Since GraphFrame vertices and edges are stored as DataFrames, many queries are just DataFrame (or SQL) queries.

How many users in our social network have “age” > 35?
We can query the vertices DataFrame:
g.vertices.filter("age > 35")

How many users have at least 2 followers?
We can combine the built-in inDegrees method with a DataFrame query.
g.inDegrees.filter("inDegree >= 2")
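To show what these two queries compute, here is a sketch that uses plain Scala collections in place of Spark DataFrames. It is not the GraphFrames API. The toy data, the case classes User and Relationship, and every other name here are assumptions made for illustration.

```scala
object SocialGraphExample {
  case class User(id: String, name: String, age: Int)
  case class Relationship(src: String, dst: String, relationship: String)

  // Hypothetical toy network (not the data from the example notebook).
  val vertices = List(
    User("a", "Alice", 34),
    User("b", "Bob", 36),
    User("c", "Charlie", 30),
    User("d", "David", 29)
  )
  val edges = List(
    Relationship("a", "b", "friend"),
    Relationship("b", "c", "follow"),
    Relationship("c", "b", "follow"),
    Relationship("d", "a", "follow")
  )

  // Same idea as filtering the vertices on "age > 35".
  val olderThan35: List[User] = vertices.filter(_.age > 35)

  // In-degree: the number of incoming edges per vertex id.
  // Vertices without incoming edges do not appear in the map.
  val inDegrees: Map[String, Int] =
    edges.groupBy(_.dst).map { case (id, incoming) => id -> incoming.size }

  // Same idea as filtering the in-degrees on "inDegree >= 2".
  val atLeastTwoFollowers: Map[String, Int] =
    inDegrees.filter { case (_, degree) => degree >= 2 }
}
```

With this toy data, only Bob is older than 35, and only vertex "b" has two or more incoming edges.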

Graph algorithms support complex workflows

GraphFrames support the full set of algorithms available in GraphX, in all 3 language APIs. Results from graph algorithms are either DataFrames or GraphFrames. For example, what are the most important users? We can run PageRank:

results = g.pageRank(resetProbability=0.15, maxIter=10)

[Table: PageRank results]
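For readers curious about what the resetProbability and maxIter parameters control, the following is a textbook-style PageRank iteration written with plain Scala collections. It is a sketch only, not the GraphX or GraphFrames implementation, and its scores may be scaled differently from what the library returns. The object name, the function signature and the tuple-based edge list are assumptions for illustration.

```scala
object PageRankSketch {
  // edges are (source, destination) pairs.
  def pageRank(
      edges: List[(String, String)],
      resetProbability: Double,
      maxIter: Int
  ): Map[String, Double] = {
    val vertices = edges.flatMap { case (s, d) => List(s, d) }.distinct
    val n = vertices.size
    val outDegree = edges.groupBy(_._1).map { case (v, es) => v -> es.size }

    // Start from a uniform distribution.
    var ranks: Map[String, Double] = vertices.map(v => v -> 1.0 / n).toMap

    for (_ <- 1 to maxIter) {
      // Each vertex splits its current rank evenly among its out-edges.
      val received = edges
        .map { case (s, d) => d -> ranks(s) / outDegree(s) }
        .groupBy(_._1)
        .map { case (v, cs) => v -> cs.map(_._2).sum }

      // With probability resetProbability a random surfer jumps to a random
      // vertex; otherwise the surfer follows an edge.
      // Simplification: vertices without out-edges simply leak their rank.
      ranks = vertices.map { v =>
        v -> (resetProbability / n + (1 - resetProbability) * received.getOrElse(v, 0.0))
      }.toMap
    }
    ranks
  }
}
```

For example, with the edges a->b, c->b, b->a and b->c, vertex b ends up with the highest score because both other vertices point to it.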

GraphFrames also support new algorithms:

  • Breadth-first search (BFS): Find shortest paths from one set of vertices to another
  • Motif finding: Search for structural patterns in a graph
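As a rough idea of what a breadth-first search does, the sketch below finds one shortest path between two vertices using plain Scala collections. It is a simplified single-source, single-target version and does not use the GraphFrames API, which searches between sets of vertices. All names are invented for this illustration.

```scala
object BfsSketch {
  // edges are (source, destination) pairs; returns one shortest path, if any.
  def shortestPath(
      edges: List[(String, String)],
      from: String,
      to: String
  ): Option[List[String]] = {
    val neighbors: Map[String, List[String]] =
      edges.groupBy(_._1).map { case (v, es) => v -> es.map(_._2).distinct }

    // frontier is a FIFO queue of paths, each stored in reverse order.
    @scala.annotation.tailrec
    def loop(frontier: List[List[String]], visited: Set[String]): Option[List[String]] =
      frontier match {
        case Nil => None // every reachable vertex was explored without success
        case path :: rest =>
          val current = path.head
          if (current == to) Some(path.reverse)
          else {
            val next = neighbors.getOrElse(current, Nil).filterNot(visited.contains)
            // Appending at the end keeps the level-by-level order of BFS.
            loop(rest ++ next.map(_ :: path), visited ++ next)
          }
      }

    loop(List(List(from)), Set(from))
  }
}
```

With the edges a->b, b->c, a->c and c->d, the shortest path from a to d is a, c, d, and there is no path from d back to a.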

Motif finding lets us make powerful queries. For example, to recommend whom to follow, we might search for triplets of users A,B,C where A follows B and B follows C, but A does not follow C.

# Motif: A->B->C but not A->C
results = g.find("(A)-[]->(B); (B)-[]->(C); !(A)-[]->(C)")
# Filter out loops (with DataFrame operation)
results = results.filter("A.id != C.id")
# Select recommendations for A to follow C
results = results.select("A", "C")

[Table: motif finding results]
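The logic of the motif query above can be sketched with plain Scala collections as a pair of nested loops over the edge set. This is an illustration of the idea only, not how GraphFrames evaluates motifs, and the names MotifSketch and recommendations are assumptions.

```scala
object MotifSketch {
  // edges are (follower, followed) pairs.
  def recommendations(edges: Set[(String, String)]): Set[(String, String)] =
    for {
      (a, b)  <- edges
      (b2, c) <- edges
      if b == b2                 // A -> B and B -> C
      if !edges.contains((a, c)) // but not A -> C
      if a != c                  // filter out loops
    } yield (a, c)               // recommend that A follows C
}
```

For the edges a->b, b->c and b->a, the only recommendation is that a should follow c.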

The full set of GraphX algorithms supported by GraphFrames is:

  • PageRank: Identify important vertices in a graph
  • Shortest paths: Find shortest paths from each vertex to landmark vertices
  • Connected components: Group vertices into connected subgraphs
  • Strongly connected components: Group vertices into subgraphs in which every vertex can reach every other vertex along directed edges
  • Triangle count: Count the number of triangles each vertex is part of
  • Label Propagation Algorithm (LPA): Detect communities in a graph

GraphFrames integrate with GraphX

GraphFrames fully integrate with GraphX via conversions between the two representations, without any data loss. We can convert our social network to a GraphX graph and back to a GraphFrame.

val gx: Graph[Row, Row] = g.toGraphX
val g2: GraphFrame = GraphFrame.fromGraphX(gx)

See the GraphFrame API docs for more details on these conversions.

What’s next?

Graph-specific optimizations for DataFrames are under active research and development. Watch Ankur Dave’s Spark Summit East 2016 talk to learn more. We plan to include some of these optimizations in GraphFrames for its next release!

Get started with these tutorial notebooks in Scala and Python in the Databricks Community Edition beta program. If you do not have access to the beta yet, join the beta waitlist.
Download the GraphFrames package from the Spark Packages website. GraphFrames are compatible with Spark 1.4, 1.5, and 1.6.
Learn more in the User Guide and API docs.

The code is available on GitHub under the Apache 2.0 license. We welcome contributions! Check the GitHub issues for ideas to work on.