
Apache Flink: New Hadoop contender squares off against Spark


The quest to replace Hadoop’s aging MapReduce is a bit like waiting for buses in Britain. You wait a really long time, then a bunch come along at once. We already have Tez and Spark in the mix, but there’s a new contender for the heart of Hadoop, and it comes from Europe: Apache Flink (German for “quick” or “nimble”).



Flink sprang from Berlin’s Technical University, and it used to be known as Stratosphere before it was added to Apache’s incubator program. It’s a replacement for Hadoop MapReduce that works in both batch and streaming modes, eliminating the mapping and reducing jobs in favor of a directed graph approach that leverages in-memory storage for massive performance gains.

Some of you may have read the last paragraph and thought, “Hang on, isn’t that Apache Spark?” You’d be right; Spark and Flink have a lot in common. Here’s some Scala that shows a simple word count operation in Flink:

case class Word (word: String, frequency: Int)
val counts = text
 .flatMap {line => line.split(" ").map(word => Word(word,1))}
 .groupBy("word").sum("frequency") // group by the word field, then total the counts

Here’s a Scala implementation of word count for Spark:

val counts = text
 .flatMap(line => line.split(" ")).map(word => (word, 1))
 .reduceByKey{case (x, y) => x + y}

As you can see, while there are some differences in syntactic sugar, the APIs are rather similar. I’m a fan of Flink’s use of case classes over Spark’s tuple-based PairRDD construction, but there’s not much in it. Given that Apache Spark is now a stable technology used in many enterprises across the world, another data processing engine seems superfluous. Why should we care about Flink?

The reason Flink may be important lies in the dirty little secret at the heart of Spark Streaming, one you may have come across in a production setting: Instead of being a pure stream-processing engine, it is in fact a fast-batch operation working on a small part of incoming data during a unit of time (known in Spark documentation as “micro-batching”). For many applications, this is not an issue, but where low latency is required (such as financial systems and real-time ad auctions) every millisecond lost can lead to monetary consequences.
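To make the latency cost of micro-batching concrete, here is a minimal sketch in plain Scala. It uses no Spark or Flink APIs; the names `Event`, `microBatchLatencies` and `perRecordLatencies` are hypothetical, and the model assumes a batch is processed the instant its interval closes.

```scala
// Illustrative model only: how long each event waits before it is processed.
object MicroBatchSketch {
  /** An event stamped with its arrival time in milliseconds. */
  final case class Event(id: String, arrivalMs: Long)

  /** Micro-batching: an event is handled only when its batch interval closes,
    * so its latency is (end of the interval it falls in) - (arrival time). */
  def microBatchLatencies(events: Seq[Event], batchIntervalMs: Long): Map[String, Long] =
    events.map { e =>
      val batchEnd = (e.arrivalMs / batchIntervalMs + 1) * batchIntervalMs
      e.id -> (batchEnd - e.arrivalMs)
    }.toMap

  /** Per-record streaming: each event is handled on arrival, so the only
    * delay is an assumed constant processing cost. */
  def perRecordLatencies(events: Seq[Event], processingMs: Long): Map[String, Long] =
    events.map(e => e.id -> processingMs).toMap
}
```

With a 500 ms batch interval, an event arriving at 10 ms waits 490 ms under this model, while the per-record path delays it only by the processing cost. That gap is what matters in latency-sensitive systems.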

Flink flips this on its head. Whereas Spark is a batch processing framework that can approximate stream processing, Flink is primarily a stream processing framework that can look like a batch processor. Immediately you get the benefit of being able to use the same algorithms in both streaming and batch modes (exactly as you do in Spark), but you no longer have to turn to a technology like Apache Storm if you require low-latency responsiveness. You get all you need in one framework, without the overhead of programming and maintaining a separate cluster with a different API.

Also, Flink borrows from the crusty-but-still-has-a-lot-to-teach-us RDBMS to bring us an aggressive optimization engine. Similar to a SQL database’s query planner, the Flink optimizer analyzes the code submitted to the cluster and produces what it thinks is the best pipeline for running on that particular setup (which may be different if the cluster is larger or smaller).

For extra speed, it allows iterative processing to take place on the same nodes rather than having the cluster run each iteration independently. With a bit of reworking of your code to give the optimizer some hints, it can increase performance even further by performing delta iterations only on parts of your data set that are changing (in some cases offering a five-fold speed increase over Flink’s standard iterative process).
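The idea behind delta iterations can be sketched in plain Scala, without Flink's API. The example below is an assumption-laden illustration rather than Flink's implementation: it finds connected components by propagating the smallest vertex id, and each round revisits only the vertices whose label changed in the previous round (the "workset"). The object and method names are hypothetical.

```scala
// Illustrative delta iteration: only changed vertices are reprocessed each round.
object DeltaIterationSketch {
  /** Connected components by min-label propagation on an undirected graph.
    * Assumes every edge endpoint appears in `vertices`.
    * Returns the final labels and the workset size seen in each round. */
  def connectedComponents(vertices: Set[Int],
                          edges: Seq[(Int, Int)]): (Map[Int, Int], Vector[Int]) = {
    // Adjacency list, with each edge stored in both directions.
    val neighbours: Map[Int, Set[Int]] =
      edges.flatMap { case (a, b) => Seq(a -> b, b -> a) }
        .groupBy(_._1)
        .map { case (v, es) => v -> es.map(_._2).toSet }

    var labels: Map[Int, Int] = vertices.map(v => v -> v).toMap // solution set
    var workset: Set[Int] = vertices                            // changed last round
    var worksetSizes = Vector.empty[Int]

    while (workset.nonEmpty) {
      worksetSizes :+= workset.size
      // Each changed vertex proposes its label to its neighbours; keep the smallest.
      val proposals: Map[Int, Int] = workset.toSeq
        .flatMap(v => neighbours.getOrElse(v, Set.empty[Int]).toSeq.map(n => n -> labels(v)))
        .groupBy(_._1)
        .map { case (n, ps) => n -> ps.map(_._2).min }
      // Only strictly smaller labels count as a change.
      val improved = proposals.filter { case (n, l) => l < labels(n) }
      labels = labels ++ improved
      workset = improved.keySet
    }
    (labels, worksetSizes)
  }
}
```

On a six-vertex graph with edges (1,2), (2,3) and (4,5), the workset shrinks from 6 to 3 to 1 vertices across the rounds, so later iterations touch only a fraction of the data.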

Flink has a few more tricks up its sleeve. It is built to be a good YARN citizen (which Spark has not quite achieved yet), and it can run existing MapReduce jobs directly on its execution engine, providing an incremental upgrade path that will be attractive to organizations already heavily invested in MapReduce and loath to start from scratch on a new platform. Flink even works on Hortonworks’ Tez runtime, where it sacrifices some performance for the scalability that Tez can provide.

In addition, Flink takes the approach that a cluster should manage itself rather than require a heavy dose of user tuning. To this end, it has its own memory management system, separate from Java’s garbage collector. While this is normally Something You Shouldn’t Do, high-performance clustered computing changes the rules somewhat. By managing memory explicitly, Flink almost eliminates the memory spikes you often see on Spark clusters. To aid in debugging, Flink supplies its equivalent of a SQL EXPLAIN command. You can easily get the cluster to dump a JSON representation of the pipelines it has constructed for your job, and you can get a quick overview of the optimizations Flink has performed through a built-in HTML viewer, providing better transparency than in Spark at times.

But let’s not count out Spark yet. Flink is still an incubating Apache project. It has only been tested in smaller installations of up to 200 nodes and has limited production deployment at this time (although it’s said to be in testing at Spotify). Spark has a large lead when it comes to mature machine learning and graph processing libraries, although Flink’s maintainers are working on their own versions of MLlib and GraphX. Flink currently lacks a Python API, and most important, it does not have a REPL (read-eval-print-loop), so it’s less attractive to data scientists — though again, these deficiencies have been recognized and are being remedied. I’d bet on both a REPL and Python support arriving before the end of 2015.

Flink seems to be a project that has definite promise. If you’re currently using Spark, it might be worthwhile standing up a Flink cluster for evaluation purposes (especially if you’re using Spark Streaming). However, I wonder whether all of the “next-generation MapReduce” communities (including Tez and Impala along with Spark and Flink) might be better served if there were less duplication of effort and more cooperation among the groups. Can’t we all just get along?

Spark Cluster Computing with Working Sets



BDAS, the Berkeley Data Analytics Stack

BDAS, the Berkeley Data Analytics Stack, is an open source software stack that integrates software components being built by the AMPLab to make sense of Big Data.


BDAS consists of the components shown below. Components shown in Blue or Green are available for download now. Click on a title to go to that project’s homepage.

[Figure: BDAS stack diagram. Layers: In-house Apps; Access and; Processing Engine; Resource Virtualization. Legend: AMPLab Developed, Spark Community, 3rd Party, In Development.]



BDAS will continue to evolve over the life of the AMPLab project, as existing components evolve and mature and new ones are added.


  • Software project Meetups – Help organize monthly developer meetups around BDAS components. Check out the Spark/Shark meetup group, the Mesos meetup group, and the Tachyon meetup group.
  • AMP Camp “Big Data Bootcamp” – Two days packed full of software system intros, demos and hands-on exercises. Aims to bring practitioners with no prior experience up to speed and writing real code with real advanced algorithms.
  • Support – Unlike many research software prototypes that never see production use, we support BDAS software components by actively monitoring and responding on developer and user mailing lists.

Spark SQL: Relational Data Processing in Spark

Spark SQL is a new module in Apache Spark that integrates relational processing with Spark’s functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g., declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g., machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g., schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.


Recent performance improvements in Apache Spark: SQL, Python, DataFrames, and More

In this post, we look back and cover recent performance efforts in Spark. In a follow-up blog post next week, we will look forward and share with you our thoughts on the future evolution of Spark’s performance.

2014 was the most active year of Spark development to date, with major improvements across the entire engine. One particular area where it made great strides was performance: Spark set a new world record in 100TB sorting, beating the previous record held by Hadoop MapReduce by three times, using only one-tenth of the resources; it received a new SQL query engine with a state-of-the-art optimizer; and many of its built-in algorithms became five times faster.

Back in 2010, we at the AMPLab at UC Berkeley designed Spark for interactive queries and iterative algorithms, as these were two major use cases not well served by batch frameworks like MapReduce. As a result, early users were drawn to Spark because of the significant performance improvements in these workloads. However, performance optimization is a never-ending process, and as Spark’s use cases have grown, so have the areas looked at for further improvement. User feedback and detailed measurements helped the Apache Spark developer community to prioritize areas to work in. Starting with the core engine, I will cover some of the recent optimizations that have been made.

The Spark ecosystem

Core engine

One unique thing about Spark is that its user-facing APIs (SQL, streaming, machine learning, etc.) run over a common core execution engine. Whenever possible, specific workloads are sped up by making optimizations in the core engine. As a result, these optimizations speed up all components. We’ve often seen very surprising results this way: for example, when core developers decreased latency to introduce Spark Streaming, we also saw SQL queries become two times faster.

In the core engine, the major improvements in 2014 were in communication. First, shuffle is the operation that moves data point-to-point across machines. It underpins almost all workloads. For example, a SQL query joining two data sources uses shuffle to move tuples that should be joined together onto the same machine, and product recommendation algorithms such as ALS use shuffle to send user/product weights across the network.
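The role of shuffle in a join can be illustrated with a small plain-Scala sketch. This is not Spark's shuffle implementation; the names `partitionOf`, `shuffle` and `shuffleJoin` are hypothetical, and "partitions" here are just in-memory groups standing in for machines.

```scala
// Illustrative hash-partitioned join: records with equal keys meet in one partition.
object ShuffleJoinSketch {
  /** Map a key to one of `numPartitions` partitions (non-negative hash modulo). */
  def partitionOf[K](key: K, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  /** "Shuffle": regroup records so all records with the same key share a partition. */
  def shuffle[K, V](records: Seq[(K, V)], numPartitions: Int): Map[Int, Seq[(K, V)]] =
    records.groupBy { case (k, _) => partitionOf(k, numPartitions) }

  /** Join two datasets by shuffling both sides, then joining each partition locally. */
  def shuffleJoin[K, A, B](left: Seq[(K, A)],
                           right: Seq[(K, B)],
                           numPartitions: Int): Seq[(K, (A, B))] = {
    val leftParts = shuffle(left, numPartitions)
    val rightParts = shuffle(right, numPartitions)
    (0 until numPartitions).flatMap { p =>
      // Index the right side of this partition by key for the local join.
      val rightByKey = rightParts.getOrElse(p, Seq.empty[(K, B)]).groupBy(_._1)
      for {
        (k, a) <- leftParts.getOrElse(p, Seq.empty[(K, A)])
        (_, b) <- rightByKey.getOrElse(k, Seq.empty[(K, B)])
      } yield (k, (a, b))
    }
  }
}
```

Because both sides use the same partitioning function, matching tuples always land in the same partition, so each partition can be joined without looking at any other.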

The last two releases of Spark featured a new sort-based shuffle layer and a new network layer based on Netty with zero-copy and explicit memory management. These two make Spark more robust in very large-scale workloads. In our own experiments at Databricks, we have used this to run petabyte shuffles on 250,000 tasks. These two changes were also the key to Spark setting the current world record in large-scale sorting, beating the previous Hadoop-based record by 30 times in per-node performance.

In addition to shuffle, core developers rewrote Spark’s broadcast primitive to use a BitTorrent-like protocol to reduce network traffic. This speeds up workloads that need to send a large parameter to multiple machines, including SQL queries and many machine learning algorithms. We have seen more than five times performance improvements for these workloads.

Python API (PySpark)

Python is perhaps the most popular programming language used by data scientists. The Spark community views Python as a first-class citizen of the Spark ecosystem. When it comes to performance, Python programs historically lag behind their JVM counterparts due to the more dynamic nature of the language.

Spark’s core developers have worked extensively to bridge the performance gap between JVM languages and Python. In particular, PySpark can now run on PyPy to leverage the just-in-time compiler, in some cases improving performance by a factor of 50. The way Python processes communicate with the main Spark JVM programs has also been redesigned to enable worker reuse. In addition, broadcasts are handled via a more optimized serialization framework, enabling PySpark to broadcast data larger than 2GB. The latter two have made general Python program performance two to 10 times faster.


One year ago, Shark, an earlier SQL on Spark engine based on Hive, was deprecated and we at Databricks built a new query engine based on a new query optimizer, Catalyst, designed to run natively on Spark. It was a controversial decision, within the Apache Spark developer community as well as internally within Databricks, because building a brand new query engine necessitates astronomical engineering investments. One year later, more than 115 open source contributors have joined the project, making it one of the most active open source query engines.

Shark vs. Spark SQL

Despite being less than a year old, Spark SQL is outperforming Shark on almost all benchmarked queries. In TPC-DS, a decision-support benchmark, Spark SQL is outperforming Shark often by an order of magnitude, due to better optimizations and code generation.

Machine learning (MLlib) and Graph Computation (GraphX)

From early on, Spark was packaged with powerful standard libraries that can be optimized along with the core engine. This has allowed for a number of rich optimizations to these libraries. For instance, Spark 1.1 featured a new communication pattern for aggregating machine learning models using multi-level aggregation trees. This has reduced the model aggregation time by an order of magnitude. This new communication pattern, coupled with the more efficient broadcast implementation in core, results in speeds 1.5 to five times faster across all algorithms.
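A multi-level aggregation tree can be sketched in a few lines of plain Scala. This is an illustration of the general pattern, not MLlib's code: the name `treeAggregate` is reused here only for a hypothetical local helper, and `fanIn` (how many partial results each intermediate node merges) is an assumed parameter.

```scala
// Illustrative tree aggregation: merge partial results level by level
// instead of sending every partial result to a single node at once.
object TreeAggregateSketch {
  /** Merge `partials` in groups of at most `fanIn` until one value remains.
    * Returns the combined value and the number of tree levels used. */
  def treeAggregate[A](partials: Seq[A], fanIn: Int)(merge: (A, A) => A): (A, Int) = {
    require(partials.nonEmpty, "need at least one partial result")
    require(fanIn >= 2, "fanIn must be at least 2")
    var level: Seq[A] = partials
    var depth = 0
    while (level.size > 1) {
      // Each group stands in for one intermediate aggregator node.
      level = level.grouped(fanIn).map(_.reduce(merge)).toVector
      depth += 1
    }
    (level.head, depth)
  }
}
```

With eight partial results and a fan-in of 2, the final node merges only two inputs after three levels, rather than receiving all eight at once. Spreading the merge work this way is the point of the pattern.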


In addition to optimizations in communication, Alternating Least Squares (ALS), a common collaborative filtering algorithm, was also re-implemented in Spark 1.3, which provided another factor-of-two speedup for ALS on top of the improvements described above. In addition, all the built-in algorithms in GraphX have also seen 20% to 50% runtime performance improvements, due to a new optimized API.

DataFrames: Leveling the Field for Python and JVM

In Spark 1.3, we introduced a new DataFrame API. This new API makes Spark programs more concise and easier to understand, and at the same time exposes more application semantics to the engine. As a result, Spark can use Catalyst to optimize these programs.

DataFrame performance

Through the new DataFrame API, Python programs can achieve the same level of performance as JVM programs because the Catalyst optimizer compiles DataFrame operations into JVM bytecode. Indeed, performance sometimes beats hand-written Scala code.

The Catalyst optimizer will also become smarter over time, picking better logical optimizations and physical execution optimizations. For example, in the future, Spark will be able to leverage schema information to create a custom physical layout of data, improving cache locality and reducing garbage collection. This will benefit both Spark SQL and DataFrame programs. As more libraries are converting to use this new DataFrame API, they will also automatically benefit from these optimizations.

The goal of Spark is to offer a single platform where users can get the best distributed algorithms for any data processing task. We will continue to push the boundaries of performance, making Spark faster and more powerful for more users.

Note: An earlier version of this blog post appeared on O’Reilly Radar.

The IoT is creating a perfect storm of innovation and opportunity


Hardware, software, sensors, and physical things are coming together in uncharted waters. To succeed, you’ll need to build teams that cross disciplines in ways never before attempted. Envision new business models. And recognize the “crazy” ideas that are now entirely possible. Learn how at Solid.


  • Engineers are building products never before imagined.
  • Programmers are reaching beyond the screen in ways that would make a sci-fi writer envious.
  • Designers are creating UXs that make previously outlandish ideas seem obvious and intuitive.
  • Entrepreneurs and business leaders are inventing entirely new business models.

The technologies that connect things—from turbines to watches—to the power of software, the Internet, and big data are also being used to streamline industrial practices, lower the barriers to entry, and disrupt everything from customization to upgrades in a way that will revolutionize manufacturing, even in the oldest and most staid industries.

Enormous new opportunities are emerging for those who recognize them.

What’s happening is more than just an incremental advance. And Solid is more than just a conference. It’s unique—a mash-up of MIT and Disneyland, where you’ll:

  • engage in deep, intelligent conversations about vital issues like security, additive manufacturing, data architectures, breakthrough design, prototyping, and standards
  • see jaw-dropping demos of the coolest devices, drones, robots, and wearables that exist (or are imagined) today
  • connect and collaborate with some of the most interesting people in the industry—from both Fortune 500 companies and under-the-radar startups
  • hear unexpected, inspirational, and just-plain-brilliant ideas that will give you a glimpse of what the future may hold

When anything can be connected to the Internet, every company becomes a tech company. Whether you deal in fashion or manufacturing, robotics or agriculture, Solid will help you figure out how to apply the latest technologies to your business.

If you want to be a part of this—and who doesn’t?—join us at Solid in San Francisco June 23-25.


Gartner’s 2013 Hype Cycle for Emerging Technologies Maps Out Evolving Relationship Between Humans and Machines

2013 Hype Cycle Special Report Evaluates the Maturity of More Than 1,900 Technologies

The evolving relationship between humans and machines is the key theme of Gartner, Inc.’s “Hype Cycle for Emerging Technologies, 2013.” Gartner has chosen to feature the relationship between humans and machines due to the increased hype around smart machines, cognitive computing and the Internet of Things. Analysts believe that the relationship is being redefined through emerging technologies, narrowing the divide between humans and machines.

Gartner’s 2013 Hype Cycle Special Report provides strategists and planners with an assessment of the maturity, business benefit and future direction of more than 2,000 technologies, grouped into 98 areas. New Hype Cycles this year include content and social analytics, embedded software and systems, consumer market research, open banking, banking operations innovation, and information and communication technology (ICT) in Africa.

The Hype Cycle for Emerging Technologies report is the longest-running annual Hype Cycle, providing a cross-industry perspective on the technologies and trends that senior executives, CIOs, strategists, innovators, business developers and technology planners should consider in developing emerging-technology portfolios.

“It is the broadest aggregate Gartner Hype Cycle, featuring technologies that are the focus of attention because of particularly high levels of hype, or those that Gartner believes have the potential for significant impact,” said Jackie Fenn, vice president and Gartner fellow.

“In making the overriding theme of this year’s Hype Cycle the evolving relationship between humans and machines, we encourage enterprises to look beyond the narrow perspective that only sees a future in which machines and computers replace humans. In fact, by observing how emerging technologies are being used by early adopters, there are actually three main trends at work. These are augmenting humans with technology — for example, an employee with a wearable computing device; machines replacing humans — for example, a cognitive virtual assistant acting as an automated customer representative; and humans and machines working alongside each other — for example, a mobile robot working with a warehouse employee to move many boxes.”

“Enterprises of the future will use a combination of these three trends to improve productivity, transform citizen and customer experience, and to seek competitive advantage,” said Hung LeHong, research vice president at Gartner. “These three major trends are made possible by three areas that facilitate and support the relationship between human and machine. Machines are becoming better at understanding humans and the environment — for example, recognizing the emotion in a person’s voice — and humans are becoming better at understanding machines — for example, through the Internet of things. At the same time, machines and humans are getting smarter by working together.”

Figure 1. Hype Cycle for Emerging Technologies, 2013


Source: Gartner August 2013

The 2013 Emerging Technologies Hype Cycle highlights technologies that support all six of these areas including:

1. Augmenting humans with technology

Technologies make it possible to augment human performance in physical, emotional and cognitive areas. The main benefit to enterprises in augmenting humans with technology is to create a more capable workforce. For example, consider if all employees had access to wearable technology that could answer any product or service question or pull up any enterprise data at will. The ability to improve productivity, sell better or serve customers better will increase significantly. Enterprises interested in these technologies should look to bioacoustic sensing, quantified self, 3D bioprinting, brain-computer interface, human augmentation, speech-to-speech translation, neurobusiness, wearable user interfaces, augmented reality and gesture control.

2. Machines replacing humans

There are clear opportunities for machines to replace humans: dangerous work, simpler yet expensive-to-perform tasks and repetitive tasks. The main benefit to having machines replace humans is improved productivity, less danger to humans and sometimes better quality work or responses. For example, a highly capable virtual customer service agent could field the many straightforward questions from customers and replace much of the customer service agents’ “volume” work — with the most up-to-date information. Enterprises should look to some of these representative technologies for sources of innovation on how machines can take over human tasks: volumetric and holographic displays, autonomous vehicles, mobile robots and virtual assistants.

3. Humans and machines working alongside each other

Humans versus machines is not a binary decision; there are times when machines working alongside humans is a better choice. A new generation of robots is being built to work alongside humans. IBM’s Watson does background research for doctors, just like a research assistant, to ensure they account for all the latest clinical, research and other information when making diagnoses or suggesting treatments. The main benefits of having machines working alongside humans are the ability to access the best of both worlds (that is, productivity and speed from machines, emotional intelligence and the ability to handle the unknown from humans). Technologies that represent and support this trend include autonomous vehicles, mobile robots, natural language question and answering, and virtual assistants.

The three trends that will change the workforce and the everyday lives of humans in the future are enabled by a set of technologies that help both machine and humans better understand each other. The following three areas are a necessary foundation for the synergistic relationships to evolve between humans and machines:

4. Machines better understanding humans and the environment

Machines and systems can only benefit from a better understanding of human context, humans and human emotion. This understanding leads to simple context-aware interactions, such as displaying an operational report for the location closest to the user; to better understanding customers, such as gauging consumer sentiment for a new product line by analyzing Facebook postings; to complex dialoguing with customers, such as virtual assistants using natural language question and answering to interact on customer inquiries. The technologies on this year’s Hype Cycle that represent these capabilities include bioacoustic sensing, smart dust, quantified self, brain computer interface, affective computing, biochips, 3D scanners, natural-language question and answering (NLQA), content analytics, mobile health monitoring, gesture control, activity streams, biometric authentication methods, location intelligence and speech recognition.

5. Humans better understanding machines

As machines get smarter and start automating more human tasks, humans will need to trust the machines and feel safe. The technologies that make up the Internet of things will provide increased visibility into how machines are operating and the environmental situation they are operating in. For example, IBM’s Watson provides “confidence” scores for the answers it provides to humans while Baxter shows a confused facial expression on its screen when it does not know what to do. MIT has also been working on Kismet, a robot that senses social cues from visual and auditory sensors, and responds with facial expressions that demonstrate understanding. These types of technology are very important in allowing humans and machines to work together. The 2013 Hype Cycle features Internet of Things, machine-to-machine communication services, mesh networks: sensor and activity streams.

6. Machines and humans becoming smarter

The surge in big data, analytics and cognitive computing approaches will provide decision support and automation to humans, and awareness and intelligence to machines. These technologies can be used to make both humans and things smarter. NLQA technology can improve a virtual customer service representative. NLQA can also be used by doctors to research huge amounts of medical journals and clinical tests to help diagnose an ailment or choose a suitable treatment plan. These supporting technologies are foundational for both humans and machines as we move forward to a digital future and enterprises should consider quantum computing, prescriptive analytics, neurobusiness, NLQA, big data, complex event processing, in-memory database management system (DBMS), cloud computing, in-memory analytics and predictive analytics.

Additional information is available in Gartner’s “Hype Cycle for Emerging Technologies, 2013” at http://www.gartner.com/resId=2571624. The Special Report includes a video in which Ms. Fenn provides more details regarding this year’s Hype Cycles, as well as links to all of the Hype Cycle reports. The Special Report can be found at http://www.gartner.com/technology/research/hype-cycles/.

Mr. LeHong and Ms. Fenn will provide additional analysis during the Gartner webinar “Emerging Technologies Hype Cycle for 2013: Redefining the Relationship” on August 21, at 10 a.m. EDT and 1 p.m. EDT. To register for one of these complimentary webinars, please visit http://my.gartner.com/portal/server.pt?open=512&objID=202&mode=2&PageID=5553&resId=2546719&ref=Webinar-Calendar.