
Big Data: Using Smart Big Data, Analytics and Metrics to Make Better Decisions and Improve Performance


Convert the promise of big data into real world results

There is so much buzz around big data. We all need to know what it is and how it works – that much is obvious. But is a basic understanding of the theory enough to hold your own in strategy meetings? Probably. But what will set you apart from the rest is actually knowing how to USE big data to get solid, real-world business results – and putting that in place to improve performance. Big Data will give you a clear understanding, blueprint, and step-by-step approach to building your own big data strategy. This is a much-needed practical introduction to actually putting the topic into practice. Illustrated with numerous real-world examples from a cross-section of companies and organisations, the book will take you through the five steps of the SMART model: Start with Strategy, Measure Metrics and Data, Apply Analytics, Report Results, Transform.

  • Discusses how companies need to clearly define what it is they need to know
  • Outlines how companies can collect relevant data and measure the metrics that will help them answer their most important business questions
  • Addresses how the results of big data analytics can be visualised and communicated to ensure key decision-makers understand them
  • Includes many high-profile case studies from the author's work with some of the world's best known brands

Big Data: The Key Vocabulary Everyone Should Understand

Guest blog post by Bernard Marr, first published here.

The field of Big Data requires more clarity and I am a big fan of simple explanations. This is why I have attempted to provide simple explanations for some of the most important technologies and terms you will come across if you’re looking at getting into big data.

However, if you are completely new to the topic then you might want to start here: What the Heck is… Big Data? and then come back to this list later.

Here are some of the key terms:

Algorithm: A mathematical formula or statistical process run by software to perform an analysis of data. It usually consists of multiple calculation steps and can be used to automatically process data or solve problems.
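To make the definition concrete, here is a minimal sketch of an algorithm in exactly this sense: a fixed sequence of calculation steps applied automatically to a data set. The function name and the outlier-detection rule are invented for illustration.

```python
# Illustrative only: a tiny "algorithm" -- a fixed sequence of calculation
# steps run automatically over data. Here: flag values more than two
# standard deviations from the mean.

def flag_outliers(values, threshold=2.0):
    n = len(values)
    mean = sum(values) / n                            # step 1: central tendency
    var = sum((v - mean) ** 2 for v in values) / n    # step 2: spread
    std = var ** 0.5
    # step 3: apply the decision rule to every record
    return [v for v in values if abs(v - mean) > threshold * std]

print(flag_outliers([10, 11, 9, 10, 12, 11, 95]))
```

Each step feeds the next, and the whole sequence can be re-run unattended on new data, which is what distinguishes an algorithm from a one-off calculation.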

Amazon Web Services: A collection of cloud computing services offered by Amazon to help businesses carry out large-scale computing operations (such as big data projects) without having to invest in their own server farms and data storage warehouses. Essentially, storage space, processing power and software operations are rented rather than having to be bought and installed from scratch.

Analytics: The process of collecting, processing and analyzing data to generate insights that inform fact-based decision-making. In many cases it involves software-based analysis using algorithms. For more, have a look at my post: What the Heck is… Analytics

Big Table: Google’s proprietary data storage system, which it uses to host, among other things, its Gmail, Google Earth and YouTube services. It is also made available for public use through the Google App Engine.

Biometrics: Using technology and analytics to identify people by one or more of their physical traits, such as face recognition, iris recognition, fingerprint recognition, etc. For more, see my post: Big Data and Biometrics

Cassandra: A popular open source database management system managed by The Apache Software Foundation that has been designed to handle large volumes of data across distributed servers.

Cloud: Cloud computing, or computing “in the cloud”, simply means software or data running on remote servers, rather than locally. Data stored “in the cloud” is typically accessible over the internet, wherever in the world the owner of that data might be. For more, check out my post: What The Heck is… The Cloud?

Distributed File System: Data storage system designed to store large volumes of data across multiple storage devices (often cloud based commodity servers), to decrease the cost and complexity of storing large amounts of data.

Data Scientist: Term used to describe an expert in extracting insights and value from data. It is usually someone that has skills in analytics, computer science, mathematics, statistics, creativity, data visualisation and communication as well as business and strategy.

Gamification: The process of creating a game from something which would not usually be a game. In big data terms, gamification is often a powerful way of incentivizing data collection. For more on this read my post: What The Heck is… Gamification?

Google App Engine: Google’s own cloud computing platform, allowing companies to develop and host their own services within Google’s cloud servers. Unlike Amazon’s Web Services, it is free for small-scale projects.

HANA: High-Performance Analytic Appliance – a software/hardware in-memory platform from SAP, designed for high-volume data transactions and analytics.

Hadoop: Apache Hadoop is one of the most widely used software frameworks in big data. It is a collection of programs which allow storage, retrieval and analysis of very large data sets using distributed hardware (allowing the data to be spread across many smaller storage devices rather than one very large one). For more, read my post: What the Heck is… Hadoop? And Why You Should Know About It

Internet of Things: A term to describe the phenomenon that more and more everyday items will collect, analyse and transmit data to increase their usefulness, e.g. self-driving cars, self-stocking refrigerators. For more, read my post: What The Heck is… The Internet of Things?

MapReduce: Refers to the software procedure of breaking up an analysis into pieces that can be distributed across different computers in different locations. It first distributes the analysis (map) and then collects the results back into one report (reduce). Several companies including Google and Apache (as part of its Hadoop framework) provide MapReduce tools.
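The map and reduce phases described above can be sketched in a few lines of ordinary Python. Real frameworks (Hadoop, Google's MapReduce) run the map and reduce functions on many machines in parallel; this single-process toy only shows the data flow.

```python
# A toy word count mimicking the two MapReduce phases: distribute the
# analysis over the input (map), then collect the results (reduce).
from collections import defaultdict

def map_phase(documents):
    # map: emit a (key, value) pair -- here (word, 1) -- for every word seen
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    # shuffle: group values by key; reduce: combine each group into one result
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data tools", "big graphs"]
print(reduce_phase(map_phase(docs)))
```

Because each map call looks at one document and each reduce call looks at one key, both phases can be split across as many machines as you have, which is the whole point of the model.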

Natural Language Processing: Software algorithms designed to allow computers to more accurately understand everyday human speech, allowing us to interact more naturally and efficiently with them.

NoSQL: Refers to database management systems that do not (or not only) use relational tables generally used in traditional database systems. It refers to data storage and retrieval systems that are designed for handling large volumes of data but without tabular categorisation (or schemas).

Predictive Analytics: A process of using analytics to predict trends or future events from data.

R: A popular open source software environment used for statistics and analytics.

RFID: Radio Frequency Identification. RFID tags use Automatic Identification and Data Capture technology to allow information about their location, direction of travel or proximity to each other to be transmitted to computer systems, allowing real-world objects to be tracked online.

Software-as-a-Service (SaaS): The growing tendency of software producers to provide their programs over the cloud – meaning users pay for the time they spend using it (or the amount of data they access) rather than buying software outright.

Structured v Unstructured Data: Structured data is basically anything that can be put into a table and organized in such a way that it relates to other data in the same table. Unstructured data is everything that can’t – email messages, social media posts and recorded human speech, for example.

I hope this was useful. As always, I would love to hear your views. Would you add any terms to this list? If so, please feel free to do so in the comments.

I really appreciate that you are reading my post. Here, at LinkedIn, I regularly write about management and technology issues and trends. If you would like to read my regular posts then please click ‘Follow’ and send me a LinkedIn invite. And, of course, feel free to also connect via Twitter, Facebook and The Advanced Performance Institute.

About the author

Bernard Marr is a globally recognized expert in strategy, performance management, analytics, KPIs and big data. He helps companies and executive teams manage, measure, analyze and improve performance.

His new book is: Big Data: Using Smart Big Data, Analytics and Metrics To Make Bette…

Glue code (glue code language)

Glue code, also called binding code, is custom-written programming that connects incompatible software components.

Glue code can be written in the same language as the code it is connecting together, but it is often written in a specialized interpreted scripting language for connecting system components called a glue language. Popular glue languages include AppleScript, JavaScript, Perl, PHP, Python, Ruby, VBScript and PowerShell.

In addition to connecting disparate software modules, glue code can be used to tie together multiple systems. If an organization runs cloud services on both Amazon and Google, for example, glue code can be written to allow workloads and data flow between the two companies’ servers. Glue code is also useful for custom shell commands, application wrappers and rapid application prototyping.
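A minimal sketch of the idea, with entirely hypothetical components: one produces CSV text, the other expects a list of records. The adapter in the middle is classic glue code, containing no business logic of its own, only translation between two incompatible interfaces.

```python
# Hypothetical example: component_a_output and component_b_ingest stand in
# for two real systems with incompatible formats; glue() is the glue code.
import csv, io, json

def component_a_output():
    # pretend legacy system that emits CSV
    return "name,size_gb\nlogs,120\nclickstream,450\n"

def component_b_ingest(records):
    # pretend modern service that expects JSON records
    return json.dumps(records)

def glue(csv_text):
    # the actual "binding code": parse one format, emit the other
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return component_b_ingest(rows)

print(glue(component_a_output()))
```

Note that if either component's format changes, only this adapter needs to change, which is both glue code's value and, at scale, the maintenance burden the article warns about.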

Glue code is sometimes looked upon as a necessary evil because it can easily become the weakest link for service level agreements (SLAs) and, if not managed properly, become excessively complicated spaghetti code that negatively affects performance.

If you squint the right way, you will notice that graphs are everywhere

Large-scale graph computing at Google
Posted: Monday, June 15, 2009

Posted by Grzegorz Czajkowski, Systems Infrastructure Team

If you squint the right way, you will notice that graphs are everywhere. For example, social networks, popularized by Web 2.0, are graphs that describe relationships among people. Transportation routes create a graph of physical connections among geographical locations. Paths of disease outbreaks form a graph, as do games among soccer teams, computer network topologies, and citations among scientific papers. Perhaps the most pervasive graph is the web itself, where documents are vertices and links are edges. Mining the web has become an important branch of information technology, and at least one major Internet company has been founded upon this graph.

Despite differences in structure and origin, many graphs out there have two things in common: each of them keeps growing in size, and there is a seemingly endless number of facts and details people would like to know about each one. Take, for example, geographic locations. A relatively simple analysis of a standard map (a graph!) can provide the shortest route between two cities. But progressively more sophisticated analysis could be applied to richer information such as speed limits, expected traffic jams, roadworks and even weather conditions. In addition to the shortest route, measured as sheer distance, you could learn about the most scenic route, or the most fuel-efficient one, or the one which has the most rest areas. All these options, and more, can all be extracted from the graph and made useful — provided you have the right tools and inputs. The web graph is similar. The web contains billions of documents, and that number increases daily. To help you find what you need from that vast amount of information, Google extracts more than 200 signals from the web graph, ranging from the language of a webpage to the number and quality of other pages pointing to it.

In order to achieve that, we have created scalable infrastructure, named Pregel, to mine a wide range of graphs. In Pregel, programs are expressed as a sequence of iterations. In each iteration, a vertex can, independently of other vertices, receive messages sent to it in the previous iteration, send messages to other vertices, modify its own state and that of its outgoing edges, and mutate the graph’s topology (experts in parallel processing will recognize that the Bulk Synchronous Parallel Model inspired Pregel).

Currently, Pregel scales to billions of vertices and edges, but this limit will keep expanding. Pregel’s applicability is harder to quantify, but so far we haven’t come across a type of graph or a practical graph computing problem which is not solvable with Pregel. It computes over large graphs much faster than alternatives, and the application programming interface is easy to use. Implementing PageRank, for example, takes only about 15 lines of code. Developers of dozens of Pregel applications within Google have found that “thinking like a vertex,” which is the essence of programming in Pregel, is intuitive.
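Pregel's API is not public, but the "think like a vertex" model it describes can be sketched on a single machine: in each superstep, every vertex reads the messages sent to it in the previous superstep, updates its value, and sends messages along its outgoing edges. The PageRank below follows that shape (the damping constant and graph are illustrative, not Google's actual code).

```python
# A single-process sketch of Pregel-style "think like a vertex" PageRank.
# Each superstep: read inbox -> update own state -> send messages out.

def pagerank(graph, supersteps=30, damping=0.85):
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(supersteps):
        # message passing: each vertex splits its rank over its out-edges
        inbox = {v: [] for v in graph}
        for v, out_edges in graph.items():
            if out_edges:
                share = rank[v] / len(out_edges)
                for dest in out_edges:
                    inbox[dest].append(share)
        # each vertex updates its state independently of the others,
        # which is what lets a real system run vertices in parallel
        for v in graph:
            rank[v] = (1 - damping) / n + damping * sum(inbox[v])
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(graph))
```

Every step of the loop body touches only one vertex's inbox and state, so the same program parallelizes across billions of vertices once the framework handles message routing between machines.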

We’ve been using Pregel internally for a while now, but we are beginning to share information about it outside of Google. Greg Malewicz will be speaking at the joint industrial track between ACM PODC and ACM SPAA this August on the very subject. In case you aren’t able to join us there, here’s a spoiler: The seven bridges of Königsberg — inspiration for Leonhard Euler’s famous theorem that established the basics of graph theory — spanned the Pregel river.

Storm or Spark: Choose your real-time weapon


The idea of real-time business intelligence has been around for a while (see the Wikipedia page on the topic begun in 2006). But while people have been talking about the idea for years, I haven’t seen many enterprises actually embrace the vision, much less realize the benefits it enables.
At least part of the reason has been the lack of tooling for implementing BI and analytics in real time. Traditional data-warehousing environments were heavily oriented toward batch operations with extremely high latencies, were incredibly expensive, or both.

A number of powerful, easy-to-use open source platforms have emerged to change this. Two of the most notable ones are Apache Storm and Apache Spark, which offer real-time processing capabilities to a much wider range of potential users. Both are projects within the Apache Software Foundation, and while the two tools provide overlapping capabilities, they each have distinctive features and roles to play.

Storm: The Hadoop of real-time processing

Storm, a distributed computation framework for event stream processing, began life as a project of BackType, a marketing intelligence company bought by Twitter in 2011. Twitter soon open-sourced the project and put it on GitHub, but Storm ultimately moved to the Apache Incubator and became an Apache top-level project in September 2014.

Storm has sometimes been referred to as the Hadoop of real-time processing. The Storm documentation appears to agree: “Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.”

To meet this end, Storm is designed for massive scalability, supports fault-tolerance with a “fail fast, auto restart” approach to processes, and offers a strong guarantee that every tuple will be processed. Storm defaults to an “at least once” guarantee for messages, but offers the ability to implement “exactly once” processing as well.

Storm is written primarily in Clojure and is designed to support wiring “spouts” (think input streams) and “bolts” (processing and output modules) together as a directed acyclic graph (DAG) called a topology. Storm topologies run on clusters and the Storm scheduler distributes work to nodes around the cluster, based on the topology configuration.

You can think of topologies as roughly analogous to a MapReduce job in Hadoop, except that given Storm’s focus on real-time, stream-based processing, topologies default to running forever or until manually terminated. Once a topology is started, the spouts bring data into the system and hand the data off to bolts (which may in turn hand data to subsequent bolts) where the main computational work is done. As processing progresses, one or more bolts may write data out to a database or file system, send a message to another external system, or otherwise make the results of the computation available to the users.
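The spout-to-bolt flow just described can be mimicked in plain Python: treat the spout as a generator that feeds data in, and bolts as chained processing steps. This is illustrative only and uses none of Storm's actual APIs; a real topology is declared through Storm's builder classes and runs distributed across a cluster.

```python
# A toy, single-process analogue of a Storm topology:
# spout (data source) -> bolt (transform) -> bolt (aggregate).

def sentence_spout():
    # spout: brings data into the system as a stream
    # (finite here so the example terminates; Storm streams are unbounded)
    for s in ["storm processes streams", "spark processes batches"]:
        yield s

def split_bolt(stream):
    # bolt: one processing step; emits one tuple per word
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    # terminal bolt: maintains rolling counts -- the kind of incremental
    # computation Storm is known for
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# wiring spout -> bolt -> bolt is what a topology definition expresses
print(count_bolt(split_bolt(sentence_spout())))
```

In Storm the equivalent wiring is a DAG declaration, and the scheduler decides which cluster nodes each spout or bolt instance actually runs on.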

One of the strengths of the Storm ecosystem is a rich array of available spouts specialized for receiving data from all types of sources. While you may have to write custom spouts for highly specialized applications, there’s a good chance you can find an existing spout for an incredibly large variety of sources — from the Twitter streaming API to Apache Kafka to JMS brokers to everything in between.

Adapters exist to make it straightforward to integrate with HDFS file systems, meaning Storm can easily interoperate with Hadoop if needed. Another strength of Storm is its support for multilanguage programming. While Storm itself is based on Clojure and runs on the JVM, spouts and bolts can be written in almost any language, including non-JVM languages that take advantage of a protocol for communicating between components using JSON over stdin/stdout.

In short, Storm is a very scalable, fast, fault-tolerant open source system for distributed computation, with a special focus on stream processing. Storm excels at event processing and incremental computation, calculating rolling metrics in real time over streams of data. While Storm also provides primitives to enable generic distributed RPC and can theoretically be used to assemble almost any distributed computation job, its strength is clearly event stream processing.
Spark: Distributed processing for all

Spark, another project suited to real-time distributed computation, started out as a project of AMPLab at the University of California at Berkeley before joining the Apache Incubator and ultimately graduating as a top-level project in February 2014. Like Storm, Spark supports stream-oriented processing, but it’s more of a general-purpose distributed computing platform.

As such, Spark can be seen as a potential replacement for the MapReduce functions of Hadoop, while Spark has the ability to run on top of an existing Hadoop cluster, relying on YARN for resource scheduling. In addition to Hadoop YARN, Spark can layer on top of Mesos for scheduling or run as a stand-alone cluster using its built-in scheduler. Note that if Spark is not used with Hadoop, some type of network/distributed file system (NFS, AFS, and so on) is still required if running on a cluster, so that each node has access to the underlying data.
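These deployment options correspond to the master URL handed to spark-submit when launching an application. A minimal sketch of the same application under each cluster manager; app.py, the host names, and the ports are placeholders:

```shell
# Same application, different cluster managers (hosts/ports are placeholders):
spark-submit --master yarn                     app.py   # existing Hadoop cluster
spark-submit --master mesos://mesos-host:5050  app.py   # Mesos cluster
spark-submit --master spark://spark-host:7077  app.py   # Spark's own scheduler
spark-submit --master local[4]                 app.py   # single machine, 4 cores
```

The application code itself does not change between these; only the resource scheduling layer does.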

Spark is written in Scala and, like Storm, supports multilanguage programming, although Spark provides specific API support only for Scala, Java, and Python. Spark does not have the specific abstraction of a “spout,” but includes adapters for working with data stored in numerous disparate sources, including HDFS files, Cassandra, HBase, and S3.

Where Spark shines is in its support for multiple processing paradigms and the supporting libraries. Yes, Spark supports a streaming model, but this support is provided by only one of several Spark modules, including purpose-built modules for SQL access, graph operations, and machine learning, along with stream processing.

Like Storm, Spark is designed for massive scalability, and the Spark team has documented users of the system running production clusters with thousands of nodes. In addition, Spark won the 2014 Daytona GraySort contest, turning in the best time on a workload consisting of sorting 100TB of data. The Spark team also documents Spark ETL operations with production workloads in the multi-petabyte range.

Spark is a fast, scalable, and flexible open source distributed computing platform, compatible with Hadoop and Mesos, which supports several computational models, including streaming, graph-centric operations, SQL access, and distributed machine learning. Spark has been documented to scale exceptionally well and, like Storm, is an excellent platform on which to build a real-time analytics and business intelligence system.

Making your decision

How do you choose between Storm and Spark?

If your requirements are primarily focused on stream processing and CEP-style processing and you are starting a greenfield project with a purpose-built cluster for the project, I would probably favor Storm — especially when existing Storm spouts that match your integration requirements are available. This is by no means a hard and fast rule, but such factors would at least suggest beginning with Storm.

On the other hand, if you’re leveraging an existing Hadoop or Mesos cluster and/or if your processing needs involve substantial requirements for graph processing, SQL access, or batch processing, you might want to look at Spark first.

Another factor to consider is the multilanguage support of the two systems. For example, if you need to leverage code written in R or any other language not natively supported by Spark, then Storm has the advantage of broader language support. By the same token, if you must have an interactive shell for data exploration using API calls, then Spark offers you a feature that Storm doesn’t.

In the end, you’ll probably want to perform a detailed analysis of both platforms before making a final decision. I recommend using both platforms to build a small proof of concept — then run your own benchmarks with a workload that mirrors your anticipated workloads as closely as possible before fully committing to either.

Of course, you don’t need to make an either/or decision. Depending on your workloads, infrastructure, and requirements, you may find that the ideal solution is a mixture of Storm and Spark — along with other tools like Kafka, Hadoop, Flume, and so on. Therein lies the beauty of open source.

Whichever route you choose, these tools demonstrate that the real-time BI game has changed. Powerful options once available only to an elite few are now within the reach of most, if not all, midsize-to-large organizations. Take advantage of them.

This story, “Storm or Spark: Choose your real-time weapon” was originally published by InfoWorld.

Storm vs. Spark Streaming: Side-by-side comparison

Both Storm and Spark Streaming are open-source frameworks for distributed stream processing. But there are important differences, as you will see in the following side-by-side comparison.

Processing Model, Latency

Although both frameworks provide scalability and fault tolerance, they differ fundamentally in their processing model. Whereas Storm processes incoming events one at a time, Spark Streaming batches up events that arrive within a short time window before processing them. Thus, Storm can achieve sub-second latency of processing an event, while Spark Streaming has a latency of several seconds.
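The latency difference follows directly from the two models, and a toy sketch makes it visible (this is not real Storm or Spark code; note also that Spark Streaming actually batches by time interval, whereas the sketch batches by count for simplicity).

```python
# A toy illustration of the two processing models: handle each event the
# moment it arrives, versus collect events into micro-batches first.

def process_per_event(events, handle):
    # Storm-style: every event is processed individually on arrival,
    # so latency is bounded by the cost of handling one event
    return [handle(e) for e in events]

def process_micro_batches(events, handle, batch_size=3):
    # Spark Streaming-style: events wait until the batch closes, then the
    # whole batch is processed as a unit (adding batching latency)
    results = []
    for i in range(0, len(events), batch_size):
        batch = events[i:i + batch_size]
        results.append([handle(e) for e in batch])
    return results

events = list(range(7))
print(process_per_event(events, lambda e: e * 2))      # one result per event
print(process_micro_batches(events, lambda e: e * 2))  # results grouped by batch
```

In the micro-batch version, the first event of a batch waits for the batch to fill (or, in Spark Streaming, for the batch interval to elapse) before any processing starts, which is where the seconds of extra latency come from.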

Fault Tolerance, Data Guarantees

However, the tradeoff is in the fault tolerance and data guarantees. Spark Streaming provides better support for stateful computation that is fault tolerant. In Storm, each individual record has to be tracked as it moves through the system, so Storm only guarantees that each record will be processed at least once, but allows duplicates to appear during recovery from a fault. That means mutable state may be incorrectly updated twice.
Spark Streaming, on the other hand, need only track processing at the batch level, so it can efficiently guarantee that each mini-batch will be processed exactly once, even if a fault such as a node failure occurs. [Actually, Storm’s Trident library also provides exactly once processing. But, it relies on transactions to update state, which is slower and often has to be implemented by the user.]
Storm vs. Spark Streaming comparison.


In short, Storm is a good choice if you need sub-second latency and no data loss. Spark Streaming is better if you need stateful computation, with the guarantee that each event is processed exactly once. Spark Streaming programming logic may also be easier because it is similar to batch programming, in that you are working with batches (albeit very small ones).

Implementation, Programming API


Storm is primarily implemented in Clojure, while Spark Streaming is implemented in Scala. This is something to keep in mind if you want to look into the code to see how each system works or to make your own customizations. Storm was developed at BackType and Twitter; Spark Streaming was developed at UC Berkeley.

Programming API

Storm comes with a Java API, as well as support for other languages. Spark Streaming can be programmed in Scala as well as Java.

Batch Framework Integration

One nice feature of Spark Streaming is that it runs on Spark. Thus, you can use the same (or very similar) code that you write for batch processing and/or interactive queries in Spark, on Spark Streaming. This reduces the need to write separate code to process streaming data and historical data.
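The reuse point can be sketched in plain Python rather than actual PySpark code: write the transformation once, then apply it both to a stored historical batch and to mini-batches arriving from a stream. The function and data here are invented for illustration.

```python
# A sketch of batch/stream code reuse: one transformation function serves
# both a historical batch job and a live stream of mini-batches.

def clean_and_count(records):
    # shared "business logic", written exactly once
    counts = {}
    for r in records:
        key = r.strip().lower()
        counts[key] = counts.get(key, 0) + 1
    return counts

historical = ["Error", "ok", "error "]        # batch job over stored data
print(clean_and_count(historical))

stream = [["OK", "ok"], ["error"]]            # mini-batches arriving over time
for mini_batch in stream:
    print(clean_and_count(mini_batch))        # same code, applied per batch
```

This works precisely because a Spark Streaming mini-batch looks like a small batch data set, so the batch-mode logic applies unchanged; that is the design advantage the paragraph above describes.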

Storm vs. Spark Streaming: implementation and programming API.


Two advantages of Spark Streaming are that (1) it is not implemented in Clojure 🙂 and (2) it is well integrated with the Spark batch computation framework.

Production, Support

Production Use

Storm has been around for several years and has run in production at Twitter since 2011, as well as at many other companies. Meanwhile, Spark Streaming is a newer project; its only production deployment (that I am aware of) has been at Sharethrough since 2013.

Hadoop Distribution, Support

Storm is the streaming solution in the Hortonworks Hadoop data platform, whereas Spark Streaming is in both MapR’s distribution and Cloudera’s Enterprise data platform. In addition, Databricks is a company that provides support for the Spark stack, including Spark Streaming.

Cluster Manager Integration

Although both systems can run on their own clusters, Storm also runs on Mesos, while Spark Streaming runs on both YARN and Mesos.
Storm vs. Spark Streaming: production and support.


Storm has run in production much longer than Spark Streaming. However, Spark Streaming has the advantages that (1) it has a company dedicated to supporting it (Databricks), and (2) it is compatible with YARN.

Further Reading

For an overview of Storm, see these slides.

For a good overview of Spark Streaming, see the slides to a Strata Conference talk. A more detailed description can be found in this research paper.

2014 Data Science Salary Survey

Tools, Trends, What Pays (and What Doesn’t) for Data Professionals
Publisher: O’Reilly
Released: November 2014

For the second year, O’Reilly Media conducted an anonymous survey to expose the tools successful data analysts and engineers use, and how those tool choices might relate to their salary. We heard from over 800 respondents who work in and around the data space, and from a variety of industries across 53 countries and 41 U.S. states.

Findings from the survey include:

  • Average number of tools and median income for all respondents
  • Distribution of responses by a variety of factors, including age, location, industry, position, and cloud computing
  • Detailed analysis of tool use, including tool clusters
  • Correlation of tool usage and salary

Gain insight from these potentially career-changing findings—download this free report to learn the details, and plug your own variables into the regression model to find out where you fit into the data space.

John King is a data analyst at O’Reilly Media. Roger Magoulas is O’Reilly’s Research Director.