
Smart Big Data: The All-Important 90/10 Rule

The sheer volumes involved with Big Data can sometimes be staggering. So if you want to get value from the time and money you put into a data analysis project, a structured and strategic approach is very important.

The phenomenon of Big Data is giving us an ever-growing volume and variety of data which we can now store and analyze. Any regular reader of my posts knows that I personally prefer to focus on Smart Data, rather than Big Data, because the term places too much importance on the size of the data. The real potential for revolutionary change comes from the ability to manipulate, analyze and interpret new data types in ever-more sophisticated ways.

Application of the Pareto distribution and 90/10 rule in a related context

The SMART Data Framework

I’ve written previously about my SMART Data framework which outlines a step-by-step approach to delivering data-driven insights and improved business performance.

  1. Start with strategy: Formulate a plan – based on the needs of your business
  2. Measure metrics and data: Collect and store the information you need
  3. Apply analytics: Interrogate the data for insights and build models to test theories
  4. Report results: Present the findings of your analysis in a way that the people who will put them into effect will understand
  5. Transform your business: Understand your customers better, optimize business processes, improve staff wellbeing, or increase revenues and profits

My work involves helping businesses use data to drive business value. Because of this I get to see a lot of half-finished data projects, mothballed when it was decided that external help was needed.

The biggest mistake by far is putting insufficient thought – or neglecting to put any thought – into a structured strategic approach to big data projects. Instead of starting with strategy, too many companies start with the data. They start frantically measuring and recording everything they can in the belief that big data is all about size. Then they get lost in the colossal mishmash of everything they’ve collected, with little idea of how to go about mining the all-important insights.

This is why I have come up with the 90/10 rule: when working with data, 90% of your time should be spent on a structured, strategic approach, while 10% of your time should be spent “exploring” the data.

The 90/10 Rule

The 90% of structured time should be spent putting the steps outlined in the SMART Data framework into operation: making a logical progression through an ordered set of steps with a defined beginning (a problem you need to solve), middle (a process) and end (answers or results).

This is, after all, why we call it Data Science. Business data projects are very much like scientific experiments: we run simulations to test the validity of theories and hypotheses, producing quantifiable results.

The other 10% of your time can be spent freely playing with your data – mining for patterns and insights which, while they may be valuable in other ways, are not an integral part of your SMART Data strategy.

Yes, you can be really lucky and your data exploration can deliver valuable insights – and who knows what you might find, or what inspiration may come to you? But it should always play second-fiddle to following the structure of your data project in a methodical and comprehensive way.

Always start with strategy

I think this is a very important point to make, because it’s something I often see companies get the wrong way round. Too often, the data is taken as the starting point, rather than the strategy.

Businesses that do this run the very real risk of becoming “data rich and insight poor”. They are in danger of missing out on the hugely exciting benefits that a properly implemented and structured data-driven initiative can bring.

Working in a structured way means starting with strategy: identifying a clear business need and the data you will require to address it. Businesses that do this, and follow it through in a methodical way, will win the race to unearth the most valuable and game-changing insights.

I hope you found this post interesting. I am always keen to hear your views on the topic and invite you to comment with any thoughts you might have.

About: Bernard Marr is a globally recognized expert in analytics and big data. He helps companies manage, measure, analyze and improve performance using data.

His new book is Big Data: Using Smart Big Data, Analytics and Metrics To Make Bette… You can read a free sample chapter here.

What is network government?

Network government is the transition to a more transparent, cooperative and beneficial relationship between government, citizens and business as a result of technological integration and organisational connectivity.

By enabling active stakeholder participation and access, it transforms government services by placing users at the centre of policy design and implementation and service delivery.



Enhancing citizen participation and public sector connectedness through user experience testing and design

SOURCE: NetworkedGovernment-OxfordAnalytica

Over 80% of Finnish citizens use online services to interact with and inform policy in the public sector. This adoption rate represents, in many respects, ‘full-spectrum’ citizen-centrism and is rooted in a strong tradition of open, participatory government: concepts of citizen-centric policy design are found throughout the Finnish Constitution, the Local Government Act and other legislation as far back as the early post-war period. Additionally, Finnish citizens are typically very receptive to change and modernisation, making innovation in governance more readily adopted by the public.


To sustain its robust levels of participation, the Finnish government has instituted a series of reforms, many of which are comparatively low-tech. One initiative is a crowd-sourced action plan, developed in part by the Ministry of Justice, which coordinates work across government agencies to identify best practices for enhancing input from the public when designing policies. According to a policy paper developed in conjunction with the Open Government Partnership, Finland’s “Action Plan on Open Government” focuses on three core areas:
― Open Knowledge. Make as much information available as possible, in a way that protects privacy and security.
― Open Procedures. Make policymaking transparent, and invite participation in the process.
― Clear Language. Ensure that citizens are capable of interpreting and comprehending policy language.


The goal of this effort is to facilitate increased citizen participation in the policymaking process by lowering the barriers to entry through simplified language and processes. Today, major policies are increasingly dependent on citizen input, and the most ambitious project will involve citizens in the budget-setting process. Although participatory budgeting is not a new concept, Finland’s process will allow citizens to use open data to inform their own budget-setting priorities, which policymakers will use to help inform national policy.

Is Spark The Data Platform Of The Future?

Hadoop has been the foundation for data programmes since Big Data hit the big time, serving as the launching point for almost every company that is serious about its data offerings.

However, as we predicted, the rise of in-memory databases has created a need for companies to adopt frameworks that harness this power effectively.

It was therefore no surprise when Apache released Spark, a new framework that uses in-memory primitives to deliver performance up to 100 times faster than Hadoop’s two-stage, disk-based MapReduce.

This kind of product has become increasingly important as we move forward into a world where the amount and speed of data has been increasing exponentially.

So is Spark going to be the Hadoop beater that it seems to be?


This kind of technology that allows us to make decisions quicker and with increased amounts of data is going to be something that companies are clamouring for.

It is not simply in principle that this platform will bring about change, either. As an open source project, Spark has more developers working on it than any other Apache product.

This suggests that people support the idea through their willingness to dedicate their time to it. It is common knowledge that many of the data scientists working on Apache products are the same ones who will be using it in their day-to-day roles at different companies, which could suggest that they are going to adopt this system in the future.


One of the main reasons for Hadoop’s success in the last few years has been not only its ease of use, but also that companies can get it for nothing: the basics of Hadoop run on commodity hardware, and companies only need to upgrade when they ramp up their data programmes.

Spark, by contrast, runs in memory, which requires high-performance systems, something that companies new to data initiatives are unlikely to invest in.

So which is it more likely to be?

In my opinion, Hadoop will always be the foundation of data programmes and with more companies looking at adopting it as the basis for their implementations, this is unlikely to change.

Spark may well become the upgrade that companies who move to a stage where they want, or need, improved performance will adopt. As Spark can work alongside Hadoop this seems to have also been in the minds of the guys at Apache when coming up with the product in the first place.

Therefore, Spark is unlikely to be a Hadoop beater; it will instead become more like Hadoop’s big brother. Capable of doing more, but at increased cost and necessary only for certain data volumes and velocities, it is a complement rather than a replacement.

How to analyze 100 million images for $624

There’s a lot of new ground to be explored in large-scale image processing.

Jetpac is building a modern version of Yelp, using big data rather than user reviews. People are taking more than a billion photos every single day, and many of these are shared publicly on social networks. We analyze these pictures to discover what they can tell us about bars, restaurants, hotels, and other venues around the world — spotting hipster favorites by the number of mustaches, for example.

[Image: mustache photo]

Treating large numbers of photos as data, rather than just content to display to the user, is a pretty new idea. Traditionally it’s been prohibitively expensive to store and process image data, and not many developers are familiar with both modern big data techniques and computer vision. That meant we had to cut a path through some thick underbrush to get a system working, but the good news is that the free-falling price of commodity servers makes running it incredibly cheap.

I use m1.xlarge servers on Amazon EC2, which are beefy enough to process two million Instagram-sized photos a day, and cost only $12.48 a day! I’ve used some open source frameworks to distribute the work in a completely scalable way, so this works out to $624 for a 50-machine cluster that can process 100 million pictures in 24 hours. That’s just 0.000624 cents per photo! (I seriously do not have enough exclamation points for how mind-blowingly exciting this is.)
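The arithmetic behind those numbers can be sketched in a few lines; the figures come from the text above, while the variable names are mine:

```python
# Cost model using the article's figures: $12.48 per m1.xlarge per day,
# two million photos per server per day, a 50-machine cluster.
server_cost_per_day = 12.48
photos_per_server_per_day = 2_000_000
servers = 50

cluster_cost_per_day = servers * server_cost_per_day           # ~$624
photos_per_day = servers * photos_per_server_per_day           # 100 million
cents_per_photo = cluster_cost_per_day * 100 / photos_per_day  # ~0.000624

print(round(cluster_cost_per_day, 2), photos_per_day, round(cents_per_photo, 6))
```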

The Foundations

So, how do I do all that work so quickly and cheaply? The building blocks are the open source Hadoop, HIPI, and OpenCV projects. Unless you were frozen in an Arctic ice-cave for the last decade, you’ll have heard of Hadoop, but the other two are a bit less famous.

HIPI is a Java framework that lets you efficiently process images on a Hadoop cluster. It’s needed because HDFS can’t handle large numbers of files, so it provides a way of bundling images together into much bigger files, and unbundling them on the fly as you process them. It’s been growing in popularity in research and academia, but hasn’t had widespread commercial use yet. I had to fork it and add meta-data support so I could keep track of the files as they went through the system, along with some other fixes. It’s running like a champion now, though, and has enabled everything else we’re doing.
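A toy illustration of the bundling idea, not HIPI's actual on-disk format: concatenate many small images into one large file as length-prefixed records, so HDFS sees one big file instead of millions of tiny ones.

```python
import io
import struct

def bundle(images):
    """Pack a list of image byte strings into one length-prefixed blob."""
    buf = io.BytesIO()
    for data in images:
        buf.write(struct.pack(">I", len(data)))  # 4-byte big-endian length
        buf.write(data)
    return buf.getvalue()

def unbundle(blob):
    """Recover the original image byte strings from a bundled blob."""
    buf, out = io.BytesIO(blob), []
    while True:
        header = buf.read(4)
        if not header:
            break
        (length,) = struct.unpack(">I", header)
        out.append(buf.read(length))
    return out

photos = [b"jpeg-bytes-1", b"jpeg-bytes-2"]
assert unbundle(bundle(photos)) == photos
```

The real HIPI bundle also carries per-image metadata in its headers, which is what makes it possible to track files through the pipeline.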

OpenCV is written in C++, but has a recently added Java wrapper. It supports a lot of the fundamental operations you need to implement image-processing algorithms, so I was able to write my advanced (and fairly cunning, if I do say so myself, especially for mustaches!) image analysis and object detection routines using OpenCV for the basic operations.


The first and most time-consuming step is getting your images downloaded. HIPI has a basic distributed image downloader as an example, but you’ll want to make sure the site you’re accessing won’t be overwhelmed. I was focused on large social networks with decent CDNs, so I felt comfortable running several servers in parallel. I did have to alter the HIPI downloader code to add a user agent string so the admins of the sites could contact me if the downloading was causing any problems.

If you have three different external services you’re pulling from, with four servers assigned to each service, you end up taking about 30 days to download 100 million photos. That’s $4,492.80 for the initial pull, which is not chump change, but not wildly outside a startup budget, especially if you plan ahead and use reserved instances to reduce the cost.
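The download budget works out directly from those figures (three services, four servers each, roughly 30 days at $12.48 per server per day):

```python
# Download cost estimate using the figures from the text.
services, servers_per_service, days = 3, 4, 30
server_cost_per_day = 12.48

download_cost = services * servers_per_service * days * server_cost_per_day
print(round(download_cost, 2))  # 4492.8
```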

Object Recognition

Now you have 100 million images, you need to do something useful with them. Photos contain a massive amount of information, and OpenCV has a lot of the building blocks you’ll need to extract some of the parts you care about. Think about the properties you’re interested in — maybe you want to exclude pictures containing people if you’re trying to create slideshows about places — and then look around at the tools that the library offers; for this example you could search for faces, and identify photos that are faceless as less likely to contain people. Anything you could do on a single image, you can now do in parallel on millions of them.

Before you go ahead and do a distributed run, though, make sure it works. I like to write a standalone command-line version of my image processing step, so I can debug it easily and iterate fast on the algorithm. OpenCV has a lot of convenience functions that make it easy to load images from disk, or even capture them from your webcam, display them, and save them out at the end. Their Java support is quite new, and I hit a few hiccups, but overall, it works very well. This made it possible to write a wrapper that is handed the images from HIPI’s decoding engine, does some processing, and then writes the results out in a text format, one line per image, all within a natively-Java Hadoop job.

Once you’re happy with the way the algorithm is working on your test data, and you’ve wrapped it in a HIPI wrapper, you’re ready to apply it to all the images you’ve downloaded. Just spin up a Hadoop job with your jar file, pointing at the HDFS folders containing the output of the downloader. In our experience, we were able to process around two million 700×700-sized JPEG images on each server per day, so you can use that as a rule of thumb for the size/speed tradeoffs you want to make when you choose how many machines to put in your cluster. Surprisingly, the actual processing we did within our algorithm didn’t affect the speed very much; apparently the object recognition and image-processing work ran fast enough that it was swamped by the time spent on IO.
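That rule of thumb turns cluster sizing into simple arithmetic; the helper function is mine, but the two-million-per-server-per-day throughput figure is from the text:

```python
import math

# Rule of thumb from the text: ~2M 700x700 JPEGs per server per day.
def servers_needed(total_photos, days, per_server_per_day=2_000_000):
    """Smallest cluster that processes total_photos within the deadline."""
    return math.ceil(total_photos / (days * per_server_per_day))

print(servers_needed(100_000_000, days=1))  # 50
```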

I hope I’ve left you excited and ready to tackle your own large-scale image processing challenges. Whether you’re a data person who’s interested in image processing or a computer-vision geek who wants to widen your horizons, there’s a lot of new ground to be explored, and it’s surprisingly easy once you put together the pieces.

21 Ways to Excel at Project Management

Learn with this book, written in a question-and-answer style, containing 21 pieces of valuable advice for making your project a success.