
What MapReduce can’t do

We discuss here a large class of big data problems where MapReduce can’t be used – not in a straightforward way at least – and we propose a rather simple analytic, statistical solution.

MapReduce is a technique that splits big data sets into many smaller ones, processes each small data set separately (but simultaneously) on different servers or computers, then gathers and aggregates the results of all the sub-processes to produce the final answer. Such a distributed architecture allows you to process big data sets up to 1,000 times faster than traditional (non-distributed) designs, if you use 1,000 servers and split the main process into 1,000 sub-processes.

MapReduce works very well in contexts where variables or observations are processed one by one. For instance, you analyze 1 terabyte of text data, and you want to compute the frequencies of all keywords found in your data. You can divide the 1 terabyte into 1,000 data sets, each 1 gigabyte. Now you produce 1,000 keyword frequency tables (one for each subset) and aggregate them to produce a final table.
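
The keyword-frequency example above can be sketched in a few lines of Python. This is a single-machine illustration of the map/reduce pattern (the chunk contents here are made up; in a real deployment each `map_count` call would run on its own server against a 1-gigabyte chunk):

```python
from collections import Counter

def map_count(chunk):
    # Map step: build one keyword-frequency table per chunk
    # (each call would run on a separate server in a real deployment)
    return Counter(chunk.lower().split())

def reduce_counts(tables):
    # Reduce step: aggregate the per-chunk tables into the final table
    total = Counter()
    for table in tables:
        total += table
    return total

chunks = ["big data big", "data sets are big"]   # stand-ins for 1 GB chunks
frequencies = reduce_counts(map(map_count, chunks))
print(frequencies["big"], frequencies["data"])   # 3 2
```

This works precisely because each chunk can be counted independently; nothing in the map step needs to see two chunks at once.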

However, when you need to process variables or data sets jointly, that is, 2 by 2 or 3 by 3, MapReduce offers no benefit over non-distributed architectures. One must come up with a more sophisticated solution.

The Problem

Let’s say that your data set consists of n observations and k variables. For instance, the k variables represent k different stock symbols or indices (say k=10,000) and the n observations represent stock price signals (up / down) measured at n different times. You want to find very high correlations (ideally with time lags to be able to make a profit) – e.g. if Google is up today, Facebook is up tomorrow.

You have to compute k * (k-1) / 2 correlations to solve this problem, even though you only have k=10,000 stock symbols. You cannot split your 10,000 stock symbols into 1,000 clusters, each containing 10 stock symbols, and then use MapReduce. The vast majority of the correlations that you have to compute will involve a stock symbol in one cluster and another one in a different cluster (because you have far more correlations to compute than you have clusters). These cross-cluster computations make MapReduce useless in this case. The same issue arises if you replace the word “correlation” by any other function, say f, computed on two variables rather than one. This is why I claim that we are dealing here with a large class of problems where MapReduce can’t help. I’ll discuss another example (keyword taxonomy) later in this article.
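
A quick back-of-the-envelope computation makes the point, using the cluster sizes from the example above:

```python
k = 10_000                          # stock symbols
total_pairs = k * (k - 1) // 2      # correlations to compute: 49,995,000

clusters, per_cluster = 1_000, 10
within = clusters * (per_cluster * (per_cluster - 1) // 2)   # 45,000
cross = total_pairs - within

print(f"{cross / total_pairs:.2%} of pairs cross cluster boundaries")
```

Over 99.9% of the required correlations straddle two clusters, so almost none of the work can be done locally within a cluster.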

Three Solutions

Here I propose three solutions:

1. Sampling

Instead of computing all cross-correlations, just compute a fraction of them: select m random pairs of variables, say m = 0.001 * k * (k-1) / 2, and compute correlations for these m pairs only. A smart strategy consists of starting with a very small fraction of all possible pairs, and increasing the number of pairs until the highest (most significant) correlations barely grow anymore. Or you may use a simulated-annealing approach to decide which variables to keep and which ones to add to form new pairs, after computing correlations on (say) 1,000 randomly selected seed pairs (of variables).

I’ll soon publish an article that shows how approximate solutions (a local optimum) to a problem, requiring a million times fewer computer resources than finding the global optimum, yield very good approximations, with an error often smaller than the background noise found in any data set. In another paper, I will describe a semi-combinatorial strategy to handle not only 2×2 combinations (as in this correlation issue), but 3×3, 4×4 etc., to find very high quality multivariate vectors (in terms of predictive power) in the context of statistical scoring or fraud detection.

2. Binning

If you can bin your variables in a way that makes sense, and if n is small (say n=5), then you can pre-compute all potential correlations and save them in a lookup table. In our example, variables are already binned: we are dealing with signals (up or down) rather than actual, continuous metrics such as price deltas. With n=5, each signal takes one of 2^5 = 32 possible values, so there are at most 32 × 32 = 1,024 potential pairs of values. An example of such a pair is {(up, up, down, up, down), (up, up, up, down, down)}, where the first 5 values correspond to one stock and the last 5 values to another. It is thus easy to pre-compute all 1,024 correlations. You will still have to browse all k * (k-1) / 2 pairs of stocks to solve your problem, but now it’s much faster: for each pair you get the correlation from the lookup table – no computation required, only accessing a value in a hash table or an array with 1,024 cells.

Note that with binary variables, the mathematical formula for correlation simplifies significantly, and using the simplified formula on all pairs might be faster than using a lookup table of pre-computed correlations. However, the principle works regardless of whether you compute a correlation or a much more complicated function f.

3. Classical data reduction

Traditional reduction techniques can also be used: forward or backward step-wise techniques where (in turn) you add or remove one variable at a time (or maybe two). The variable added is chosen to maximize the resulting entropy, and conversely for variables being removed. Entropy can be measured in various ways. In a nutshell, if you have two data subsets (from the same large data set),

  • A set A with 100 variables, which is 1.23 GB when compressed, 
  • A set B with 500 variables, including the 100 variables from set A, which is 1.25 GB when compressed

Then you can say that the extra 400 variables (e.g. stock symbols) in set B don’t bring any extra predictive power and can be ignored. In other words, the lift obtained with set B is so small that it’s probably smaller than the noise inherent to these stock price signals.
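
The compression test above can be approximated with `zlib`. The data here are synthetic (set B duplicates set A's column four times, so it adds no new information), purely to illustrate the idea; in practice you would serialize the actual variable subsets:

```python
import zlib

def compressed_size(rows):
    # Serialize a table of observations and measure its compressed size,
    # a crude proxy for the amount of information (entropy) it carries.
    blob = "\n".join(",".join(map(str, row)) for row in rows).encode()
    return len(zlib.compress(blob, 9))

# Set B's extra columns are exact copies of set A's single column,
# so its compressed size grows far less than its raw size (5x).
set_a = [(i % 7,) for i in range(10_000)]
set_b = [(i % 7,) * 5 for i in range(10_000)]
lift = compressed_size(set_b) / compressed_size(set_a)
# a lift close to 1 means the extra columns carry little new information
```

When the ratio stays near 1, as with the 1.23 GB vs 1.25 GB example above, the added variables can be dropped.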

Note: An interesting solution consists of using a combination of the three previous strategies. Also, be careful to make sure that the high correlations found are not an artifact caused by the “curse of big data” (see reference article below for details).

Another example where MapReduce is of no use

Building a keyword taxonomy:

Step 1:

You gather tons of keywords over the Internet with a web crawler (crawling Wikipedia or DMOZ directories), and compute the frequencies for each keyword and for each “keyword pair”. A “keyword pair” is two keywords found on the same web page, or close to each other on the same web page. Also, by keyword I mean something like “California insurance”, so a keyword usually contains more than one token, but rarely more than three. With all the frequencies, you can create a table (typically containing many millions of keywords, even after keyword cleaning), where each entry is a pair of keywords and 3 numbers, e.g.

A=”California insurance”, B=”home insurance”, x=543, y=998, z=11


  • x is the number of occurrences of keyword A in all the web pages that you crawled
  • y is the number of occurrences of keyword B in all the web pages that you crawled
  • z is the number of occurrences where A and B form a pair (e.g. they are found on the same page)

This “keyword pair” table can indeed be built very easily and efficiently using MapReduce. Note that for the vast majority of keywords A and B, z=0: they do not form a “keyword pair”. So by ignoring these null entries, your “keyword pair” table is still manageable, and might contain as few as 50 million entries.
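
This step does decompose cleanly: each page can be mapped independently, and pairs with z=0 never appear. A hedged single-machine sketch (the page contents are invented):

```python
from collections import Counter
from itertools import combinations

def map_page(keywords):
    # Map step: per-page keyword counts plus co-occurrence ("pair") counts
    kw = Counter(keywords)
    pairs = Counter(tuple(sorted(p)) for p in combinations(set(keywords), 2))
    return kw, pairs

def reduce_pages(mapped):
    # Reduce step: merge per-page tables; pairs with z = 0 simply never appear
    kw_total, pair_total = Counter(), Counter()
    for kw, pairs in mapped:
        kw_total += kw
        pair_total += pairs
    return kw_total, pair_total

pages = [["california insurance", "home insurance", "home insurance"],
         ["home insurance", "car insurance"]]
x_y, z = reduce_pages(map(map_page, pages))
```

Here `x_y` holds the per-keyword counts (the x and y columns) and `z` holds the pair counts.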

Step 2:

To create a taxonomy, you want to put these keywords into similar clusters. One way to do it is to compute a similarity d(A, B) between two keywords A and B. For instance d(A, B) = z / SQRT(x * y), although other choices are possible. The higher d(A, B), the closer keywords A and B are to each other. Now the big problem is to perform clustering – any kind of clustering, e.g. hierarchical – on the “keyword pair” table, using any kind of similarity. This problem, just like the correlation problem, cannot be split into sub-problems (followed by a merging step) using MapReduce. Why? Which solution would you propose in this case?
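
For concreteness, here is that formula applied to the sample table entry given in Step 1 (larger values mean closer keywords):

```python
import math

def keyword_similarity(x, y, z):
    # x, y: occurrences of keywords A and B; z: co-occurrences of the pair
    return z / math.sqrt(x * y)

s = keyword_similarity(x=543, y=998, z=11)   # the California/home insurance pair
```

Computing any one value is trivial; the hard part, as noted above, is the clustering step that needs all pairs jointly.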

Interview questions for data scientists

We are now at 77 questions. These are mostly open-ended questions, intended to assess the breadth of technical knowledge of a senior candidate for a rather high-level position, e.g. director.

  1. What is the biggest data set that you processed, and how did you process it, what were the results?
  2. Tell me two success stories about your analytic or computer science projects. How was lift (or success) measured?
  3. What is: lift, KPI, robustness, model fitting, design of experiments, 80/20 rule?
  4. What is: collaborative filtering, n-grams, map reduce, cosine distance?
  5. How to optimize a web crawler to run much faster, extract better information, and better summarize data to produce cleaner databases?
  6. How would you come up with a solution to identify plagiarism?
  7. How to detect individual paid accounts shared by multiple users?
  8. Should click data be handled in real time? Why? In which contexts?
  9. What is better: good data or good models? And how do you define “good”? Is there a universal good model? Are there any models that are definitely not so good?
  10. What is probabilistic merging (AKA fuzzy merging)? Is it easier to handle with SQL or other languages? Which languages would you choose for semi-structured text data reconciliation? 
  11. How do you handle missing data? What imputation techniques do you recommend?
  12. What is your favorite programming language / vendor? Why?
  13. Tell me 3 things positive and 3 things negative about your favorite statistical software.
  14. Compare SAS, R, Python, and Perl.
  15. What is the curse of big data?
  16. Have you been involved in database design and data modeling?
  17. Have you been involved in dashboard creation and metric selection? What do you think about Birt?
  18. What features of Teradata do you like?
  19. You are about to send one million emails (marketing campaign). How do you optimize delivery? How do you optimize response? Can you optimize both separately? (answer: not really)
  20. Toad, Brio and other similar clients are quite inefficient for querying Oracle databases. Why? What would you do to increase speed by a factor of 10, and to handle far bigger outputs?
  21. How would you turn unstructured data into structured data? Is it really necessary? Is it OK to store data as flat text files rather than in an SQL-powered RDBMS?
  22. What are hash table collisions? How can they be avoided? How frequently do they happen?
  23. How do you make sure a MapReduce application has good load balance? What is load balance?
  24. Examples where MapReduce does not work? Examples where it works very well? What are the security issues involved with the cloud? What do you think of EMC’s solution offering a hybrid approach – both internal and external cloud – to mitigate the risks and offer other advantages (which ones)?
  25. Is it better to have 100 small hash tables or one big hash table, in memory, in terms of access speed (assuming both fit within RAM)? What do you think about in-database analytics?
  26. Why is naive Bayes so bad? How would you improve a spam detection algorithm that uses naive Bayes?
  27. Have you been working with white lists? Positive rules? (In the context of fraud or spam detection)
  28. What is star schema? Lookup tables? 
  29. Can you perform logistic regression with Excel? (yes) How? (use linest on log-transformed data)? Would the result be good? (Excel has numerical issues, but it’s very interactive)
  30. Have you optimized code or algorithms for speed: in SQL, Perl, C++, Python etc. How, and by how much?
  31. Is it better to spend 5 days developing a 90% accurate solution, or 10 days for 100% accuracy? Does it depend on the context?
  32. Define: quality assurance, six sigma, design of experiments. Give examples of good and bad designs of experiments.
  33. What are the drawbacks of general linear model? Are you familiar with alternatives (Lasso, ridge regression, boosted trees)?
  34. Do you think 50 small decision trees are better than a large one? Why?
  35. Is actuarial science a branch of statistics (survival analysis)? If not, why not?
  36. Give examples of data that does not have a Gaussian or log-normal distribution. Give examples of data that has a very chaotic distribution.
  37. Why is mean square error a bad measure of model performance? What would you suggest instead?
  38. How can you prove that one improvement you’ve brought to an algorithm is really an improvement over not doing anything? Are you familiar with A/B testing?
  39. What is sensitivity analysis? Is it better to have low sensitivity (that is, great robustness) and low predictive power, or the other way around? How to perform good cross-validation? What do you think about the idea of injecting noise in your data set to test the sensitivity of your models?
  40. Compare logistic regression with decision trees and neural networks. How have these technologies been vastly improved over the last 15 years?
  41. Do you know, or have you used, data reduction techniques other than PCA? What do you think of step-wise regression? What kind of step-wise techniques are you familiar with? When is full data better than reduced data or a sample?
  42. How would you build non parametric confidence intervals, e.g. for scores? (see the AnalyticBridge theorem)
  43. Are you familiar with extreme value theory, Monte Carlo simulations or mathematical statistics (or anything else) to correctly estimate the chance of a very rare event?
  44. What is root cause analysis? How to identify a cause vs. a correlation? Give examples.
  45. How would you define and measure the predictive power of a metric?
  46. How to detect the best rule set for a fraud detection scoring technology? How do you deal with rule redundancy, rule discovery, and the combinatorial nature of the problem (for finding optimum rule set – the one with best predictive power)? Can an approximate solution to the rule set problem be OK? How would you find an OK approximate solution? How would you decide it is good enough and stop looking for a better one?
  47. How to create a keyword taxonomy?
  48. What is a Botnet? How can it be detected?
  49. Any experience with using APIs? Programming APIs? Google or Amazon APIs? AaaS (Analytics as a Service)?
  50. When is it better to write your own code than using a data science software package?
  51. Which tools do you use for visualization? What do you think of Tableau? R? SAS? (for graphs). How do you efficiently represent 5 dimensions in a chart (or in a video)?
  52. What is POC (proof of concept)?
  53. What types of clients have you been working with: internal, external, sales / finance / marketing / IT people? Consulting experience? Dealing with vendors, including vendor selection and testing?
  54. Are you familiar with software life cycle? With IT project life cycle – from gathering requests to maintenance? 
  55. What is a cron job? 
  56. Are you a lone coder? A production guy (developer)? Or a designer (architect)?
  57. Is it better to have too many false positives, or too many false negatives?
  58. Are you familiar with pricing optimization, price elasticity, inventory management, competitive intelligence? Give examples. 
  59. How does Zillow’s algorithm work? (to estimate the value of any home in US)
  60. How to detect bogus reviews, or bogus Facebook accounts used for bad purposes?
  61. How would you create a new anonymous digital currency?
  62. Have you ever thought about creating a startup? Around which idea / concept?
  63. Do you think that typed login / password will disappear? How could they be replaced?
  64. Have you used time series models? Cross-correlations with time lags? Correlograms? Spectral analysis? Signal processing and filtering techniques? In which context?
  65. Which data scientists do you admire most? Which startups?
  66. How did you become interested in data science?
  67. What is an efficiency curve? What are its drawbacks, and how can they be overcome?
  68. What is a recommendation engine? How does it work?
  69. What is an exact test? How and when can simulations help us when we do not use an exact test?
  70. What do you think makes a good data scientist?
  71. Do you think data science is an art or a science?
  72. What is the computational complexity of a good, fast clustering algorithm? What is a good clustering algorithm? How do you determine the number of clusters? How would you perform clustering on one million unique keywords, assuming you have 10 million data points – each one consisting of two keywords, and a metric measuring how similar these two keywords are? How would you create this 10 million data points table in the first place?
  73. Give a few examples of “best practices” in data science.
  74. What could make a chart misleading, difficult to read or interpret? What features should a useful chart have?
  75. Do you know a few “rules of thumb” used in statistical or computer science? Or in business analytics?
  76. What are your top 5 predictions for the next 20 years?
  77. How do you immediately know when statistics published in an article (e.g. newspaper) are either wrong or presented to support the author’s point of view, rather than correct, comprehensive factual information on a specific subject? For instance, what do you think about the official monthly unemployment statistics regularly discussed in the press? What could make them more accurate?

The 8 worst predictive modeling techniques

Posted by Vincent Granville on September 23, 2012 at 11:00am

  • Based on my opinion; you are welcome to discuss. Note that most of these techniques have evolved over time (in the last 10 years) to the point where most drawbacks have been eliminated – making the updated tool far different from, and better than, its original version. Yet these bad techniques, in their original form, are still widely used.

    1. Linear regression. Relies on normality, homoscedasticity and other assumptions; does not capture highly non-linear, chaotic patterns. Prone to over-fitting. Parameters difficult to interpret. Very unstable when independent variables are highly correlated. Fixes: variable reduction, applying a transformation to your variables, or using constrained regression (e.g. ridge or Lasso regression).
    2. Traditional decision trees. Very large decision trees are very unstable and impossible to interpret, and prone to over-fitting. Fix: combine multiple small decision trees together instead of using a large decision tree.
    3. Linear discriminant analysis. Used for supervised clustering. Bad technique because it assumes that clusters do not overlap, and are well separated by hyper-planes. In practice, they never do. Use density estimation techniques instead.
    4. K-means clustering. Used for clustering, tends to produce circular clusters. Does not work well with data points that are not a mixture of Gaussian distributions. 
    5. Neural networks. Difficult to interpret, unstable, subject to over-fitting.
    6. Maximum likelihood estimation. Requires your data to fit a pre-specified probability distribution. Not data-driven. In many cases the pre-specified Gaussian distribution is a terrible fit for your data.
    7. Density estimation in high dimensions. Subject to what is referred to as the curse of dimensionality. Fix: use (non parametric) kernel density estimators with adaptive bandwidths.
    8. Naive Bayes. Used e.g. in fraud and spam detection, and for scoring. Assumes that variables are independent; if they are not, it will fail miserably. In the context of fraud or spam detection, variables (sometimes called rules) are highly correlated. Fix: group the variables into independent clusters of variables (within each cluster, variables are highly correlated), and apply naive Bayes to the clusters. Or use data reduction techniques. Bad text mining techniques (e.g. basic “word” rules in spam detection) combined with naive Bayes produce absolutely terrible results, with many false positives and false negatives.

    And remember to use sound cross-validation techniques when testing models!

    Additional comments:

    The reasons why such poor models are still widely used are:

    1. Many university curricula still use outdated textbooks, so many students are not exposed to better data science techniques.
    2. People use black-box statistical software without knowing its limitations and drawbacks, how to correctly fine-tune the parameters and optimize the various knobs, or what the software actually produces.
    3. Governments force regulated industries (pharmaceutical, banking, Basel) to use the same 30-year-old SAS procedures for statistical compliance. For instance, better methods for credit scoring, even if available in SAS, are not allowed and are arbitrarily rejected by authorities. The same goes for clinical trial analyses submitted to the FDA, SAS being the mandatory software for compliance, allowing the FDA to replicate analyses and results from pharmaceutical companies.
    4. Modern data sets are considerably more complex and different than the old data sets used when these techniques were initially developed. In short, these techniques have not been developed for modern data sets.
    5. There’s no perfect statistical technique that would apply to all data sets, but there are many poor techniques.

    In addition, poor cross-validation allows bad models to make the cut, by over-estimating the true lift to be expected in future data, the true accuracy, or the true ROI outside the training set. Good cross-validation consists of:

    • splitting your training set into multiple subsets (test and control subsets),
    • including different types of clients and more recent data in the control sets (than in your test sets),
    • checking the quality of forecasted values on the control sets,
    • computing confidence intervals for individual errors (error defined e.g. as |true value minus forecasted value|) to make sure that the error is small enough AND not too volatile (it has small variance across all control sets).
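
    The steps above can be sketched as follows. The toy `records` and the perfect linear `model` are placeholders; real control sets would hold out recent data and different client types as described:

```python
import statistics

def control_set_errors(records, model, n_splits=3):
    # Split chronologically ordered (x, y) records into control subsets
    # and score each with |true value - forecasted value|.
    size = len(records) // n_splits
    means = []
    for i in range(n_splits):
        control = records[i * size:(i + 1) * size]
        errors = [abs(y - model(x)) for x, y in control]
        means.append(statistics.mean(errors))
    # the error should be both small AND stable across control sets
    return statistics.mean(means), statistics.pstdev(means)

records = [(x, 2 * x + (x % 2)) for x in range(9)]   # toy data
mean_err, err_spread = control_set_errors(records, model=lambda x: 2 * x)
```

    A small `mean_err` with a large `err_spread` is exactly the volatile-error situation the last bullet warns against.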

Five Project Management Mistakes

METHOD 123: empowering managers to succeed

Mistake #3: Not Keeping Schedule Up-to-Date

Many project managers create an initial schedule but then don’t do a good job of updating it during the project. There are warning signs that the schedule is not being updated:

  • The project manager cannot tell exactly what work is remaining to complete the project.

  • The project manager is unsure whether they will complete the project on-time.

  • The project manager does not know what the critical path of activities is.

  • Team members are not sure what they need to work on next (or even what they should be working on now).

It is a problem when the project manager does not really understand the progress made to date and how much work is remaining. When this happens, the project team is not utilized efficiently on the most critical activities.

There are a couple of other common scheduling problems.

  • Infrequent updates. Sometimes the project manager updates the schedule at lengthy intervals. For instance, updating the schedule every two months on a six-month project. This is not often enough to keep control of the schedule. The schedule should be updated every week or two.

  • Managing by percent complete. All activities should have a due date. As you monitor the work, keep focused on whether the work will be completed by the due date. It is not very valuable to know that an activity is 70% completed. It is more valuable to know if the due date will be hit.

  • Assigning activities that are too long. If you assign a team member an activity that is due by the end of the week, you know if the work is on-track when the week is over. However, if you assign someone an activity that does not need to be completed for eight weeks, you have a long time to go before you know if the work is really on schedule. Keep the due dates within a reasonable timeframe. 

It is not easy to catch up on a schedule once the project is underway. Typically, by the time you realize you need to update the schedule, your project is already in trouble. Updating the schedule at that point only shows how much trouble you are in. The much better approach is to keep the schedule up-to-date, and to ensure that it contains all of the work necessary to complete the project.


Business analytics remains top driver of BI growth



Chief information officers are gravitating toward business intelligence (BI) and analysis tools, and for good reason. The benefits of the technology have been both significant and wide ranging, from cutting costs and improving risk management to fraud prevention.

Those advantages have propelled analytics and BI to become the top-cited CIO priority for 2013, according to a recent Gartner study, and it appears that financial leaders are beginning to follow suit.

“That ability to measure the performance of past … projects can set the conditions for investment in the future, and without that, skepticism starts to rise up about whether these projects are really worthwhile,” CFO Research editorial director Celina Rogers said recently, according to TechTarget. 

A joint study by CFO Research and AlixPartners revealed that the majority of finance executives were unsatisfied with their companies’ ability to project whether their IT projects would produce a return on investment.

As competition continues to rise across every industry, forecasting will be an extremely valuable capability. Although many organizations are struggling with this component, AlixPartners managing director Bruce Myers suggested that it isn’t for a lack of available data.

“I promise you, 99 percent of the time the information is there, [but] pulling that information together into a data warehouse oftentimes requires making things consistent,” he told TechTarget. “But once you have all the data you need in that box, then you can easily answer all these questions.”

Developing analysis strategies
If financial executives and CFOs aren’t happy with their predictive analytics, that likely means they haven’t figured out how to optimize their BI strategies.

At the very least, interest in data analysis appears to be extremely high. A separate Gartner study found that the worldwide BI sector will grow another 7 percent in 2013 compared to 2012, as organizations attempt to analyze the data being produced by recent technological innovations.

In particular, Gartner research vice president Kurt Schlegel stressed that diagnostic, predictive and prescriptive analytics are set to take off, as every company has certain finance-related goals it wants to achieve. Myers told TechTarget that, when developing a BI plan, decision-makers should begin by agreeing on which questions they need to answer. That way, it will be easier to integrate the right information and datasets into their BI solutions.

At the same time, he suggested that companies target specific goals rather than use analysis tools only to solve large-scale problems.

Is data warehousing holding back the advance of analytics?

By: David Norris, Practice Leader – Analytics, Bloor Research

Published: 26th March 2013
Copyright Bloor Research © 2013

I have worked in data warehousing and analytics since the idea that Business Intelligence solutions had to offer more than reports added onto operational systems became accepted. But of late, I am facing a realisation: By extracting data from the operational environment, and loading it into a business intelligence environment, we introduce limitations that defeat the basis of why we set about doing it in the first place.

In a traditional data warehouse environment you have complex software to extract data, transform it, and load it into a new location. Increasingly, that procedure takes a very long time, even when you throw very expensive hardware and software at it, because of the sheer volume involved.

When we have the data, the next step is storing it in a third normal form layer, to isolate the data from changes and make it independent of future change – all very worthy, but again complex and time-consuming. No end user can actually use data in third normal form, so we move the data again, into a simpler format in a presentation layer, which also takes time.

The net result is that we have tied up immense amounts of intellect and capital to deliver data to the business that is heavily compromised by latency, cost, and difficulty of use.

As data has gotten really big, we have introduced big data solutions such as Hadoop, which remove some of the complexity by avoiding the structured stores, and exploit the capability to deliver affordable scalable solutions using commodity hardware. There is still a lot of complexity in the solution, because the means of extracting value require MapReduce programming, which is still an arcane skill and not for the average user.

So, after twenty years, I am starting to think the orthodox solutions have run their course and we need to think differently. So when I see something like the Pervasive combination of RushAnalytics and DataRush I am starting to see a solution that offers light at the end of the tunnel. What is required is something to enable Business Intelligence to provide business with insight quickly and affordably. That means commodity hardware, not expensive technical solutions, and software that supports rapid iterative development using visual interfaces, and not complex, arcane programming skills. So we need something that is fast, easy to use, and affordable. If we can tick those boxes, we are starting to get to a position where we can keep up with the demand that the business has for analytics, at a price that makes it economically feasible.

Pervasive is offering a platform that provides data access, transformation, analysis, visualisation and delivery using KNIME, an open source visual data analysis workflow environment, on top of the Pervasive DataRush parallel dataflow architecture. This means that with Pervasive RushAnalytics, you do not have to move the data into a specific data store before you can start to analyse it. Now domain experts can start to gain insight from data in time spans that are in a different league from the days that traditional analysis takes, and this is achieved on commodity technology. This offers what business really craves – a speed of return on investment measured in hours, or even minutes, not weeks!

KNIME offers the tools to address the data mining tasks that are required for risk management, fraud detection, and superior decision support that includes association rules, classifiers, clustering, principal component analysis and regression – all of the things that are key to effective data mining, and all via a graphical interface, so it’s point and click, not code and sweat. That workflow is then executed on the highly parallel Pervasive DataRush processing engine.

When I first came across DataRush a couple of years ago, I thought it was the best-kept secret in the IT industry. It is designed to enable code to work in parallel across multiple cores without having to be redesigned to exploit the additional cores as we move from a single-threaded environment up through the various permutations of twin-core, quad-core, eight-core, sixteen-core and so on. DataRush detects the number of cores and nodes available at runtime, and adjusts the processing workflow to exploit them, so its model is “build once and run on whatever” – total future-proofing.

I am hoping this is a sign that we are seeing the end of the world as we know it, where analytics is held back by the technology we make it run on, and entering a new era in which analytics can run unfettered and deliver the returns that we all crave – an exciting prospect.