  • Contact

    Send us press releases, events and new product information for publication.

    Email: nchuppala@ outlook.com

Getting the CMO and CIO to work as partners

McKinsey & Company

To turn new technologies into profits and growth, marketing and IT will need to change how they work—and how they work together.

August 2014 | by Matt Ariker, Martin Harrysson and Jesko Perrey
A global company recently decided to do what many companies are doing: figure out how to turn big data into big profits. It put together a preliminary budget and a request for proposal that in effect asked vendors to take the data the company had and identify opportunities.

Vendors were thrilled with what was essentially a free pass to collect and analyze everything (with due regard for customer privacy concerns, of course). Two months later, the bids were coming in 400 percent over budget. The obvious solution was to narrow the scope, but no one was sure what to cut and what to keep because the chief marketing officer (CMO) hadn’t specifically defined the most important data requirements, and the CIO hadn’t reviewed the request for proposal or intervened to prevent the inevitable above-budget bids. Months of wasted time and spending later, the company is no closer to a big data plan.

Variations of this big data storyline are playing out in executive offices around the world, with CMOs and CIOs in the thick of it. CMOs, who are responsible for promoting growth, need the CIOs’ help to turn the surfeit of customer data their companies are accumulating into increased revenue. CIOs, obliged to turn new technology into revenue, need the CMOs to help them with better functional and technical requirements for big data initiatives.

The situation reflects a central truth in today’s big data world: both the CMO and CIO are on the hook for turning all that data into growth together. It may be a marriage of convenience, but it’s one that CMOs and CIOs need to make work, especially as the worldwide volume of data is growing at least 40 percent a year, with ever-increasing variety and velocity. That’s why many CMOs are waking up to the fact that IT can’t be treated like a back-office function anymore; rather, the CIO is becoming a strategic partner who is crucial to developing and executing marketing strategy.

Companies that are more data driven are 5 percent more productive and 6 percent more profitable than other companies.1 Given the $50 billion that marketers already spend on big data and analytics capabilities annually, the pressure is on to show significant above-market returns for that investment.2
The big data and advanced analytics systems needed to capture that return don’t follow the traditional sequential path of requirements gathering, building, testing, and deployment. They involve new architectures for data aggregation, coupled with rapid experimentation, iteration, and evolution of functionality. They demand a new way of working that is likely unfamiliar to both CMOs and CIOs.

New realities
More and more, CMOs and CIOs are seeing that they are natural partners: CMOs have an unprecedented amount of customer data, from which they need to extract insights to increase revenue and profits. The CIO has the expertise in the development of IT architectures and the execution of large programs needed to create the company’s big data backbone and generate the necessary insights.

Historically, though, the relationship has often been a fractious one. CMOs have traditionally acted as stewards of the brand and have focused on large creative campaigns that generate excitement for the company’s products or services. The CIO, on the other hand, has primarily focused on a combination of business-process improvement (for example, in order-to-cash work flows) and “keeping the lights on” by managing core transaction systems, ensuring cybersecurity, supporting end users, and reducing costs.

The digital explosion has forced CMOs and CIOs to work more closely together (see sidebar, “How a technology company benefited from cooperation”). But that hasn’t always made them work better together. As the mix of IT spending shifts from the back office and supply-chain management (for those industries that have a supply chain) to the front office and customer engagement, tensions may arise about the CMO’s and CIO’s decision rights and budget authority. These tensions are reflected in research suggesting that most CMOs today see marketing as the natural leader of big data efforts, while most CIOs see IT in that role.3

How a technology company benefited from cooperation

The demands of speed and agility are an important operational source of friction. As changes in customer behavior, technology, and the business environment accelerate, marketers need fast-adapting systems. But for IT, the need for speed can be a massive shift, often requiring the function to retool its operating model in order to quickly deliver analytic systems that drive better decision making.

It’s true that technology is helping to mitigate the speed and agility issue. Big data analytics platforms, for example, allow companies to make sense of the data they have, wherever the data are, without having to pull them all into one place and make costly investments in unnecessary storage. Well-established cloud and open-source technologies permit rapid development and iteration of pilot projects and narrowly focused initiatives. And more recent technological developments are allowing companies to quickly combine unstructured and structured data within a common framework. At the same time, public and private clouds have created significant new opportunities to store, analyze, and serve data with reduced stand-up time4 and without costly investments.

However, as illustrated by the challenges of the global company described earlier, technology is not enough. What’s needed is a practical approach for creating a workable partnership.

How to work better together
Many observers have correctly noted that today’s CMO must master not just the art of strategy and creativity but also the science of analytics to identify and capture revenue opportunities. What that means in practice is that CMOs must have a passion for facts and measurement, the ability to discern the specific business opportunities that big data presents, and a clear vision of how to capture the opportunities and accelerate performance. Most important, they must be able to define their vision with precision from the beginning of data analysis to the delivery of a solution to the front lines to the tracking of earnings impact.5
The CIO, on the other hand, must shift IT from being a cost center to being a business-revenue facilitator and enabler. In the big data era, the CIO is accountable for using technical infrastructure to enable and accelerate revenue growth. We believe that a large portion of CEOs expect IT to use the cloud for innovations that create value rather than as a way to increase IT productivity. CIOs must have a keen sense that technology is a means to achieve business ends and use sophisticated analytics to make business cases.

It’s easy to say that the CMO and CIO, and sometimes the CTO, should share leadership of the overall analytics effort and a mutual definition of its success. But that agreement needs to be followed quickly by the next stage: having shared accountability for business-performance improvement based on specific key performance indicators such as revenue generation, usage, and retention.

One of the most important factors for success in big data and advanced analytics, for instance, is to understand specifically what you want. When you’re looking for a needle in the haystack of big data, you really need to know what a needle looks like. A successful partnership, therefore, requires that the CMO be able to define business goals and use cases6 (a method for gathering the functional requirements of applications) of any data or analytics initiative. The CIO should provide feasibility and cost analytics regarding requirements based on use cases. That involves articulating trade-offs and options by measures such as cost, time, and priorities.

When the CMO and CIO are working together on governance and use-case development, they need to overcome a common stumbling block: the lack of a shared vocabulary or understanding of what is expected. Marketers and technology people speak very different languages, so there’s a need on both sides to become bilingual. To the CMO’s mind, defining use cases, for example, should involve writing a few clear sentences. The CIO, on the other hand, might expect ten pages. Frustration will erupt unless both the CMO and CIO take the time to bridge the expectations gap.

Avoiding these pitfalls requires the CMO and CIO to really invest in building the partnership. One company has taken the step of having the CMO’s and CIO’s offices on the same floor. At another company, the CMO and CIO host a joint dinner for their managers each quarter with the explicit goal of building camaraderie and trust within their teams.

Collaboration and coordination should involve marketing and IT organizations at large as well. Teams made up of people from both functions should define the data-use requirements with precision to ensure the proper build-out of the analytics infrastructure. These integrated teams should sit together to review, analyze, and act on the data. When good results are achieved, the team should get visible credit, for example, through public announcements at large meetings or e-mails to relevant groups from the CMO and CIO. Just as important, the CMO and CIO should find, nurture, and reward people with leadership qualities that foster successful cross-functional collaboration. Those traits often include empathy and the ability to broker agreements and resolve points of conflict constructively.

All this requires important shifts in mind-set for both marketers and IT specialists. Marketers must use their specialized expertise and experience to help IT analytics teams question assumptions and pressure test outcomes. At the same time, IT should develop more of a customer-service mentality, including listening closely to what the marketing team wants, acting as a thought partner to develop solutions, and constantly checking to see if solutions have been effective (and updating them if needed). CIOs have a critical role in helping CMOs understand software-development trade-off decisions and opportunity costs. Critically, data should be viewed as an enterprise asset rather than a departmental asset, as is too often the case. This broader view of data can help the CMO and CIO develop insights that deliver greater value to the business.

Prerequisites for success
To make their partnership work, the CMO and CIO should ensure that five prerequisites are in place and have the CEO’s explicit support for them.

Be clear on decision governance

An effective decision-governance framework makes clear how the CIO and CMO, and potentially other C-level executives as well as their respective leadership teams, must work together and support each other. This is much more far reaching than a data-governance framework, as it covers every stage in the journey of translating data into value, from setting strategy to constructing use cases, allocating funds, and deploying capabilities. Teams should be explicit about when decisions are needed, what must be decided, and who is responsible for making them. To bring the right stakeholders together, one company has developed a “business-transformation council” to tackle governance and operating-model design. Whatever structural approach is selected, this typically demands compromises on all sides to achieve clarity and specificity on roles, but the benefits of alignment in accelerating decision making and avoiding wasted work make it worthwhile.

Build the right teams

The two executives must lead a common agenda for defining, building, and acquiring advanced analytics capabilities. In our experience, that often requires the creation of a center of excellence7 where both marketing and IT people work together. They must also agree where those critical capabilities will be located (in the center of excellence or distributed across functions and locations), what the lines of reporting are, and which budget will pay for them. To help make these decisions, the trick is to map the stages of the big data value chain, from data architecting to delivery of customer offers, and describe the necessary capabilities and responsibilities for each stage. Next, roles should be assigned to each stage, with the understanding that there may need to be multiple roles for a given stage and that they will often require someone from both IT and marketing. One important lesson that an insurer learned after bringing its marketing and IT organizations closer together was that in big data–oriented companies, skill sets become indistinguishable in business units, marketing, and IT.

Provide transparency

The CMO and CIO (and potentially the CTO) must bring transparency to the process. Not only must they sit down at the start to define data-use requirements with precision, but they must also meet regularly (biweekly or monthly) to review progress and keep the effort on track. Each quarter, they should have a frank discussion about the CMO–CIO relationship and how to strengthen and sustain it. One approach is to develop a scorecard that tracks project progress and identifies breakdowns. Addressing these issues cannot be about assigning blame; that would quickly create a toxic work environment. It should be about having clear accountability and working collectively to fix any problems.

Hire IT and marketing ‘translators’

Goodwill, effort, and clarity will go a long way to bring the CMO and CIO together. But the reality is that few CMOs or CIOs have the right balance between business and technology. What each needs to do is hire “translators.” The CMO should hire someone who understands customers and business needs but speaks the language of IT. The CIO needs to hire technical people with a strong grounding in marketing campaigns and the business side. Business-solution architects, for example, put all the discovered data together and organize them so that they’re ready to analyze. They structure the data so they can be queried in meaningful ways and appropriate time frames by all relevant users.

One software company has a business-information officer for each business unit and the marketing function. This manager must understand and translate business strategy into a joint IT–enterprise architecture strategy and a technology-investment portfolio for each business unit. The team of business-information officers also supports the CIO and IT organization on topics such as IT governance and security to ensure compliance across the marketing organization.

Learn to drive before you fly

The CMO and CIO should not expect to get all aspects of the model right the first time. Instead, they should focus on a few pilots to test team compositions and new processes for collaboration. This approach allows teams to develop best practices and learn valuable lessons that can then be used to train other teams. One such lesson: don’t be afraid to fail, but keep the projects and teams small enough at first to both fail and learn quickly.

Effective use of big data and other technologies is already separating the winners from the losers, and the CMO and CIO share responsibility for the outcome. Forging a winning relationship between marketing and IT isn’t easy, but it can be done by being clear on decision governance, building the right teams, and ensuring transparency.

For an executive’s perspective on collaboration between marketing and IT, watch a video interview with Nationwide CMO Matt Jauchius on mckinseyonmarketingandsales.com.

About the authors

Matt Ariker is the chief operating officer of McKinsey’s Consumer Marketing Analytics Center and is based in the San Francisco office, Martin Harrysson is an associate principal in the Silicon Valley office, and Jesko Perrey is a director in the Düsseldorf office.

Becoming A Data Scientist: What A Data Scientist ISN’T



Community posts are submitted by members of the Big Data Community and span a range of themes. If you would like to contribute to the blog, just register to join the community.

If you’re reading this you probably already have an inkling of what a data scientist is. Have you ever considered what a data scientist isn’t? According to Vincent Granville, author of Developing Analytic Talent: Becoming a Data Scientist, data scientists are:

  • Not statisticians
  • Not data analysts
  • Not computer scientists
  • Not software engineers
  • Not business analysts

Data scientists do have some knowledge in each of these areas but also some outside of these areas.

NEITHER STATISTICIANS NOR DATA ANALYSTS: One reason the gap between statisticians and data scientists has grown over the last 15 years is that academic statisticians, who publish theoretical articles (sometimes not based on data analysis) and train statisticians, are… not statisticians anymore. Also, many statisticians think that data science is about analyzing data. But it is so much more than that! Over time, as statisticians catch up with big data and modern applications, the gap between data science and statistics will shrink.

NOT COMPUTER SCIENTISTS: First, data scientists are not computer scientists, because they don’t need the entire theoretical knowledge computer scientists have, and second, because they need to have a much better understanding of random processes, experimental design, and sampling – typically areas in which statisticians are expert. BUT data scientists DO need to be familiar with computational complexity, algorithm design, distributed architecture, and programming (R, SQL, NoSQL, Python, Java, and C++).

NOT SOFTWARE ENGINEERS: Data scientists do need to be domain experts in one or two applied domains.

NOT BUSINESS ANALYSTS: Data scientists don’t need to be MBAs, necessarily, but they do need to have success stories to share (with metrics used to quantify success), have strong business acumen, and be able to assess the ROI that data science can bring to their clients or their boss.

Data scientists do need to be good communicators to understand, and many times guess, what problems their client, boss, or executive management is trying to solve. Translating high-level English into simple, efficient, scalable, replicable, robust, flexible, platform-independent solutions is critical.

Learn more about what a data scientist is and isn’t by accessing a complimentary chapter from Developing Analytic Talent: Becoming a Data Scientist.

5 Tips to Ensure Success with BI



Identifying and selecting a Business Intelligence platform is not as onerous as it may appear at first. There is a straightforward process, but this does not mean you should take the process lightly. A successful BI project will drive remarkable business value. For example, organizations have seen fivefold increases in inventory turns or 20% gains in incremental revenue after deploying BI platforms. However, the failure rate of BI projects remains high; according to Gartner, 70 to 80% of all BI projects fail. So, how do you avoid failing? Here is your guide to ensuring your project is among the 20-30% of successful implementations.

1 Establish a partnership, do not buy a product.

You are not buying a technology product, you are engaging in a business relationship with another company. You are about to make an investment of time and money to improve the competitive position of your business, so approach the project with that goal. Evaluate the company with whom you are doing business beyond their product offering. Understand that a multiple vendor solution means maintaining partnerships with multiple folks. Here are some key items to address during the buying cycle, beyond product functionality:

• Is this a business and person with whom you want to do business in future? Can they help you improve your business? Is the company willing to work within your constraints and needs?
• Does their enablement process and support policy match your business?
• Does the company have a reputation for helping customers achieve business benefits? This can be ascertained through industry analyst reviews and 3rd party metrics such as Net Promoter Score.
• Does your team’s skill set align with that provided by the platform? What incremental resources will you need to maintain and run the platform?

2 Solve business problems, not data problems.

Two primary reasons BI projects fail are that the initial requirements do not drive specific business value, or that the requirements change and the platform cannot adapt. A set of cool dashboards will not win you a promotion, but improving your business competitively will. To be successful, establish the technology requirements for your BI platform based on business requirements. Ask yourself: what business metrics will this project improve? Plan for the reality that your business will change and grow and so, too, will your requirements.

3 Get executive buy-in.

Create a business case for the platform purchase, based on your business goal. Map out a total cost of ownership model and business benefit metric (cost reduction or revenue improvement) over at least a 3 year period. Get executive agreement on the business plan, purchase process, decision maker, budget, and timeline.
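A minimal sketch of what such a business case might look like in code, assuming a simple three-year plan. The cost categories, the `YearPlan` and `business_case` names, and every figure below are hypothetical placeholders for illustration, not numbers from any real project.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class YearPlan:
    """Hypothetical planning figures for one year of a BI platform."""
    license_cost: float
    people_cost: float
    infrastructure_cost: float
    expected_benefit: float  # cost reduction or revenue improvement

    @property
    def total_cost(self) -> float:
        return self.license_cost + self.people_cost + self.infrastructure_cost


def business_case(years: list[YearPlan]) -> dict[str, float]:
    """Summarize total cost of ownership (TCO) and net benefit over the plan."""
    tco = sum(year.total_cost for year in years)
    benefit = sum(year.expected_benefit for year in years)
    return {
        "tco": tco,
        "benefit": benefit,
        "net_benefit": benefit - tco,
        "roi": (benefit - tco) / tco if tco else 0.0,
    }


# Placeholder figures only; substitute your own estimates.
plan = [
    YearPlan(100_000, 80_000, 20_000, 50_000),
    YearPlan(100_000, 60_000, 10_000, 250_000),
    YearPlan(100_000, 60_000, 10_000, 400_000),
]
summary = business_case(plan)
print(summary)
```

Keeping the model this small makes it easy to walk executives through each assumption and to rerun the numbers when an estimate changes.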

4 Build a short-list.
From your research and requirements, establish a short-list of vendors. Reach out to the vendors, get demos, and ask questions about functionality and services. Ask for third party research that validates what they tell you about their capabilities and market position. If possible, ask if they will share Net Promoter Scores (NPS) with you. Be sure to give vendors specific use cases and requirements, and issue an RFP if desired. The goal: find out if this company (and platform) will enable you to achieve your specific business goal. Download 11 Key Questions to Ask of a BI Solution to get started building a short list of vendors that are best suited to meet your BI needs.

5 Make a selection.

Get to know the target vendors on your shortlist. Ask detailed questions and meet other employees besides the sales team. You may want to do a workshop where each company shows their platform performing one of your specific use case scenarios. Do a Total Cost of Ownership (TCO) comparison. Pricing and packaging among vendors differ considerably and license costs are only a component of the overall costs required to run the software. Ensure you understand the full cost of deploying, maintaining, and running the platform, including people, software, hardware, and upgrades. Evaluate vendors on several factors. While not necessarily product specific, there are numerous other factors that should be taken into account. These include company character, user experience, service and support, company viability, total cost of ownership, security, and flexibility. For a comprehensive approach to procuring Business Intelligence, download the BI Buyer’s Kit here.
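One way to compare vendors across the factors listed above is a simple weighted scoring matrix. The sketch below is an illustration only: the 1-to-5 scores, the weights, the vendor names, and the function names are all hypothetical, and your own team should agree on the weights before scoring anyone.

```python
# Factors taken from the list above; weights and scores are made-up examples.
FACTORS = [
    "company character",
    "user experience",
    "service and support",
    "company viability",
    "total cost of ownership",
    "security",
    "flexibility",
]


def weighted_score(scores, weights):
    """Weighted average of one vendor's scores across all factors."""
    total_weight = sum(weights[factor] for factor in FACTORS)
    weighted_sum = sum(scores[factor] * weights[factor] for factor in FACTORS)
    return weighted_sum / total_weight


def rank_vendors(vendors, weights):
    """Return (vendor name, score) pairs, best score first."""
    ranked = [(name, weighted_score(scores, weights))
              for name, scores in vendors.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical weights: every factor counts once, TCO counts double.
weights = {factor: 1.0 for factor in FACTORS}
weights["total cost of ownership"] = 2.0

# Hypothetical scores on a 1 (poor) to 5 (excellent) scale.
vendors = {
    "Vendor A": {factor: 3 for factor in FACTORS},
    "Vendor B": {**{factor: 4 for factor in FACTORS},
                 "total cost of ownership": 2},
}

ranking = rank_vendors(vendors, weights)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

A scoring matrix will not make the decision for you, but it does force the team to state what matters most before the demos start.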


How do you define a machine learning problem

Jason Brownlee Jason@machinelearningmastery.com

A danger with applied machine learning is diving in and running algorithms on the dataset. It’s dangerous because your understanding of the problem is limited, which will in turn limit the results you can get.

There are at least two things you want to do before that point and the first is to clearly define the problem. It takes some discipline, but it really pays off in getting great results and understanding what they mean.

Like you, I usually just want to dive in. To speed up the process, I like to understand the problem a little bit from many different perspectives.

I use a three-step approach:

– Step 1: What is the problem? (capture descriptions, formalisms and assumptions)
– Step 2: Why does the problem need to be solved? (motivation, benefits, and form of the solution)
– Step 3: How would I solve the problem? (how to solve the problem manually)

This last step helps you to understand if and why the problem is complex and requires a machine learning based solution.

Read more about this strategy here http://machinelearningmastery.com/how-to-define-your-machine-learning-problem/

I’ll speak to you soon.



Lead by Doing, Not by Delegating


It’s just another day in leadership paradise. An important project is languishing—like a bad houseguest, it’s going nowhere, but no one is calling it out. As the head of your team, do you take the matter into your own hands and get the job done, or continue to slough it off on an unfortunate subordinate?

We’ve all been there. Most often, the initiatives that tend to suffer are the strategic initiatives of the important-but-not-urgent variety. (After all, if they were pressing, they’d be on track, receiving the attention they deserve.) Everyone involved is frustrated, especially the rock star assigned to lead the effort. Every status meeting is like a scene from the movie Groundhog Day, with the (perhaps now former) rock star reporting lots of activity (but little progress) and serving up a revised plan with new activities and (extended) timelines. Something has to change. The question is, what?

Let’s consider a real-life situation facing a senior executive: Carrie (not her real name) identified the need for significant culture change within her organization. She delegated the effort to Adam (not his real name), one of her best and brightest, and met with him on a monthly basis to provide direction and input. But after a year of hard work surveying, analyzing, and framing the initiative, Carrie and Adam had generated more paper than progress, and the organization’s top leaders still couldn’t agree on what needed to change or how to proceed.

What’s causing the problem? Carrie delegated a challenge, something she didn’t fully understand, to someone equally clueless. Organizations struggle to manage strategic initiatives because they’re typically broad in scope and scale and full of unknowns, making it difficult to successfully navigate from inception to completion in a straight line. When Adam received the assignment, he appreciated Carrie’s vote of confidence and worked hard to reflect that confidence in how he learned and attempted to lead. Carrie misread Adam’s confidence for competence. Neither one of them knew what they did not know, but assumed the other one did—a classic case of the blind leading the blind.

Professional challenges come in all shapes and sizes, whether they’re related to strategy making, innovation, process improvements, strategic partnerships, mergers, or technology-enabled transformations. When hard problems are delegated and success is elusive, the go-to reaction is to blame the people involved—starting with the aforementioned rock star. But when princes start looking like pigs, leaders should look in the mirror to see the face of the person who is to blame and who should have led the project all along.

Delegating difficult issues is tempting, but it can only lead to disappointment. Leaders shouldn’t assume that all projects can be assigned to others in the same manner. Day-to-day operational work is safely delegated using the traditional methods of assigning accountability, establishing target outcomes, and monitoring progress. But strategic, change-oriented initiatives require hands-on leadership by senior executives who have the passion, perspective, and power to pull it off.

In this instance, executives must truly be leaders, rather than just sponsors. Sponsorship is a watered-down version of leadership, hallmarked by monthly attendance at well-scripted steering committee meetings. Leadership of hard problems is a hands-on, roll-up-your-sleeves, messy job—a set of skills often left behind as executives move up the organizational food chain and away from the day-to-day work routine.


Fortunately, Carrie realized the error of her wayward delegation in time to save the demoralized Adam. She restarted the culture change initiative, put herself in place as the leader, secured expert counsel to provide guidance (for her and her direct reports), and positioned Adam as a staff resource.

According to Patrick Lencioni, leaders like Carrie, who have limited bandwidth, must start by asking themselves what is most important—right now. This singular, temporary rallying cry serves to focus and unify the leader, her team, and, ultimately, the entire organization. By defining the most important strategic initiative for any given period of time, leaders can avoid overwhelming themselves or their organizations with change that cannot be effectively conceived or implemented.

In the world of hypercompetitive market conditions, specialists, and experts, retaining a difficult project can seem a bit counterintuitive. Delegation can be successful when the work is well understood. Otherwise, it’s merely an abdication of responsibility. Don’t fall into its all-too-easy trap. Doing so not only wastes time and money; it also fosters organizational cynicism rather than success.

Will Spark, Google Dataflow Steal Hadoop’s Thunder?

Apache Spark and Google’s Cloud Dataflow service won’t kill Hadoop, but they’re aimed at the high-value role in big data analysis.

Google captured the big data community’s attention last week by announcing Google Cloud Dataflow, a service that replaces MapReduce processing. Apache Spark will grab the spotlight at Spark Summit 2014 in San Francisco this week, and Databricks, the company behind Spark, will make more announcements that will shake up the big data world.

Dataflow and Spark are making waves because they’re putting MapReduce, a core component of Hadoop, on the endangered-species list. The Hadoop community was already moving away from MapReduce because its slow, batchy nature and obscure programming approaches weren’t compatible with enterprise adoption. Last year’s Hadoop 2.0 release incorporated YARN (Yet Another Resource Negotiator) to support a much wider range of data-analysis approaches. YARN notwithstanding, Spark in particular might just replace much more than MapReduce, even if there’s no intention to kill Hadoop.
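For readers who have not seen the MapReduce programming model that is under discussion, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) applied to a word count. This is an illustration in plain Python, not Hadoop's API; the function names and the sample documents are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence, in one process."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


counts = word_count(["big data big profits", "data beats opinion"])
print(counts)
```

In a real cluster the map and reduce steps run on many machines and the shuffle moves data between them in batches, which is where much of the slow, batchy behavior described above comes from.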


Machine Learning is not Black-Box Magic.


Machine learning is not magic, it’s not AI and it doesn’t really learn like humans. Instead, it’s a very powerful set of tools that can help you gain insights and build predictive systems if you know how it works and how to use it.


Thanks to the explosive growth of big data services and the emergence of various success stories of machine learning in media [1, 2, 3], there are probably more people interested in machine learning now than ever [4]. Additionally, the abundance of online educational resources (such as the popular free class in machine learning) has made it very easy to jump into the topic and the availability of many high quality machine learning tools has enabled technically savvy people to apply machine learning to their problems. In fact, many of the machine learning tools are so well designed (including our own Alpine!) that a person doesn’t have to understand what’s going on underneath to train classifiers (among other models).

Unfortunately, it seems that the hype and the easy tools have led a lot of organizations to jump on the bandwagon without really understanding what makes machine learning work. For example, a couple of heads of data analytics teams I interviewed with told me (paraphrasing), “Now we have all these logs, we must surely be able to apply machine learning and improve our services!” and “I don’t know how these algorithms work, but I heard Support Vector Machine is great, so let’s use that to train classifiers!”

Although these are but a couple of anecdotal examples, I have a strong suspicion that such an attitude might in fact be common among many companies out there, which typically start out with expertise in their particular domains and only turn to data science later, once they have acquired enough data from customers. And based on my encounters, I believe that the people in charge of data analytics often misjudge the difficulty of applying machine learning and believe that it can be applied as a black-box component that automatically discovers patterns in their data.

Unfortunately, with this sort of attitude, companies are keeping themselves from fully leveraging their data, and they might even be harming themselves by misinterpreting the data and acting on wrong conclusions. To become a more mature, data-science-oriented organization, one should drop the black-box approach and get more serious about understanding the internals of data science.

What’s the danger in treating machine learning as a black-box component? This becomes more obvious once one realizes that machine learning is just a tool. It doesn’t magically build a predictive model for you. A person still has to build the model. For example, if one is building a model through supervised learning, he/she should already know his/her problem domain well enough to know the data and features that are needed to predict the target. Additionally, this person should know the details of his/her tools (machine learning models), their properties, strengths, weaknesses, etc. (e.g. can the model learn non-linear or non-monotonic relationships?) A good analogy may be a programmer, his/her application domain and his/her programming language. Would you trust a person with little expertise in any language to build a high-quality software system? Can a person who knows nothing about the application area build the system?

Despite the name ‘machine-learning’ (statistics people will tell you that it’s a PR achievement by computer scientists), machine learning algorithms are rarely capable of discovering completely new insights from arbitrary data. E.g. they will rarely discover patterns if they are fed raw data without proper transformations. Machine learning algorithms usually work best if the person using them already has a good ‘theory’ about how the prediction system should be structured (e.g. it helps to know whether variables are monotonically related). The tools are merely there to help the person ‘configure’ the structure (e.g. find coefficients of linear models, find split points of trees, etc.). Although there are active research areas (such as the deep belief networks mentioned in this article) that are trying to get machines to learn and discover new structures, once you actually try these algorithms, you’ll quickly realize that even these state-of-the-art algorithms are limited in what they can learn. And despite their intent to be ‘automatic’ feature learners, ironically, these new algorithms require even more in-depth knowledge from the user to be successful.

What are some specific examples of machine learning knowledge that you need to be better at data science? The academic trove of machine learning knowledge is so broad that if one says all of statistics and computer science, he/she may not be exaggerating. But besides thousands of pages of math and engineering knowledge, I believe that there is a more concise list of practical tips that can help trained people avoid certain common mistakes. E.g., here is a partial list of some obvious and some not-so-obvious caveats in machine learning applications.
•Research the problem domain in advance, invest time in reasonable feature collection/transformation/engineering and form reasonable hypotheses about the relationship between the predictors and the predicted. As I mentioned above, machine learning usually doesn’t work if you don’t already have some good ideas about the problem, the features and the structure of the solution. In particular, figuring out the killer features and transformations will likely consume most of your time.
◦For example, you might find that, when you are doing stock analyses of many different types of companies, you do much better when features are normalized ‘per-company-type’ rather than globally.
◦Other forms of domain knowledge may be incorporated in the form of the model’s bias (linear, SVM with polynomial kernels, trees, etc.), constraints (e.g. you might already know that a variable should always have a positive coefficient), intercept (in some problems, not having a bias term may make more sense), etc.
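
As a rough illustration of the ‘per-company-type’ normalization idea, the sketch below z-scores one feature across the whole sample and then again within each company type. Everything in it is made up for the example: the `zscore` and `zscore_by_group` helpers, the price-to-earnings feature and the six data points do not come from any real data set.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore(values):
    """Standardize values using the mean and std of the whole list."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]


def zscore_by_group(groups, values):
    """Standardize each value using only the values that share its group label."""
    by_group = defaultdict(list)
    for g, v in zip(groups, values):
        by_group[g].append(v)
    stats = {g: (mean(vs), pstdev(vs)) for g, vs in by_group.items()}
    result = []
    for g, v in zip(groups, values):
        mu, sigma = stats[g]
        result.append((v - mu) / sigma if sigma > 0 else 0.0)
    return result


# Hypothetical price-to-earnings ratios for two company types.
company_type = ["utility", "utility", "utility", "tech", "tech", "tech"]
pe_ratio = [10.0, 12.0, 14.0, 40.0, 50.0, 60.0]

global_z = zscore(pe_ratio)
per_type_z = zscore_by_group(company_type, pe_ratio)
```

With global scaling, every utility lands below zero and every tech company above it, so the feature mostly encodes the company type. Scaled per type, the cheapest utility and the cheapest tech company get the same score, which is closer to ‘cheap relative to its peers’.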

•Know the bias of your data. This is probably one of the most important things you have to know in advance. Your data (both training and validation) often do not represent the ‘true population’. There are all kinds of biases that you might not be aware of. Don’t assume that a classifier that you trained, even if it performs well in your validation data, can be applied to random population samples, because your training/validation samples may not be random.
◦Some common biases include selection bias (your sensors are not random) and presentation bias (e.g. if you’re collecting click information, your data are heavily skewed to what the user sees on the first few pages). More subtle ones include survivorship bias (e.g., when you are doing stock analysis, you may only have data of ‘surviving’ companies and not the failed ones), etc.
◦Failure to know the bias of your data may lead to a catastrophic result in some domains. As an example, I’ve heard a story that the downfall of Long-Term Capital Management in the 1990s [5, 6] can be partly attributed to its risk model that was trained with a biased sample that contained very few downturns.

•Know the inductive bias of the model and how this will limit what you can learn. Most of the machine learning algorithms out there start with assumptions about the relationship between the predictor and the predicted variables (more formally, this is referred to as the hypothesis space of the model).
◦For instance, I suspect that most of the people out there already have pretty intuitive ideas about what linear models mean. However, it helps to be more explicit about details.
◦It turns out that even non-linear algorithms such as decision trees have inductive biases.
◦E.g. with a linear classifier, you can’t learn an XOR function and even decision trees may not be able to learn it with a balanced data set and a vanilla approach. In short it helps to know algorithmic details if you want to learn particular patterns.
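
The XOR caveat can be checked in a few lines. The following is a minimal sketch, not a reference implementation: it uses a plain perceptron as a stand-in for ‘a linear classifier’, a hand-made interaction feature `x1 * x2` as an assumed example of a useful transformation, and an entropy-based information gain as an assumed splitting criterion for a ‘vanilla’ greedy tree. All function names are hypothetical.

```python
from math import log2

# XOR truth table: the label is 1 when exactly one input is 1.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]


def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0


def train_perceptron(X, y, epochs=1000):
    """Plain perceptron with 0/1 labels; stops early once an epoch has no mistakes."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            step = yi - predict(w, b, xi)  # 0 if correct, otherwise +1 or -1
            if step != 0:
                w = [wj + step * xj for wj, xj in zip(w, xi)]
                b += step
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


def accuracy(w, b, X, y):
    return sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(y)


def entropy(labels):
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * log2(p) for p in probs)


def information_gain(X, y, feature):
    """Entropy reduction from splitting on one binary feature."""
    left = [yi for xi, yi in zip(X, y) if xi[feature] == 0]
    right = [yi for xi, yi in zip(X, y) if xi[feature] == 1]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
    return entropy(y) - weighted


# 1) A linear model on the raw inputs cannot get all four points right.
w_raw, b_raw = train_perceptron(X, y)
acc_raw = accuracy(w_raw, b_raw, X, y)

# 2) Adding the interaction feature x1 * x2 makes XOR linearly separable.
X_aug = [(x1, x2, x1 * x2) for x1, x2 in X]
w_aug, b_aug = train_perceptron(X_aug, y)
acc_aug = accuracy(w_aug, b_aug, X_aug, y)

# 3) A greedy tree sees zero gain from either first split on balanced XOR data.
gains = [information_gain(X, y, f) for f in (0, 1)]
```

The perceptron never classifies all four raw points correctly, but it does once the interaction feature is added. The zero information gain for both candidate first splits is why a greedy tree has no reason to start splitting on balanced XOR data.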

•It often helps to do proper scaling/normalization on data. E.g., with certain algorithms scaling your numeric data could yield very different results.
◦Regularized linear models may not make much sense unless you equally normalize all the predictor variables.
◦Neural networks and deep-belief networks are also heavily affected by scaling of features (I found that scaling outputs from lower layers could also help in some cases).
◦If you want to get some feel for variable importance in unregularized linear models, predictors should be properly normalized.
◦Algorithms based on decision trees (e.g., Random Forest, Boosted Trees, etc.) are usually immune to these sorts of ‘monotonic’ feature transformations.
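
To make the scaling point concrete, here is a small sketch under simplifying assumptions: a one-feature ridge regression without intercept stands in for ‘regularized linear models’, and a single threshold split (a decision stump) stands in for tree-based algorithms. The helper names and the toy numbers are invented for this example.

```python
from math import log


def ridge_slope(x, y, lam):
    """Closed-form 1-D ridge regression without intercept: w = sum(x*y) / (sum(x*x) + lam)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)


def stump_left_indices(x, y):
    """Indices left of the threshold split with the fewest majority-vote errors."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best_err, best_left = None, None
    for k in range(1, len(order)):
        left, right = order[:k], order[k:]
        err = 0
        for side in (left, right):
            ones = sum(y[i] for i in side)
            err += min(ones, len(side) - ones)  # errors if the side predicts its majority
        if best_err is None or err < best_err:
            best_err, best_left = err, frozenset(left)
    return best_left


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]  # y = 2x exactly

# Same data and same penalty, but the feature is recorded in units 1000 times smaller
# (as if it were measured in metres instead of kilometres).
x_scaled = [v * 1000.0 for v in x]
pred_original = ridge_slope(x, y, lam=1.0) * x[0]
pred_rescaled = ridge_slope(x_scaled, y, lam=1.0) * x_scaled[0]

# A threshold split depends only on the ordering of the feature values,
# so a monotonic transform such as log leaves the chosen partition unchanged.
sizes = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
labels = [0, 0, 0, 1, 1, 1]
split_raw = stump_left_indices(sizes, labels)
split_log = stump_left_indices([log(v) for v in sizes], labels)
```

With the same penalty, the ridge prediction for the first point moves from roughly 1.87 to almost exactly 2 just because the feature’s unit changed, so the strength of the regularization depends on the scale of the feature. The stump picks the same partition before and after the log transform.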

•There are certain classes of algorithms that could be used as black-boxes while yielding good results. E.g., algorithms such as random forest require very little tuning and can capture a lot of non-linear relationships. When you have very little idea about the problem domain initially, these algorithms could provide a good starting point. However, these guys still can benefit greatly from good features and transformations. Additionally, more expressive models like trees are more prone to over-fitting, particularly if you don’t have a lot of data.
•There are usually multiple ways to interpret models, and it may help to look at all these different aspects. For example, Support Vector Machine is usually introduced as a ‘max-margin’ classifier. However, another way to look at SVM is as an L2-regularized hinge-loss model. When looked at as a regularized model, one realizes that feature normalization would be an important prerequisite.
•Be aware of predictors’ origins. Some of them may be derived from the same source as the label, and in such cases it makes no sense to use them as predictors. Large organizations often have hundreds or thousands of predictor variables, and sometimes you may not realize that your predictor variables have the same source as the label. E.g., when your click prediction seems incredibly accurate, one of your predictors could also be derived from the click information itself. This is sometimes referred to as information leakage.
•In typical industrial data sets, outliers are very common and they can mess up your conclusions. Certain algorithms like linear regression are heavily affected by outliers. E.g., say you are trying to predict house price from square footage. The two might have a positive linear relationship, but if you have a few strange data samples (say a couple of extremely cheap sales for really large houses), your learned model would get all messed up. You can either remove the outliers or use learning algorithms based on robust statistics (trimmed regression, Huber loss regression, etc.). I feel that robust techniques are not mentioned often enough in typical literature.
•Explore ‘hyper-parameters’ of the algorithms but be aware that you can over-fit hyper-parameters as well. Algorithms based on support vector machine, boosting, regularization, neural network, etc. have additional parameters you can tune (e.g., lambda or cost variable in SVM that controls the tradeoff between the loss and the regularization terms) and changing these can yield very different results. You should know what these hyper-parameters really mean – e.g. it helps to know the difference between L1 and L2 regularizations, etc.
•Additionally, when you are trying to find ‘optimal’ hyper-parameters of algorithms, you are essentially doing greedy learning of the hyper-parameters. It’s a good idea to divide the data up into three sets – a training set, a validation set 1, and a validation set 2 (or you can do cross-validations).
1.You train your model on the training set with a particular set of hyper-parameters.
2.Measure the performance of the model on the validation set 1.
3.Repeat the steps 1 and 2 with different hyper-parameters. Find the hyper-parameters that yield the best results on the validation set 1. This is essentially hyper-parameter learning.
4.Your real generalization performance should be measured on the validation set 2.

As a useful reference, this paper talks about the potential for over-fitting hyper-parameters.
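
The four steps above can be sketched roughly as follows. This is only an illustration under assumed choices: a one-feature ridge regression whose penalty `lam` is the hyper-parameter, a 60/20/20 split, mean squared error as the performance measure, and synthetic data. None of these come from the article, and all function names are hypothetical.

```python
import random


def three_way_split(rows, seed=0, train_pct=60, val1_pct=20):
    """Shuffle once, then cut into training, validation 1 and validation 2 sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_val1 = len(shuffled) * val1_pct // 100
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val1],
            shuffled[n_train + n_val1:])


def fit_ridge(rows, lam):
    """1-D ridge regression without intercept; lam is the hyper-parameter."""
    sxy = sum(x * y for x, y in rows)
    sxx = sum(x * x for x, y in rows)
    return sxy / (sxx + lam)


def mse(w, rows):
    return sum((y - w * x) ** 2 for x, y in rows) / len(rows)


def select_and_evaluate(rows, lambdas, seed=0):
    train, val1, val2 = three_way_split(rows, seed)
    # Steps 1-3: fit on the training set, score on validation set 1, keep the best.
    val1_scores = {lam: mse(fit_ridge(train, lam), val1) for lam in lambdas}
    best_lam = min(val1_scores, key=val1_scores.get)
    # Step 4: report performance on validation set 2, which played no part in the choice.
    val2_score = mse(fit_ridge(train, best_lam), val2)
    return best_lam, val1_scores[best_lam], val2_score


# Synthetic data for illustration only: y = 3x plus Gaussian noise.
rng = random.Random(42)
data = []
for _ in range(100):
    x = rng.uniform(-2.0, 2.0)
    data.append((x, 3.0 * x + rng.gauss(0.0, 1.0)))

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
best_lam, val1_mse, val2_mse = select_and_evaluate(data, lambdas)
```

The score on validation set 1 is optimistically biased because it was used to pick `best_lam`; the score on validation set 2 is the one to quote as generalization performance.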

•Keep in mind that accuracy is not the only important criterion. If your service depends on run-time prediction, you might prefer simpler models that run faster. Often organizations make more money by processing more requests, rather than being more accurate but slower in their predictions. E.g., the Netflix prize winning algorithm, while impressive in its predictive performance, may not be practical for runtime product recommendations.
•Don’t believe that the state-of-the-art results in literature can be readily applied in your domain. Nowadays, neural network based algorithms (such as deep-belief-net and drop-out network) are breaking all kinds of records with famous data sets and they are exciting to watch. However, training these neural networks is a very difficult task and often you’ll find that they don’t work miracles when you try to apply them in your domains (or that they require a lot of tuning).
•The lack of a strong pattern in your data when you train a model doesn’t prove that there isn’t one. You may not have transformed features properly or you may not be using the right model. I’ve heard a story about some person concluding that there’s no use for a particular feature in predicting stock prices because his linear model said so. However, some other person found that in fact the feature was very useful, with proper transformations.
•Never, ever mix up training and validation data. This sounds like such a basic thing and some people may feel insulted that this is mentioned at all. But I’ve actually seen a senior guy who mixed up training and validation data and refused to acknowledge that his conclusions may be wrong.
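
To illustrate the outlier caveat from the list above with the house-price example, the sketch below compares an ordinary least-squares slope with a robust one. For brevity it uses the Theil-Sen estimator (the median of all pairwise slopes), which is a different robust technique from the trimmed and Huber loss regressions named above. The sales figures are invented.

```python
from itertools import combinations
from statistics import mean, median


def ols_slope(x, y):
    """Ordinary least-squares slope for a line with intercept."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def theil_sen_slope(x, y):
    """Median of the slopes through every pair of points (Theil-Sen estimator)."""
    points = list(zip(x, y))
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(points, 2)
              if x2 != x1]
    return median(slopes)


# Made-up sales: six houses priced at exactly 100 per square foot,
# plus two very large houses sold extremely cheaply.
sqft = [1000.0, 1200.0, 1500.0, 1800.0, 2000.0, 2500.0, 5000.0, 5200.0]
price = [100.0 * s for s in sqft[:6]] + [50000.0, 40000.0]

slope_ols = ols_slope(sqft, price)
slope_robust = theil_sen_slope(sqft, price)
```

On this toy data the two cheap sales are enough to turn the least-squares slope negative, while the median of pairwise slopes stays at 100 per square foot.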

Because of all these caveats, it often takes time and several iterations to properly explore/experiment with your data and come up with reasonable conclusions.

In short, if you want to take full advantage of the data your organization has accumulated over time, do not believe in the black-box magic of machine learning! Emphasize the skills to understand, interpret and apply the internals of machine learning models and algorithms.