The Life of a Data Scientist


Data scientists are big data wranglers. They take an enormous mass of messy data points (unstructured and structured) and use their formidable skills in math, statistics and programming to clean, massage and organize them. Then they apply all their analytic powers – industry knowledge, contextual understanding, skepticism of existing assumptions – to uncover hidden solutions to business challenges.

Data Scientist Responsibilities

“A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.”

On any given day, a data scientist may be required to:

  • Conduct undirected research and frame open-ended industry questions
  • Extract huge volumes of data from multiple internal and external sources
  • Employ sophisticated analytics programs, machine learning and statistical methods to prepare data for use in predictive and prescriptive modeling
  • Thoroughly clean and prune data to discard irrelevant information
  • Explore and examine data from a variety of angles to determine hidden weaknesses, trends and/or opportunities
  • Devise data-driven solutions to the most pressing challenges
  • Invent new algorithms to solve problems and build new tools to automate work
  • Communicate predictions and findings to management and IT departments through effective data visualizations and reports
  • Recommend cost-effective changes to existing procedures and strategies

Every company will have a different take on job tasks. Some treat their data scientists as glorified data analysts or combine their duties with data engineers; others need top-level analytics experts skilled in intense machine learning and data visualizations.

As data scientists achieve new levels of experience or change jobs, their responsibilities invariably change. For example, a person working alone in a mid-size company may spend a good portion of the day in data cleaning and munging. A high-level employee in a business that offers data-based services may be asked to structure big data projects or create new products.

An Interview with a Real Data Scientist

We caught up with Lisa Qian, Data Scientist at Airbnb, to find out what it’s like to work as a data scientist. Read on to learn about the impact data science has on Airbnb’s success, the programming languages they use on the job, and what students need to know in order to succeed.

Q: What are the top pros & cons of your job?
A: Things happen very quickly and data scientists have a big impact (see answer to next question). At Airbnb, there are so many interesting problems to work on and so much interesting data to play with. The culture of the company also encourages us to work on lots of different things. I have been at Airbnb for less than two years and I have already worked on three completely different product teams. There’s really never a dull moment. This can also be a “con” of the job. Because there are so many interesting things to work on, I often wish that I had more time to go more in depth on a project. I’m often juggling multiple projects at once, and when I’m 90% done with one of them, I’ll just move on to something else. Coming from academia where one spends years and years on one project without leaving a single rock unturned (I did a PhD in physics), this has been a delightful, but sometimes frustrating, cultural transition.
Q: How much of an impact do data scientists have on Airbnb’s overall success?
A: A ton! As a data scientist, I’m involved in every step of a product’s life cycle. For example, right now I am part of the Search team. I am heavily involved in research and strategizing where I use data to identify areas that we should invest in and come up with concrete product ideas to solve these problems. From there, if the solution is to come up with a data product, I might work with engineers to develop the product. I then design experiments to quantify the effect and impact of the product, and then run and analyze the experiment. Finally, I will take what I learned and provide insights and suggestions for the next product iteration. Every product team at Airbnb has engineers, designers, product managers, and one or more data scientists. You can imagine the impact data scientists have on the company!
Q: Which skills or programming languages do you most frequently use in your work, and why?
A: At Airbnb, we all use Hive (which is similar to SQL) to query data and build derived tables. I use R to do analysis and build models. I use Hive and R every day of the job. A lot of data scientists use Python instead of R – it’s just a matter of what we were familiar with when we came in. There have also been recent efforts to use Spark to build large-scale machine learning models. I haven’t gotten a chance to try it out yet, but plan on doing so in the near future. It seems very powerful.
Q: What kind of person makes the best data scientist?
A: Successful data scientists have a strong technical background, but the best data scientists also have great intuition about data. Rather than throwing every feature possible into a black box machine learning model and seeing what comes out, one should first think about whether the data makes sense. Are the features meaningful, and do they reflect what you think they should mean? Given the way your data is distributed, which model should you be using? What does it mean if a value is missing, and what should you do with it? The answers to these questions differ depending on the problem you are solving, the way the data was logged, etc., and the best data scientists look for and adapt to these different scenarios.

The best data scientists are also great at communicating, both to other data scientists and non-technical people. In order to be effective at Airbnb, our analyses have to be both technically rigorous and presented in a clear and actionable way to other members of the company.

Q: What advice would you offer students preparing for a position as a data scientist?
A: Beyond taking programming and statistics courses, I would recommend doing everything possible to get your hands dirty and work with real data. If you don’t have the time to do an internship, sign up to participate in hackathons or offer to help out a local startup by tackling a data problem they have. Courses and books are great for developing fundamental technical skills, but many data science skills can’t be properly developed in a classroom where data sets are well groomed.

Data Scientist Salaries

The term “data scientist” is the hottest job title in the IT field – with starting salaries to match. It should come as no surprise that Silicon Valley is the new Jerusalem. According to a 2014 Burtch Works study, 36% of data scientists work on the West Coast. Entry-level professionals in that area earn a median base salary of $100,000 – 22% more than their Northeast peers.

Data Scientist

Average Salary (2015): $118,709 per year
Minimum: $76,000
Maximum: $148,000

Median Salary (2015): $93,991 per year
Total Pay Range: $63,524 – $138,123

Senior Data Scientist

Median Salary (2015): $124,273 per year
Total Pay Range: $89,801 – $179,445

Data Scientist Qualifications

What Kind of Degree Will I Need?

Broadly speaking, you have 3 education options if you’re considering a career as a data scientist:

  1. Degrees and graduate certificates provide structure, internships, networking and recognized academic qualifications for your résumé. They will also cost you significant time and money.
  2. MOOCs and self-guided learning courses are free/cheap, short and targeted. They allow you to complete projects on your own time – but they require you to structure your own academic path.
  3. Bootcamps are intense and faster to complete than traditional degrees. They may be taught by practicing data scientists, but they won’t give you degree initials after your name.

Academic qualifications may be more important than you imagine. As Burtch Works notes, “it’s incredibly rare for someone without an advanced quantitative degree to have the technical skills necessary to be a data scientist.”

In its data science salary report, Burtch Works determined that 88% of data scientists have a master’s degree and 46% have a PhD. The majority of these degrees are in rigorous quantitative, technical or scientific subjects, including math and statistics (32%), computer science (19%) and engineering (16%).

With that being said, companies are desperate for candidates with real-world skills. Your technical know-how may trump preferred degree requirements.

Note: Check out our list of 23 Great Schools with Master’s Programs in Data Science.

What Kind of Skills Will I Need?

Technical Skills

  • Math (e.g. linear algebra, calculus and probability)
  • Statistics (e.g. hypothesis testing and summary statistics)
  • Machine learning tools and techniques (e.g. k-nearest neighbors, random forests, ensemble methods, etc.)
  • Software engineering skills (e.g. distributed computing, algorithms and data structures)
  • Data mining
  • Data cleaning and munging
  • Data visualization (e.g. ggplot and d3.js) and reporting techniques
  • Unstructured data techniques
  • R and/or SAS languages
  • SQL databases and database querying languages
  • Python (most common), C/C++, Java, Perl
  • Big data platforms like Hadoop, Hive & Pig
  • Cloud tools like Amazon S3

This list is always subject to change. As Anmol Rajpurohit suggests, “generic programming skills are a lot more important than being the expert of any particular programming language.”

Business Skills

  • Analytic Problem-Solving: Approaching high-level challenges with a clear eye on what is important; employing the right approach/methods to make the maximum use of time and human resources.
  • Effective Communication: Detailing your techniques and discoveries to technical and non-technical audiences in a language they can understand.
  • Intellectual Curiosity: Exploring new territories and finding creative and unusual ways to solve problems.
  • Industry Knowledge: Understanding the way your chosen industry functions and how data are collected, analyzed and utilized.

Note: You can view a handy trajectory on How to Become a Data Scientist in an infographic from Datacamp.

What About Certifications?

To avoid wasting time on poor quality certifications, ask your mentors for advice, check job listing requirements and consult articles like Tom’s IT Pro “Best Of” certification lists. Here are a few that focus on useful skills:

Certified Analytics Professional (CAP)

CAP was created in 2013 by the Institute for Operations Research and the Management Sciences (INFORMS) and is targeted towards data scientists. During the certification exam, candidates must demonstrate their expertise in the end-to-end analytics process. This includes the framing of business and analytics problems, data and methodology, model building, deployment and life cycle management.


Eligibility requirements:

  • 5+ years of analytics work-related experience for BA/BS holder in a related area
  • 3+ years of analytics work-related experience for MA/MS (or higher) holder in a related area
  • 7+ years of analytics work-related experience for BA/BS (or higher) holder in an unrelated area
  • Verification of soft skills/provision of business value by employer
  • Agreement to adhere to Code of Ethics

Cloudera Certified Professional: Data Scientist (CCP:DS)

Targeted towards the elite level, the CCP:DS is aimed at data scientists who can demonstrate advanced skills in working with big data. Candidates are drilled in 3 exams – Descriptive and Inferential Statistics, Unsupervised Machine Learning and Supervised Machine Learning – and must prove their chops by designing and developing a production-ready data science solution under real-world conditions.

Related Cloudera certifications include:

EMC: Data Science Associate (EMCDSA)

The EMCDSA certification tests your ability to apply common techniques and tools required for big data analytics. Candidates are judged on their technical expertise (e.g. employing open source tools such as “R”, Hadoop, and Postgres, etc.) and their business acumen (e.g. telling a compelling story with the data to drive business action).

Once you’ve passed the EMCDSA, you can consider the Advanced Analytics Specialty. This works on developing new skills in areas such as Hadoop (and Pig, Hive, HBase), Social Network Analysis, Natural Language Processing, data visualization methods and more.

SAS Certified Predictive Modeler using SAS Enterprise Miner 7

This certification is designed for SAS Enterprise Miner users who perform predictive analytics. Candidates must have a deep, practical understanding of the functionalities for predictive modeling available in SAS Enterprise Miner 7 before they can take the performance-based exam. This exam includes topics such as data preparation, predictive models, model assessment and scoring and implementation.

Related SAS certifications include:

Jobs Similar to Data Scientist

Some data scientists get their start working as low-level Data Analysts, extracting structured data from MySQL databases or CRM systems, developing basic visualizations or analyzing A/B test results. These jobs aren’t usually that challenging.

However, once you have your technical skills in order, you have plenty of options. If you’d like to push beyond your analytical role, you could think about building/engineering/architecture jobs such as:

Data Scientist Job Outlook

In an oft-cited 2011 big data study, McKinsey reported that by 2018 the U.S. could face a shortage of 140,000 to 190,000 “people with deep analytic skills” and 1.5 million “managers and analysts with the know-how to use the analysis of big data to make effective decisions.”

The ensuing panic has led to high demand for data scientists. Companies of every size and industry – from Google, LinkedIn and Amazon to the humble retail store – are looking for experts to help them wrestle big data into submission. Starting salaries are astronomical.

The bubble is bound to burst, of course. In a 2014 Mashable article, Roy Lowrance, the managing director of New York University’s Center for Data Science program, is quoted as saying “anything that gets hot like this can only cool off.” But even as demand for data engineers surges, job postings for big data experts are expected to remain high.

There are also some indications that the roles of data scientists and business analysts are beginning to merge. In certain companies, “new look” data scientists may find themselves responsible for financial planning, ROI assessment, budgets and a host of other duties related to the management of an organization.

Professional Organizations for Data Scientists

What is functional programming?

At its core, functional programming is about immutability and about composing
functions rather than objects. Many related characteristics fall out of this.
Functional programs do the following:
  1. Have first-class functions: First-class functions are functions that can be passed around, dynamically created, stored in data structures, and treated like any other first-class object in the language.
  2. Favor pure functions: Pure functions are functions that have no side effects. A side effect is an action that the function does that modifies state outside the function.
  3. Compose functions: Functional programming favors building programs from the bottom up by composing functions together.
  4. Use expressions: Functional programming favors expressions over statements. Expressions yield values. Statements do not and exist only to control the flow of a program.
  5. Use immutability: Since functional programming favors pure functions, which can’t mutate data, it also makes heavy use of immutable data. Instead of modifying an existing data structure, a new one is efficiently created.
  6. Transform, rather than mutate, data: Functional programming uses functions to transform immutable data. One data structure is put into the function, and a new immutable data structure comes out. This is in explicit contrast with the popular object-oriented model, which views objects as little packets of mutable state and behavior.
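
The six characteristics above can be illustrated with a minimal Python sketch. The names `compose`, `add_tax` and `apply_discount`, the 10% tax rate and the prices are all hypothetical and chosen only for illustration; amounts are held as integer cents so the arithmetic stays exact.

```python
from functools import reduce


def compose(*funcs):
    """Compose functions right to left: compose(f, g)(x) == f(g(x))."""
    return reduce(lambda f, g: lambda x: f(g(x)), funcs)


# Pure functions: the result depends only on the argument,
# and no state outside the function is modified.
def apply_discount(price_cents):
    return price_cents - 500          # hypothetical flat discount of 5.00


def add_tax(price_cents):
    return price_cents * 110 // 100   # hypothetical 10% tax


# First-class functions: both functions are passed to compose() as values,
# and compose() returns a newly created function.
final_price = compose(add_tax, apply_discount)

# Immutable data: a tuple cannot be changed in place.
prices = (2000, 5000, 10000)

# Transform rather than mutate: an expression builds a new tuple
# and leaves the original untouched.
final_prices = tuple(final_price(p) for p in prices)

print(final_prices)
print(prices)
```

Because `prices` is a tuple and both helper functions are pure, running the pipeline twice gives the same result and the input is never altered.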

Five Intuitive Predictions About Analytics


David Wagner, Community Editor

A2 is quite rightly focused on the present and the near future. But sometimes to make sure you don’t get bogged down too much in the present, you want to look ahead to get a sense of where you are going. So I decided to take some time to think of what analytics might look like at least five years out.

At first, I searched high and low for statistical trends and wanted to do my best at using analytics to predict the future of analytics. Then, part way through, I decided it would be much more fun to do what most of the A2 community is used to — I’d ignore all the great data just like your internal customers and instead use my intuition.

I’ve been covering technology and analytics for over a decade. I’ve seen trends come and go and I’ve seen the hype cycle first hand. If I can’t put my finger in the air and tell which way the wind is blowing, what good am I? So here we go. I’m making five predictions for the state of analytics in the future.

In five years, inside the enterprise, analytics is just going to be called “management.” The drive for “data-driven management” and “data-driven organizations” is going to be so successful that we are going to stop talking about analytics as a separate discipline. Managers won’t ask what the BI or analytics said to make a decision. It will be ingrained in the process. Does that mean A2 is in for a name change? No, because there is still going to be a discipline for those that prepare the numbers. They are just going to fade into the background. I think the job of data scientist is much like “webmaster” was in the 90’s. We’re going to have less grandiose names while doing more work.

I’ve seen this already with “big data.” Just two years ago, CIOs would say to me “I need to invest in big data.” Now they say, “we’re investing in a new database to track customer data.” When the language around a thing changes from the hype word to the specific, it means it is part of the norm.

In five to 10 years (OK, maybe 20), personalized medicine will lead to diagnosing cancer with analytics. Multiple big data programs around cancer including the Wisdom study and the Precision Medicine Initiative are going to start leading to clues on how we can identify cancer earlier, based on things like your search terms, Facebook status, and fitness tracker data, rather than blood tests and mammograms. The clues will come in terms of early warning signs that we often ignore now as just feeling under the weather. Data will be able to combine genetic pre-disposition, health history, and our own data from daily life to see things before we see them ourselves. Of course, we’ll still need to take the test to confirm, but it will be the analytics that inspires us to get the test.

In the next 10 years, a major sports franchise will hire a computer to be an in-game manager or general manager. Of course, a human will still have to do the interactive part like trade negotiations or telling a player they are being substituted. But someone will decide to trust in the analytics engine to be better than a human at making decisions like removing pitchers or punting versus “going for it” or for drafting players.

In 10 years, Hollywood will be routinely re-writing and re-shooting movie scripts based on biometric analysis of preview crowds. We’re already starting to track the data. It is only a matter of time until some fool thinks he can write a blockbuster using the data, and he’ll probably be right.

Driverless cars and trucks will be nice, but it is the unimaginable amount of data from driverless cars and trucks that will allow us to finally live our “Jetsons-style” future we’ve been waiting for. Traffic flow through cities, combined with the more moderate data from personal fitness trackers and phones, will allow us to redesign city infrastructure like roads, water, WiFi, sewage, garbage, and everything else. Without the driverless car data, cities will look basically as they did in the 1980’s forever.

Which of my predictions look like they will happen and which make you think my crystal ball is broken? Tell me in the comments.

Source: http://www.allanalytics.com/author.asp?section_id=3618&doc_id=280191&print=yes

Citizen data scientist

The worldwide shortage of data scientists won’t end anytime soon. To try to compensate for the shortage, data discovery solutions are automating tasks that have traditionally been done manually by a data scientist, statistician, or other analytics expert. The confluence of trends is giving rise to a new role that Gartner calls a “citizen data scientist.”

A recent Gartner report defines a citizen data scientist as “a person who creates or generates models that leverage predictive or prescriptive analytics but whose primary job function is outside of the field of statistics and analytics.”


Block Chain technology

The terminology of this new field is still evolving, with many using the terms block chain (or blockchain), distributed ledger and shared ledger interchangeably. Formal definitions are unlikely to satisfy all parties, but the key terms can be described as follows:

  1. A block chain is a type of database that takes a number of records and puts them in a block (rather like collating them on to a single sheet of paper). Each block is then ‘chained’ to the next block, using a cryptographic signature. This allows block chains to be used like a ledger, which can be shared and corroborated by anyone with the appropriate permissions.

There are many ways to corroborate the accuracy of a ledger, but they are broadly known as consensus (the term ‘mining’ is used for a variant of this process in the cryptocurrency Bitcoin) — see below.

If participants in that process are preselected, the ledger is permissioned. If the process is open to everyone, the ledger is unpermissioned — see below.

The real novelty of block chain technology is that it is more than just a database — it can also set rules about a transaction (business logic) that are tied to the transaction itself. This contrasts with conventional databases, in which rules are often set at the entire database level, or in the application, but not in the transaction.
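
The chaining of blocks described in the definition above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how any particular ledger is implemented: a SHA-256 hash stands in for the "cryptographic signature", there is no consensus step, and the function names (`block_hash`, `append_block`, `is_valid`) and the sample records are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed placeholder for "no previous block"


def block_hash(records, previous_hash):
    """Hash a block's records together with the hash of the previous block."""
    payload = json.dumps(
        {"records": records, "previous_hash": previous_hash}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain, records):
    """Return a new chain with one more block holding the given records."""
    previous_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    records = list(records)
    block = {
        "records": records,
        "previous_hash": previous_hash,
        "hash": block_hash(records, previous_hash),
    }
    return chain + [block]


def is_valid(chain):
    """Check that every block links to its predecessor and is unmodified."""
    expected_previous = GENESIS_HASH
    for block in chain:
        if block["previous_hash"] != expected_previous:
            return False
        if block["hash"] != block_hash(block["records"], block["previous_hash"]):
            return False
        expected_previous = block["hash"]
    return True


chain = append_block([], ["Alice pays Bob 5"])
chain = append_block(chain, ["Bob pays Carol 2", "Carol pays Dave 1"])
print(is_valid(chain))
```

Since each block's hash covers the previous block's hash, changing a record in an early block makes that block fail the check unless every later block is recomputed as well, which is what lets participants corroborate a shared copy.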


  1. Unpermissioned ledgers such as Bitcoin have no single owner — indeed, they cannot be owned. The purpose of an unpermissioned ledger is to allow anyone to contribute data to the ledger and for everyone in possession of the ledger to have identical copies. This creates censorship resistance, which means that no actor can prevent a transaction from being added to the ledger. Participants maintain the integrity of the ledger by reaching a consensus about its state.

Unpermissioned ledgers can be used as a global record that cannot be edited: for declaring a last will and testament, for example, or assigning property ownership. But they also pose a challenge to institutional power structures and existing industries, and this may warrant a policy response.


  1. Permissioned ledgers may have one or many owners. When a new record is added, the ledger’s integrity is checked by a limited consensus process. This is carried out by trusted actors (government departments or banks, for example), which makes maintaining a shared record much simpler than the consensus process used by unpermissioned ledgers. Permissioned block chains provide highly-verifiable data sets because the consensus process creates a digital signature, which can be seen by all parties. Requiring many government departments to validate a record could give a high degree of confidence in the record’s security, for example, in contrast to the current situation where departments often have to share data using pieces of paper. A permissioned ledger is usually faster than an unpermissioned ledger.


  1. Distributed ledgers are a type of database that is spread across multiple sites, countries or institutions, and is typically public. Records are stored one after the other in a continuous ledger, rather than sorted into blocks, but they can only be added when the participants reach a quorum.

    A distributed ledger requires greater trust in the validators or operators of the ledger. For example, the global financial transactions system Ripple selects a list of validators (known as Unique Node Validators) from up to 200 known, unknown or partially known validators who are trusted not to collude in defrauding the actors in a transaction. This process provides a digital signature that is considered less censorship resistant than Bitcoin’s, but is significantly faster.
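
The idea that records can only be added once the participants reach a quorum can be sketched as follows. This is a toy illustration and not Ripple's actual protocol: the validator names, the `required` approval count and the `try_append` function are assumptions made for the example.

```python
def try_append(ledger, record, votes, validators, required):
    """Append a record only if enough trusted validators approve it.

    ledger     -- tuple of records accepted so far (never modified in place)
    votes      -- mapping of validator name to True/False
    validators -- names of the trusted validators; other votes are ignored
    required   -- assumed quorum size, as a number of approvals
    """
    approvals = sum(1 for name in validators if votes.get(name, False))
    if approvals >= required:
        return ledger + (record,), True
    return ledger, False


validators = ("bank_a", "bank_b", "bank_c")  # hypothetical trusted validators

ledger, accepted = try_append(
    (), "payment #1", {"bank_a": True, "bank_b": True, "bank_c": False},
    validators, required=2,
)
print(ledger, accepted)

# A vote from an unknown party does not count towards the quorum.
ledger, accepted = try_append(
    ledger, "payment #2", {"bank_a": True, "stranger": True},
    validators, required=2,
)
print(ledger, accepted)
```

Restricting the count to a known list of validators is what makes the process faster than open consensus, and also why it depends on trusting those validators not to collude.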

    1. A shared ledger is a term coined by Richard Brown, formerly of IBM and now Chief Technology Officer of the Distributed Ledger Group, which typically refers to any database and application that is shared by an industry or private consortium, or that is open to the public. It is the most generic and catch-all term for this group of technologies.

    A shared ledger may use a distributed ledger or block chain as its underlying database, but will often layer on permissions for different types of users. As such, ‘shared ledger’ represents a spectrum of possible ledger or database designs that are permissioned at some level. An industry’s shared ledger may have a limited number of fixed validators who are trusted to maintain the ledger, which can offer significant benefits.

    1. Smart contracts are contracts whose terms are recorded in a computer language instead of legal language. Smart contracts can be automatically executed by a computing system, such as a suitable distributed ledger system. The potential benefits of smart contracts include low contracting, enforcement, and compliance costs; consequently it becomes economically viable to form contracts over numerous low-value transactions. The potential risks include a reliance on the computing system that executes the contract. At this stage, the risks and benefits are largely theoretical because the technology of smart contracts is still in its infancy, and some time away from widespread deployment.
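
To make "terms recorded in a computer language" concrete, here is a minimal Python sketch of an escrow-style agreement. It is an illustration only: real smart contracts run on a ledger platform rather than as a standalone script, and the `EscrowContract` fields, the `settle` function and the day-number deadline are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscrowContract:
    """Terms of a hypothetical escrow agreement, fixed once created."""
    buyer: str
    seller: str
    amount: int        # in the smallest currency unit
    deadline_day: int  # last day on which delivery still counts


def settle(contract, delivered_on_day, current_day):
    """Apply the contract terms automatically.

    Returns (payee, amount) once the outcome is determined,
    or None while the contract is still open.
    """
    if delivered_on_day is not None and delivered_on_day <= contract.deadline_day:
        return (contract.seller, contract.amount)   # delivered in time: pay seller
    if current_day > contract.deadline_day:
        return (contract.buyer, contract.amount)    # deadline missed: refund buyer
    return None                                     # still waiting for delivery


contract = EscrowContract(buyer="alice", seller="bob", amount=100, deadline_day=10)
print(settle(contract, delivered_on_day=7, current_day=8))
print(settle(contract, delivered_on_day=None, current_day=11))
print(settle(contract, delivered_on_day=None, current_day=5))
```

The outcome follows mechanically from the recorded terms, which is where the low enforcement cost comes from; the same property means the parties are relying on the code and the system that executes it being correct.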



“Distributed ledgers have the potential to be radically disruptive. Their processing capability is real time, near tamper-proof and increasingly low-cost. They can be applied to a wide range of industries and services, such as financial services, real estate, healthcare and identity management. They can underpin other software- and hardware-based innovations such as smart contracts and the Internet of Things.”


“… their distributed consensual nature they may be perceived as threatening the role of trusted intermediaries in positions of control within traditionally hierarchical organisations such as banks and government departments.”