
Value of Actuate products and services

 

Actuate Corporation has prepared this quick read document to highlight Actuate-related findings of the 2013 “Business Intelligence Market Study,” which is part of the “Wisdom of Crowds®” series of market research reports published by Dresner Advisory Services. Actuate’s well-established customer base and BIRT revenue streams have landed the company a spot in the “Large Established Pure-Play Vendor” category. The company earned improved customer rankings this year across virtually all measures. 

Value of Actuate products and services
  • 9 out of 10 Actuate customers responding to the survey reported they would recommend the company’s products and services.
  • Actuate won “best in class” of all 23 top industry vendors for Product: “Reliability of Technology” and “Customization and Extensibility.”
  • In the category of “Quality of Technical Support,” Actuate outperformed every vendor in the study.
  • Actuate’s scores exceeded both peer and overall performance in the vast majority of measures within the categories of “Sales,” “Value,” “Technical Support,” and “Consulting.”
  • Actuate surpassed both peer and overall scores in the majority of measures in the category of “Quality and Usefulness of Product.”

General study findings
  • Three technologies related to BI increased in importance over 2012: Software as a Service (SaaS) & Cloud Computing (Cloud BI); Dashboards; and Mobile Device Support.
  • In 2013, the top technologies ranked in the study are Dashboards, End User “Self Service”, and Advanced Visualization and Data Warehousing.
  • Executive Management and Finance are most likely to drive BI initiatives, while HR and Manufacturing are least likely.
  • Across all market sectors, “better decision making” is the most popular objective for BI solutions, with “improved operational efficiency” coming in second.
  • BI adoption continues to be most prevalent in both the smallest and largest organizations. However, all sizes of organizations (including mid-size) have ambitious plans for BI through 2014 – mainly focused on extending BI capabilities to a larger number of users.

 

https://bimonitor.files.wordpress.com/2014/01/2013-executive-summary-dresner-woc-bi-market.pdf


Six categories of Data Scientists

We are now at 8 categories after a few updates. Just as there are a few categories of statisticians (biostatisticians, statisticians, econometricians, operations research specialists, actuaries) or business analysts (marketing-oriented, product-oriented, finance-oriented, etc.), there are different categories of data scientists. First, many data scientists have a job title other than data scientist; mine, for instance, is co-founder. Check the “related articles” section below to discover 400 potential job titles for data scientists.

Categories of data scientists

  • Those strong in statistics: they sometimes develop new statistical theories for big data that even traditional statisticians are not aware of.
  • Those strong in data engineering, Hadoop, database/memory/file systems optimization and architecture, API’s, Analytics as a Service, optimization of data flows, data plumbing.
  • Those strong in machine learning / computer science (algorithms, computational complexity)
  • Those strong in business, ROI optimization, decision sciences, involved in some of the tasks traditionally performed by business analysts in bigger companies (dashboards design, metric mix selection and metric definitions, ROI optimization, high-level database design)
  • Those strong in production code development, software engineering (they know a few programming languages)
  • Those strong in visualization
  • Those strong in GIS, spatial data, data modeled by graphs, graph databases
  • Those strong in a few of the above. After 20 years of experience across many industries, big and small companies (and lots of training), I’m strong in stats, machine learning, and business, and more than just familiar with visualization and data engineering. This could happen to you as well over time, as you build experience. I mention this because so many people still think that it is not possible to develop a strong knowledge base across multiple domains that are traditionally perceived as separate (the silo mentality). Indeed, that’s the very reason why data science was created.

Most of them are familiar with, or expert in, big data.

There are other ways to categorize data scientists; see for instance our article on Taxonomy of data scientists. A different categorization would be creative versus mundane. The “creative” category has a better future, as mundane work can be outsourced (anything published in textbooks or on the web can be automated or outsourced – job security is based on how much you know that no one else knows or can easily learn).

Implications for other IT professionals

You (the engineer or business analyst) probably already do a bit of data science work and already know some of what data scientists do. It might be easier than you think to become a data scientist. Check out our book (listed below in “related articles”) to find out what you already know and what you need to learn to broaden your career prospects.

Are data scientists a threat to your job or career? Again, check our book (listed below) to find out what data scientists do, whether the risk to you is serious (you = the business analyst, data engineer, or statistician; risk = being replaced by a data scientist who does everything), and how to mitigate that risk (learn some of the data scientist skills from our book if you perceive data scientists as competitors).

 

Data Scientist versus Statistician

Here are a few examples of problems tackled by data scientists rather than by traditional statisticians:

  • Automated bidding systems
  • Estimating (in real time) the value of all houses in the United States (Zillow.com)
  • High-frequency trading
  • Matching a Google Ad with a user and a web page to maximize chances of conversion
  • Returning highly relevant results to any Google search
  • Book and friend recommendations on Amazon.com or Facebook
  • Tax fraud detection and detection of terrorism
  • Scoring all credit card transactions (fraud detection)
  • Computational chemistry to simulate new molecules for cancer treatment
  • Early detection of a pandemic
  • Analyzing NASA pictures to find new planets or asteroids
  • Weather forecasts
  • Automated piloting (planes and cars)
  • Client-customized pricing system (in real time) for all hotel rooms

The problems cover astronomy, fraud detection, social network analytics, search engines, finance (transaction scoring), environment, drug development, trading, engineering, pricing optimization (retail), energy (smart grids), bidding and arbitrage systems.

All this involves both statistical science and terabytes of data. Most people doing this stuff do not call themselves statisticians. They call themselves data scientists.

Statisticians have been gathering data and performing linear regressions for several centuries. DAD (discover / access / distill) performed by statisticians 300 years ago, 20 years ago, today, or in 2015 for that matter, has little to do with DAD performed by data scientists today. The key message here is that eventually, as more statisticians pick up on these new skills and more data scientists pick up on statistical science (sampling, experimental design, confidence intervals – not just the ones described in chapter 5 in our book), the frontier between data scientists and statisticians will blur. Indeed, I can see a new category of data scientists emerging: data scientists with strong statistical knowledge, just as we already have a category of data scientists with significant engineering experience (Hadoop).

Also, what makes data scientists different from computer scientists is that they have a much stronger statistics background, especially in computational statistics, but sometimes also in experimental design, sampling, and Monte Carlo simulations.

Data Scientist versus Data Architect

I recently had a number of discussions with data architects in different communities, in particular (but not limited to) the TDWI group on LinkedIn. This is a summary of those discussions, featuring differences between data scientists and data architects, and how the two can work together.

It shows some of the challenges that still need to be addressed before this new analytics revolution is complete. Following are several questions asked by data architects and database administrators, and my answers. The discussion is about optimizing joins in SQL queries, or just moving away from SQL altogether. Several modern databases now offer many of the features discussed here, including hash table joins and fine-tuning the query optimizer by end users. The discussion illustrates the conflicts between data scientists, data architects, and also business analysts. It also touches on many innovative concepts.

Question: You say that one of the bottlenecks with SQL is users writing queries with (say) three joins, when these queries could be split into two queries each with two joins. Can you elaborate?

Answer: Typically, the way I write SQL code is to embed it into a programming language such as Python, and store all lookup tables that I need as hash tables in memory. So I rarely have to do a join, and when I do, it’s just two tables at most.
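
As a rough illustration of that workflow (the table and column names below are made up for the sketch), a small lookup table is fetched once into a Python dictionary, so the query against the fact table needs no join at all:

    import sqlite3  # any DB-API driver works the same way; sqlite3 keeps the sketch self-contained

    conn = sqlite3.connect("warehouse.db")  # hypothetical local database

    # Pull the small lookup table once and keep it in memory as a hash table (dict).
    country_name = dict(
        conn.execute("SELECT country_code, country_name FROM dim_country")
    )

    # The main (fact) query stays join-free; the "join" happens in Python.
    for user_id, country_code, revenue in conn.execute(
        "SELECT user_id, country_code, revenue FROM fact_sales"
    ):
        country = country_name.get(country_code, "UNKNOWN")
        # ... downstream analytics on (user_id, country, revenue) ...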

In some (rare) cases in which lookup tables were too big to fit in memory, I used sampling methods and worked with subsets and aggregation rules. A typical example is when a field in your data set (web log files) is a user agent (browser, abbreviated as UA). You have more unique UAs than can fit in memory, but as long as you keep the 10 million most popular, and aggregate the 200,000,000 rare UAs into a few million categories (based on UA string), you get good results in most applications.
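
A minimal sketch of such an aggregation rule (the function names and the bucketing logic are hypothetical): the most popular user agents are kept verbatim, while the long tail is folded into coarse categories derived from the UA string.

    from collections import Counter

    def coarse_ua_category(ua):
        """Very rough bucketing of a user-agent string (illustrative only)."""
        ua = ua.lower()
        for token in ("chrome", "firefox", "safari", "msie", "bot", "mobile"):
            if token in ua:
                return "other:" + token
        return "other:unknown"

    def compress_user_agents(ua_counts, keep_top=10_000_000):
        """Keep the most popular UAs as-is; fold the rare ones into coarse buckets."""
        compressed = Counter()
        for rank, (ua, count) in enumerate(ua_counts.most_common()):
            key = ua if rank < keep_top else coarse_ua_category(ua)
            compressed[key] += count
        return compressed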

As an algorithm expert (not an SQL expert), I can do an efficient four-table join via hash tables in Python in a couple of minutes (using my own script templates). Most of what I do is advanced analytics, not database aggregation: advanced algorithms that are nevertheless simple to code in Python, such as hidden decision trees. Anyway, my point here is more about non-expert SQL users such as business analysts: Is it easier or more effective to train them to write better SQL code, including sophisticated joins, or to train them to learn Python and blend it with SQL code?
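
Here is a minimal version of such a hash-table join, assuming tab-separated extracts with hypothetical file and column names (three small dimension tables joined against one large fact file):

    import csv

    def load_lookup(path, key_col):
        """Load a tab-separated lookup table into a dict keyed on key_col."""
        with open(path, newline="") as f:
            return {row[key_col]: row for row in csv.DictReader(f, delimiter="\t")}

    users = load_lookup("users.tsv", "user_id")
    products = load_lookup("products.tsv", "product_id")
    merchants = load_lookup("merchants.tsv", "merchant_id")

    with open("transactions.tsv", newline="") as f:
        for txn in csv.DictReader(f, delimiter="\t"):
            # The equivalent of a four-table join, done as three dict lookups per row.
            u = users.get(txn["user_id"])
            p = products.get(txn["product_id"])
            m = merchants.get(txn["merchant_id"])
            if u and p and m:
                pass  # ... scoring / analytics on the joined record ...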

To be more specific, what I have in mind is a system in which you download the lookup tables infrequently (maybe once a week) and access the main (fact) table more often. If you must re-download the lookup tables very frequently, the Python approach loses its efficiency, and you make your colleagues unhappy because your frequent downloads slow down the whole system.

Question: People like you (running Python or Perl scripts to access databases) are a DBA’s worst nightmare. Don’t you think you are a source of problems for DBAs?

Answer: Because I’m much better at Python and Perl than SQL, my Python or Perl code is bug-free, easy-to-read, easy-to-maintain, optimized, robust, and re-usable. If I coded everything in SQL, it would be much less efficient. Most of what I do is algorithms and analytics (machine learning stuff), not querying databases. I only occasionally download lookup tables onto my local machine (saved as hash tables and stored as text files), since most don’t change that much from week to week. When I need to update them, I just extract the new rows that have been added since my last update (based on time stamp). And I do some tests before running an intensive SQL script to get an idea of how much time and resources it will consume, and to see whether I can do better. I am an SQL user, just like any statistician or business analyst, not an SQL developer.
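
One way to picture that incremental refresh (the table, column, and file names below are hypothetical): keep the local copy as a text file, together with the time stamp of the last update, and pull only the newer rows through an open database connection.

    import json
    from pathlib import Path

    CACHE = Path("ua_lookup.json")     # local copy of the lookup table (text file)
    STATE = Path("ua_lookup.last_ts")  # time stamp of the last refresh

    def refresh_lookup(conn):
        """Pull only rows added since the last refresh and merge them into the cache.

        conn is any open DB-API connection (sqlite3, psycopg2, etc.).
        """
        lookup = json.loads(CACHE.read_text()) if CACHE.exists() else {}
        last_ts = STATE.read_text().strip() if STATE.exists() else "1970-01-01 00:00:00"

        max_ts = last_ts
        for ua_id, ua_string, created_at in conn.execute(
            "SELECT ua_id, ua_string, created_at FROM dim_user_agent WHERE created_at > ?",
            (last_ts,),
        ):
            lookup[str(ua_id)] = ua_string
            # Assumes ISO-style text time stamps, which sort correctly as strings.
            max_ts = max(max_ts, created_at)

        CACHE.write_text(json.dumps(lookup))
        STATE.write_text(max_ts)
        return lookup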

But I agree we need to find a balance to minimize data transfers and processes, possibly by having better analytic tools available where the data resides. At a minimum, we need the ability to easily deploy Python code there in non-production tables and systems, and be allocated a decent amount of disk space (maybe 200 GB) and memory (at least several GB).

Question: What are your coding preferences?

Answer: Some people feel more comfortable using a scripting language rather than SQL. SQL can be perceived as less flexible and prone to errors, producing wrong output without anyone noticing due to a bug in the joins.

You can write simple Perl code, which is easy to read and maintain. Perl enables you to focus on the algorithms rather than the code and syntax. Unfortunately, many Perl programmers write obscure code, which creates a bad reputation for Perl (code maintenance and portability). But this does not have to be the case.

You can break down a complex join into several smaller joins using multiple SQL statements and views. You would assume that the DB engine would digest your not-so-efficient SQL code and turn it into something much more efficient. At least you can test this approach and see if it works as fast as one single complex query with many joins. Breaking down multiple joins into several simple statements allows business analysts to write simple SQL code, which is easy for fellow programmers to reuse or maintain.
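
As a sketch (the schema below is invented for the example), a multi-join query can be staged as a chain of views, each with a single easy-to-read join; whether the optimizer really makes the two forms equally fast is exactly what you would test:

    import sqlite3

    conn = sqlite3.connect("warehouse.db")  # hypothetical schema, for illustration only

    # Instead of one statement with several joins, stage the work in two simple views.
    conn.executescript("""
        CREATE VIEW IF NOT EXISTS v_sales_users AS
        SELECT s.sale_id, s.product_id, s.revenue, u.country_code
        FROM fact_sales s
        JOIN dim_user u ON u.user_id = s.user_id;

        CREATE VIEW IF NOT EXISTS v_sales_enriched AS
        SELECT v.sale_id, v.revenue, v.country_code, p.category
        FROM v_sales_users v
        JOIN dim_product p ON p.product_id = v.product_id;
    """)

    # Business analysts then query the final view with plain, simple SQL.
    for country_code, category, total in conn.execute(
        "SELECT country_code, category, SUM(revenue) "
        "FROM v_sales_enriched GROUP BY country_code, category"
    ):
        print(country_code, category, total)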

It would be interesting to see software that automatically corrects SQL syntax errors (not SQL logic errors). It would save lots of time for non-expert SQL coders like me, since the same typos that tend to occur over and over could be fixed automatically. In the meantime, you can use GUIs to produce decent SQL code, using tools provided by most database vendors or open-source projects, such as Toad for Oracle.

Question: Why do you claim that these built-in SQL optimizers are usually black-box technology for end users? Do you think parameters can’t be fine-tuned by the end user?

Answer: I always like to have a bit of control over what I do, though not necessarily a whole lot. For instance, I’m satisfied with the way Perl handles hash tables and memory allocation. I’d rather use Perl’s black-box memory allocation and hash table management than create it myself from scratch in C or, even worse, write a compiler. I’m just a bit worried about black-box optimization: I’ve seen the damage created by non-expert users who recklessly used black-box statistical software. I’d feel more comfortable if I had at least a bit of control, even something as simple as sending an email to the DBA, having her look at my concern or issue, and having her help improve my queries, maybe fine-tuning the optimizer if deemed necessary and beneficial for the organization and for other users.

Question: Don’t you think your approach is 20 years old?

Answer: The results are more important than the approach, as long as the process is reproducible. If I can beat my competitors (or help my clients do so) with whatever tools I use, then, as the saying goes, “if it ain’t broke, don’t fix it.” Sometimes I use APIs (for example, Google APIs), sometimes I use external data collected with a web crawler, sometimes Excel or Cubes are good enough, and sometimes vision combined with analytic acumen and intuition (without using any data) works well. Sometimes I use statistical models, and other times a very modern architecture is needed. Many times, I use a combination of these. I have several examples of “light analytics” doing better than sophisticated architectures.

Question: Why did you ask whether your data-to-analytic approach makes sense?

Answer: The reason I asked is that something has been bothering me, based on fairly recent observations (3-4 years old): the practices I mention are well entrenched in the analytic community (by analytic, I mean machine learning, statistics, and data mining, not ETL). It is also an attempt to see whether it is possible to build a better bridge between two very different communities: data scientists and data architects. Database builders often (but not always) need the data scientist to bring insights and value out of organized data. And data scientists often (but not always) need the data architect to build great, fast, efficient data processing systems so they can better focus on analytics.

Question: So you are essentially maintaining a cache system with regular, small updates to a local copy of the lookup tables. Two users like you doing the same thing would end up with two different copies after some time. How do you handle that?

Answer: You are correct that two users having two different copies (caches) of the lookup tables causes problems. In my case, though, I tend to share my cache with other people, so it’s not as if five people are working on five different versions of the lookup tables. Although I am a senior data scientist, I am also a designer/architect, though not a DB designer/architect, so I tend to have my own local architecture that I share with a team. Sometimes my architecture is stored in a small local DB and occasionally on the production databases, but many times it consists of organized flat files or hash tables stored on local drives, or somewhere in the cloud outside the DB servers, though usually not very far away if the data is big. Many times, my important “tables” are summarized extracts: either simple aggregates that are easy to produce with pure SQL, or sophisticated ones, such as transaction scores (by client, day, merchant, or more granular), produced by algorithms too complex to be efficiently coded in SQL.
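
A toy example of such a summarized extract (the file and column names are hypothetical): raw transactions are rolled up into a small per-merchant, per-day file that the team can share instead of re-querying the fact table.

    import csv
    from collections import defaultdict

    daily_totals = defaultdict(float)

    # Roll up the raw fact file into a small summary: total amount per merchant and day.
    with open("transactions.tsv", newline="") as f:
        for txn in csv.DictReader(f, delimiter="\t"):
            daily_totals[(txn["merchant_id"], txn["date"])] += float(txn["amount"])

    # Store the extract as a flat file; this becomes the shared "cache" for the team.
    with open("merchant_daily_summary.tsv", "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["merchant_id", "date", "total_amount"])
        for (merchant_id, date), total in sorted(daily_totals.items()):
            writer.writerow([merchant_id, date, round(total, 2)])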

The benefit of my “caching” system is to minimize time-consuming data transfers that penalize everybody in the system. The drawback is that I need to maintain it, and essentially, I am replicating a process already in place in the database system itself.

Finally, for a statistician working on data that is almost correct (not the most recent version of the lookup table, but data stored in this “cache” system and updated rather infrequently), or working on samples, this is not an issue. The collected data is usually an approximation of the signal we try to capture and measure; it is always messy. The same can be said about predictive models: the ROI extracted from a very decent dataset (my “cache”), from the exact, most recent version of the dataset, or from a version with 5 percent of noise artificially injected into it is pretty much the same in most circumstances.

Question: Can you comment on code maintenance and readability?

Answer: Consider the issue of code maintenance: when someone writing obscure SQL leaves the company, or worse, when the SQL is ported to a different environment (not SQL), it is a nightmare for the new programmers to understand the original code. If easy-to-read SQL (maybe with more statements and fewer elaborate high-dimensional joins) runs just as fast as one complex statement because of the internal, user-transparent query optimizer, why not use the easy-to-read code instead? After all, the optimizer is supposed to make both approaches equivalent, right? In other words, if two pieces of code (one short and obscure; one longer and easy to read, maintain, and port) have the same efficiency because they are essentially turned into the same pseudo-code by the optimizer, I would favor the longer version that takes less time to write, debug, maintain, and so on.

There might be a market for a product that turns ugly, obscure, yet efficient code into nice, easy-to-read SQL: an “SQL beautifier.” It would be useful when migrating code to a different platform. Something like this already exists to some extent, in that you can easily visualize any query or set of queries in most DB systems with diagrams. The SQL beautifier would be in some ways similar to a program that translates Assembler into C++; in short, a reverse compiler or interpreter.