“Magnetic, Agile, Deep” (MAD) approach to data




MAD Skills: New Analysis Practices for Big Data (link updated to VLDB version).  The paper does a few controversial things (if you’re the kind of person who finds data management a source of controversy):

  • It takes on “data warehousing” and “business intelligence” as outmoded, low-tech approaches to getting value out of Big Data. Instead, it advocates a “Magnetic, Agile, Deep” (MAD) approach to data that shifts the locus of power from what Brian Dolan calls the “DBA priesthood” to the statisticians and analysts who actually like to crunch the numbers.  This is a good thing, on many fronts.
  • It describes a state-of-the-art parallel data warehouse that sits on 800TB of disk, using 40 dual-processor dual-core Sun Thumper boxes.
  • It presents a set of general-purpose, hardcore, massively parallel statistical methods for big data.  They’re expressed in SQL (OMG!) but could be easily translated to MapReduce if that’s your bag.
  • It argues for a catholic (small-c) approach to programming Big Data, including SQL & MapReduce, Java & R, Python & Perl, etc.  If you already have a parallel database, it just shouldn’t be that hard to support all those things in a single engine.
  • It advocates a similarly catholic approach to storage.  Use your parallel filesystem, or your traditional database tables, or your compressed columnstore formats, or what have you.  These should not be standalone “technologies”, they are great features that should — no, will — get added to existing parallel data systems.  (C’mon, you know it’s true… )
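
To make the "SQL today, MapReduce tomorrow" point a bit more concrete, here is a minimal sketch of the general idea: a statistical fit written as an additive aggregate, so each partition can be summarized on its own and the summaries merged afterwards. This is not code from the paper or from Greenplum. The function names (`partial_sums`, `combine`, `fit_line`) and the toy data are assumptions made up for illustration, and it fits only a one-variable least-squares line.

```python
from functools import reduce


def partial_sums(rows):
    """Map step: summarize one partition of (x, y) rows as additive sums."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def combine(a, b):
    """Reduce step: merge two partial summaries by element-wise addition."""
    return tuple(p + q for p, q in zip(a, b))


def fit_line(partitions):
    """Fit y = slope * x + intercept from the merged sums of all partitions."""
    zero = (0.0, 0.0, 0.0, 0.0, 0.0)
    n, sx, sy, sxx, sxy = reduce(combine, map(partial_sums, partitions), zero)
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("need at least two rows with distinct x values")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical data: three "segments", each holding part of the table.
partitions = [[(0, 1), (1, 3)], [(2, 5)], [(3, 7), (4, 9)]]
slope, intercept = fit_line(partitions)
```

The same five sums could be written as `SUM(...)` aggregates in a single SQL query, which is why a method expressed this way moves fairly easily between a parallel database and a MapReduce job.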

I started to write the paper because it was just too cool what Brian Dolan was doing with Greenplum at Fox Interactive Media (parent company of MySpace.com): for example, writing Support Vector Machines in SQL and running them over dozens of TB of data.  Brian was a great sport about taking his real-world experience and good ideas and putting them down on paper for others to read.  Along the way I learned a lot about the data architecture ideas he’s been cooking up with Mark Dunlap, which are a real thumb in the eye of the warehouse orthodoxy, and make eminently good sense in today’s world.  Finally, it was nice to get to write about the good things that Jeff Cohen and Caleb Welton have been doing at Greenplum to cut through the hype and shrink the distance between SQL and MapReduce.  I’m hoping those guys will have time to sit down one of these days and patiently write up how they’ve done it … it’s really very elegant.

And it still warms my heart that it’s Postgres code underneath all that.  Time to resurrect the xfunc code!


One Response

  1. Please see MADlib, the scalable analytics library that resulted from this work: http://madlib.net
