Thursday, March 27, 2014

SAS vs. Python for data analysis

To perform data analysis efficiently, I need a full-stack programming language rather than one that forces me to switch frequently between languages. Such a language should hold a large quantity of data, manipulate it promptly and easily (e.g., if-then-else logic and iteration), connect to various data sources such as relational databases and Hadoop, apply statistical models, and report the results as graphs, tables, or web pages. SAS is famous for its capacity to cover this entire data cycle, as long as you are willing to pay the annual license fee.
SAS's long-standing competitor, R, keeps growing. However, in the past few years the Python community has launched an energetic movement to port R's jewels and ideas to Python, which has resulted in solid applications such as pandas and ggplot. With the rapid accumulation of data-related tools in Python, I feel more comfortable working with data in Python than in R: I have a bias that Python's interpreter is steadier than R's when dealing with data, and sometimes I just want to escape from R's idiosyncratic syntax, such as x <- 4.

Actually there is no competition between SAS and R at all: the two dwell in parallel universes and rely on distinct ecosystems. SAS, Python, Bash, and Perl process data row-wise, which means they input and output data line by line. R, MATLAB, SAS/IML, Python/pandas, and SQL manipulate data column-wise. Row-wise packages such as SAS are hard-disk-bound: the data size is limited only by disk space, at the cost of slow disk I/O. On the contrary, column-wise packages such as R are memory-bound: they enjoy the much faster speed of RAM, but the data must fit in it.
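The contrast above can be sketched in a few lines of Python. The row-wise style streams one record at a time, so memory use stays constant no matter how large the file is; the column-wise style (the approach pandas or R would take) materializes a whole column in memory first and then operates on it as a vector. The CSV content and column names here are invented for illustration.

```python
import csv
import io

# A tiny CSV standing in for a data file (made up for illustration).
raw = io.StringIO("id,amount\n1,10.0\n2,2.5\n3,7.5\n")

# Row-wise (SAS DATA step style): read and process one record at a time.
# Only the current row is ever held in memory.
total = 0.0
for row in csv.DictReader(raw):
    total += float(row["amount"])

# Column-wise (R / pandas style): pull the entire column into RAM first,
# then operate on it as a whole.
raw.seek(0)
amounts = [float(r["amount"]) for r in csv.DictReader(raw)]
col_total = sum(amounts)

print(total, col_total)  # both approaches give 20.0
```

The results are identical; the difference is that the row-wise loop could process a file far larger than RAM, while the column-wise list must fit in memory but supports fast vectorized operations.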
Let's go back to the comparison between SAS and Python. For most of the SAS features I am familiar with, I can find equivalent modules in Python. The table below lists the corresponding components of the two.
SAS                                   Python
----------------------------------    ----------------------
DATA step                             core Python
SAS Statistical Graphics              ggplot
PROC SQL                              sqlite3
SAS Windowing Environment             Qt Console for IPython
SAS Studio                            IPython notebook
SAS In-Memory Analytics for Hadoop    Spark with Python
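As one instance of the table above, Python's built-in sqlite3 module covers much of what PROC SQL does: create a table, load rows, and aggregate with a query. The table and column names below are invented for illustration.

```python
import sqlite3

# An in-memory database; in PROC SQL the source would be a SAS data set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 150.0), ("east", 50.0)])

# Equivalent in spirit to:
#   PROC SQL; SELECT region, SUM(amount) FROM sales GROUP BY region; QUIT;
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

print(rows)  # [('east', 150.0), ('west', 150.0)]
conn.close()
```

The same DB-API interface extends beyond SQLite: with third-party drivers, the identical execute/fetchall pattern talks to Oracle, DB2, SQL Server, and others.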
This week SAS announced some promising products. Interestingly, they can be traced to similar implementations in Python. For example, SAS Studio, a fancy web-based IDE with code completion, opens an HTML server on the local machine and lets you code in a browser, which is amazingly similar to the IPython notebook. Another example is SAS In-Memory Analytics for Hadoop. Given that the old MapReduce path for data analysis is painfully time-consuming and complicated, aggregating memory instead of hard disk across the many nodes of a Hadoop cluster is certainly faster and more interactive. Based on the same idea, Apache Spark, which fully supports Python scripting, has just been released for CDH 5.0. It will be interesting to compare Python's and SAS's in-memory abilities for data analysis on Hadoop.
Until there is a new killer app for R, at least for now, Python steals R's thunder as the open-source alternative to SAS.


  1. One minor correction. You are correct that the matrices in R and MATLAB store data columnwise. R data frames are also stored columnwise. In contrast, SAS/IML reads, writes, and stores data rowwise, just like the rest of SAS. However, SAS/IML also supplies a convenient syntax to extract, transform, and manipulate data columnwise, just like MATLAB and R supply syntax to access rows of the data. R, MATLAB, and SAS/IML all store the entire data in RAM; that characteristic is independent of how the data are stored.

    1. The TIMEDATA procedure in SAS/ETS also analyzes time-stamped data in column format and enables the use of custom functions defined with the FCMP procedure.

    2. Having worked for insurance companies (and thus with a domain strictly limited to insurance), I see SAS providing the following benefits:
      a) Connectivity to databases (DB2 on the mainframe, Oracle, SQL Server, etc.) and thus the ability to pull in huge datasets. A lot of people who work in analytics will tell you that SAS is a very good tool for cleansing data (data manipulation). How do R and Python compare with SAS in this respect? Can they "connect" to and download millions of rows across different databases?

      b) From my personal experience, not everybody uses cutting-edge statistical models/new algorithms/papers in their work. We call it RTB - Running the Business - wherein the most "complicated" model might be a simple GLM. For SAS users this might be PROC GENMOD, which is already available and does all the heavy lifting. For the most part, in my job role I am not trying to out-innovate anything; in fact, most large businesses are averse to "new" algorithms, simply because of skepticism regarding the impact on the underlying business. As an example, auto insurance premiums have been modeled using GLM for several years now. I had a proposal to implement Gradient Boosting, but it remained at that, a proposal. Unless I can show significant lifts over time across multiple regions, there is no way I am implementing anything new. From the above context/perspective, is switching to R or Python or a combination of Python+R (or a myriad of other combinations) worth it? Granted, our SAS license costs some money, but it is already budgeted and accounted for. I know this is a loaded question, and in some sense I struggle because of all the options that are available out there: Python, IPython, Anaconda, rpy/rpy2, pandas, numpy, Julia, R, ggplot2. I am mixing and matching indiscriminately, but in some sense that is my point: I don't have this problem of "over-flexibility" in SAS.

      Am I a dinosaur in the rapidly evolving analytics world ?

    3. Yes, you are a dinosaur in a way. I work for the same insurance industry segment, where a GLM could be the most complicated method one works with.
      The issue is that where the data is stored is changing very fast, and you want to analyze the data quickly where it resides (e.g., Hadoop repositories). SAS would not move a finger until the data comes into the SAS proprietary world.
      In short, where the data is stored, how it is managed, and where the insights are kept are all changing very fast. That should concern you more than the sophistication of the analytics itself.

  2. Charlie, could not agree more. I am hearing the same news re Python in academic circles in Canada. Are there scalable or in-memory open-source versions of Python and R? Most of the in-memory or scalable products I see are proprietary once a consumer moves beyond the level of a single computer, even for R (e.g., Revolution).