Sunday, April 6, 2014

10 popular Linux commands for Hadoop

The Hadoop system has its own shell language, called FS. Compared with the common Bash shell in the Linux ecosystem, the FS shell has far fewer commands. To deal with the humongous amount of data distributed across the Hadoop nodes, in my practice I rely on 10 popular Linux commands to facilitate my daily work.
1. sort
A good practice when running Hadoop is to always test the map/reduce programs on the local machine before releasing the time-consuming map/reduce code to the cluster environment. The sort command simulates the sort and shuffle step of the map/reduce process. For example, I can run the piped commands below to verify whether the Python code has any bugs.
./mapper.py | sort | ./reducer.py
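As an illustration (the word-count logic below is a sketch of my own, not the actual mapper and reducer behind this post), the two scripts could look like this:
#!/usr/bin/env python
# mapper.py -- a minimal word-count mapper: emit a tab-separated (word, 1) pair per token
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%d' % (word, 1))

#!/usr/bin/env python
# reducer.py -- the matching reducer: input arrives sorted by key,
# so counts can be accumulated for one word at a time
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))
With a small sample file (sample.txt here is just a placeholder name), the local test then becomes cat sample.txt | ./mapper.py | sort | ./reducer.py.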
2. tail
Interestingly, the FS shell in Hadoop only supports the tail command and not the head command, so I can only grab the last kilobyte of a file stored on Hadoop.
hadoop fs -tail data/web.log.9
3. sed
Since the FS shell doesn’t provide the head command, the alternative solution is to use the sed command, which actually has more flexible options.
hadoop fs -cat data/web.log.9 | sed '1,+5!d'
4. stat
The stat command tells me when a file was last modified.
hadoop fs -stat data/web.log.9
5. awk
The commands that the FS shell supports usually have very few options. For example, the du command in the FS shell does not support the -sh option to aggregate the disk usage of the sub-directories. In this case, I have to turn to the awk command to satisfy my need.
hadoop fs -du data | awk '{sum+=$1} END {print sum}'
6. wc
One of the most basic things to understand about a file stored on Hadoop is its total number of lines.
hadoop fs -cat data/web.log.9 | wc -l
7. cut
The cut command is convenient for selecting specified columns from a file. For example, I am able to count the lines for each unique group based on the characters in positions 5 through 14.
hadoop fs -cat data/web.log.9 | cut -c 5-14 | uniq -c
8. getmerge
The great thing about the getmerge command is that I can fetch the entire map/reduce result onto the local file system as a single file.
hadoop fs -getmerge result result_merged.txt
9. grep
With the grep command from the Bash shell I can start a mapper-only job to search for the lines that contain the keywords I am interested in; since nothing needs to be aggregated, the number of reduce tasks is set to zero.
hadoop jar $STREAMING -D mapred.reduce.tasks=0 -input data -output result -mapper "bash -c 'grep -e Texas'"
10. at and crontab
The at and crontab commands are my favorites for scheduling jobs around Hadoop. For example, I can use the commands below to clean up the map/reduce results in the early morning.
at 0212
at > hadoop fs -rmr result

Thursday, March 27, 2014

SAS vs. Python for data analysis

To perform data analysis efficiently, I need a full-stack programming language rather than frequently switching from one language to another. That means this language can hold a large quantity of data, manipulate data promptly and easily (e.g. if-then-else logic, iteration), connect to various data sources such as relational databases and Hadoop, apply statistical models, and report the results as graphs, tables, or web pages. SAS is famous for its capacity to support such a full data cycle, as long as you are willing to pay the annual license fee.
SAS’s long-standing competitor, R, keeps growing. However, in the past few years the Python community has launched a crazy movement to port R’s jewels and ideas to Python, which has resulted in a few solid packages such as pandas and ggplot. With the rapid accumulation of data-related tools in Python, I feel more comfortable working with data in Python than in R, partly because I have a bias that Python’s interpreter is more stable than R’s when dealing with data, and sometimes I just want to escape from R’s idiosyncratic syntax such as x<-4 or foo.bar.2000=10.

Actually there is no real competition between SAS and R at all: the two dwell in parallel universes and rely on distinct ecosystems. SAS, Python, Bash, and Perl process data row-wise, which means they input and output data line by line. R, Matlab, SAS/IML, Python/pandas, and SQL manipulate data column-wise. The data size for row-wise packages such as SAS is bounded by the hard disk, at the cost of slower speed due to disk I/O. On the contrary, the column-wise packages, including R, are memory-bound, with the much faster speed brought by memory.
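To make the contrast concrete, here is a small sketch of my own (the file sales.csv and its columns are made up for illustration): the first snippet streams the file row by row, so memory use stays flat no matter how big the file is, while the pandas version pulls a whole column into memory before operating on it.
# Row-wise: read one record at a time, keeping only the running total in memory
total = 0.0
with open('sales.csv') as f:
    next(f)                          # skip the header line
    for line in f:
        fields = line.rstrip('\n').split(',')
        total += float(fields[2])    # accumulate the third column
print(total)

# Column-wise: pandas loads the whole column into memory and operates on it at once
import pandas as pd
df = pd.read_csv('sales.csv')
print(df['amount'].sum())            # 'amount' is assumed to be the third column's name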
Let’s go back to the comparison between SAS and Python. For most of the parts I am familiar with in SAS, I can find equivalent modules in Python. The table below lists the similar components between SAS and Python.
| SAS | Python |
| --- | --- |
| DATA step | core Python |
| SAS/STAT | StatsModels |
| SAS/Graph | matplotlib |
| SAS Statistical Graphics | ggplot |
| PROC SQL | sqlite3 |
| SAS/IML | NumPy |
| SAS Windowing Environment | Qt Console for iPython |
| SAS Studio | iPython notebook |
| SAS In-Memory Analytics for Hadoop | Spark with Python |
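For instance, what PROC REG does in SAS/STAT roughly corresponds to an ordinary least squares fit in StatsModels. A hedged sketch, assuming the sashelp.class data set has been exported to a class.csv file with columns weight, height, and age:
import pandas as pd
import statsmodels.formula.api as smf

# load the exported data set and fit weight = height + age, as PROC REG would
class_df = pd.read_csv('class.csv')
model = smf.ols('weight ~ height + age', data=class_df).fit()
print(model.summary())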
This week SAS announced some promising products. Interestingly, they can be traced to similar implementations in Python. For example, SAS Studio, a fancy web-based IDE with code completion, starts an HTTP server on the local machine and uses a browser for coding, which is remarkably similar to the iPython notebook. Another example is SAS In-Memory Analytics for Hadoop. Given that the old MapReduce path for data analysis is painfully time-consuming and complicated, aggregating memory instead of hard disk across the many nodes of a Hadoop cluster is certainly faster and more interactive. Based on the same idea, Apache Spark, which fully supports Python scripting, has just been released for CDH 5.0. It will be interesting to compare Python’s and SAS’s in-memory abilities for data analysis on Hadoop.
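As a hedged sketch of what a Spark job in Python looks like (the file path and the 'Texas' filter are placeholders of my own, not from a real workload), a filter-and-count job can be written as:
from pyspark import SparkContext

# a quick local test; on a cluster the script would be launched through spark-submit
sc = SparkContext('local', 'weblog-filter')
lines = sc.textFile('data/web.log.9').cache()   # keep the RDD in memory after the first pass
print(lines.filter(lambda line: 'Texas' in line).count())
sc.stop()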
Until there is a new killer app for R, at least for now, Python steals R’s thunder as the open-source alternative to SAS.

Friday, March 21, 2014

Use iPython to clear the SAS HTML files

One trivial routine when working with PC SAS is clearing the many HTML files created by SAS, which usually occupy a lot of hard disk space. By default, Windows does not ship with a shell language to find all the files whose names share a prefix such as sashtml. I found that iPython is a convenient alternative to a shell for this kind of pattern matching, and I only need a few lines of code to get the job done.
%bookmark saswork c:\mydocument
# jump to the bookmarked folder (a plain !cd would not persist in iPython)
%cd -b saswork
# capture the dir listing, keep the 'sashtml' lines, and extract the file names
a = !dir
b = a.grep('sashtml').fields(4)
# expand the collected names into one del command
!del $b.s
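For the record, the same cleanup can also be done with plain Python instead of shell magics; a minimal sketch, where the directory path is just the one bookmarked above:
import glob
import os

# delete every file whose name starts with 'sashtml' under the SAS work folder
for path in glob.glob(r'c:\mydocument\sashtml*'):
    os.remove(path)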

Sunday, February 9, 2014

Sortable tables in SAS

This is an update of my previous post, Make all SAS tables sortable in the output HTML.
Previously I manually added the sortable plugin to the SAS output. With the PREHTML statement of PROC TEMPLATE, the sortable HTML template can now be saved automatically for future use.
/* 0 -- Create the sortable HTML template */
proc template;
    define style sortable;
    parent=styles.htmlblue; 
    style body from body /
        prehtml='
            <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.9.0/jquery.min.js"></script>
            <script src="http://cdn.jsdelivr.net/tablesorter/2.0.5b/jquery.tablesorter.min.js"></script>
            <script>
            $(document).ready(function( ) {    
            $(".table").tablesorter({widgets: ["zebra"]});
            });
            </script>
        ';
    end;
run;

/* 1 -- Make all the tables sortable */
ods html file = 'tmp.html' style = sortable;
proc reg data=sashelp.class;
    model weight = height age;
run;
proc print data=sashelp.class;
run;
ods html close;
When exploring the data, or whenever we want to change the order of SAS’s output tables, we only need to click the table headers, which is quite convenient.

Friday, January 3, 2014

Test drive for PROC HADOOP and Pig

PROC HADOOP has been available since SAS 9.3M2 and bridges a Windows client and a Hadoop server. The great thing about this procedure is that it supports user-defined functions (UDFs). There are several steps to apply this procedure.
  1. Download Java SE and Eclipse on Windows
    Java SE and Eclipse are free to download. Installation is also fairly easy.
  2. Make user-defined function on Windows
    The most basic user-defined function is an upper-case function for strings that wraps Java’s native str.toUpperCase() method. Pig’s manual has a [detailed description][1] of it.
  3. Package the function as JAR
    There is a wonderful video tutorial on YouTube. Make sure that the version of the [Pig API][2], with a name such as pig-0.12.0.jar, on Windows is the same as the one running on the Hadoop cluster.
  4. Run PROC HADOOP commands
    # pig_code
    A = load 'test3.txt' as (f1: chararray, f2: chararray, f3: chararray, f4: chararray, f5: chararray);
    describe A;
    register myudfs.jar;
    B = foreach A generate myudfs.UPPER(f3);
    dump B;
    Then we can run the SAS code below with PROC HADOOP. Subsequently the field f3 of the text file on HDFS is capitalized.
    filename cfg "C:\tmp\config.xml";
    filename code "C:\tmp\pig_code.txt";
    proc hadoop options=cfg username="myname" password="mypwd" verbose;
    pig code=code registerjar="C:\tmp\myudfs.jar";
    run;