Wednesday, September 24, 2014

One example of test-driven development in Python

A function or method is the most basic unit in Python programming. Test-driven development (TDD) is key for a developer to assure the code quality of those units. In his book, Harry Percival illustrates a few great examples of how to use TDD with Python and Django. It seems that for web development, TDD, including unit testing and integration testing, is the cornerstone of every success. For data analysis, coding mostly relies on built-in packages instead of large frameworks like Django, which makes TDD easier. In my opinion, TDD in data analysis has three steps.
  • Step 1: requirement analysis
    Before writing any code for data analysis, the programmer should seriously ask the customer or himself about the requirements.
    • What are the input parameters?
    • What if the input data doesn’t fit the assumptions?
    • What is the purpose of this function or method? What are the desired outputs?
  • For example, there is a recent coding challenge called Maximum Product Subarray.
      > Find the contiguous subarray within an array (containing at least one number) which has the largest product.
      For example, given the array [2,3,-2,4],
      the contiguous subarray [2,3] has the largest product = 6.
    
  • OK, understanding this question is quite straightforward. Given an array (or a list in Python), return the integer that is the maximum product of a contiguous subarray of the input array.
      def maxProduct(A):
          """ A function to find the maximum product value for a continuous subarray.
          :param A: an array or list
          :type A: list
          """
          if A is None or not isinstance(A, list):
              return None
          if len(A) == 1:
              return A[0]
          pass
    
    • A production version of the code above should look more like:
      class FunctionInputError(Exception):
        pass
      
      def maxProduct(A):
        """ A function to find the maximum product value for a continuous subarray.
        :param A: an array or list
        :type A: list
        """
        if A is None or not isinstance(A, list):
            raise FunctionInputError('must give a list as input')
        if len(A) == 1:
            return A[0]
        pass
      
  • Step 2: write test cases
    Given that not a single line of logic code has been written yet, I call the current step black-box testing, meaning that I want this function to fail every test case. Python has a built-in module, doctest, which allows embedding the test cases within the docstring. I write six test cases, run the pseudo-function below and arbitrarily specify the result to be -1. As expected, it fails all six test cases with horrible red warnings under the Python shell. That is a good thing: it proves that the testing works.
      import doctest
      def maxProduct(A):
          """ A function to find the maximum product value for a continuous subarray.
          :param A: an array or list
          :type A: list
    
          - testcase1
              >>> maxProduct([0, 2, 3,-2,4, 1, -1])
              48
    
          - testcase2
              >>> maxProduct([2, 3, 0, 2, 4, 0, 3, 4])
              12
    
          - testcase3
              >>> maxProduct([0, 5, 3, -1, -2, 0, -2, 4, 0, 3, 4])
              30
    
          - testcase4
              >>> maxProduct([0, 1, -3, -4, -5, -1, 1, 2, 1, -1, -100, 0, -100000])
              12000
    
          - testcase5
              >>> maxProduct([0, -3000, -2, 0, -100, -100, 0, -9, -8, 1, 1, 2])
              10000
    
          - testcase6
              >>> maxProduct([0, -2, 0])
              0
          """
          if A is None or not isinstance(A, list):
              return None
          if len(A) == 1:
              return A[0]
          return -1
    
      doctest.testmod()
    
  • Step 3: implement the logic
    It’s time to tackle the most difficult part: writing the real function. Think about time complexity (it is best to use only one pass over the input array, which means O(n)) and space complexity (it is best not to use extra space). Run testmod() again and again to find mistakes and modify the code accordingly. Finally I come up with a solution with a helper shadow function _maxProduct, and it passes the six test cases. Although I am not sure that this function does not have any bug, at least it works now.
      import doctest
      from sys import maxint
    
      def maxProduct(A):
          """ A function to find the maximum product value for a continuous subarray.
          :param A: an array or list
          :type A: list
    
          - testcase1
              >>> maxProduct([0, 2, 3,-2,4, 1, -1])
              48
    
          - testcase2
              >>> maxProduct([2, 3, 0, 2, 4, 0, 3, 4])
              12
    
          - testcase3
              >>> maxProduct([0, 5, 3, -1, -2, 0, -2, 4, 0, 3, 4])
              30
    
          - testcase4
              >>> maxProduct([0, 1, -3, -4, -5, -1, 1, 2, 1, -1, -100, 0, -100000])
              12000
    
          - testcase5
              >>> maxProduct([0, -3000, -2, 0, -100, -100, 0, -9, -8, 1, 1, 2])
              10000
    
          - testcase6
              >>> maxProduct([0, -2, 0])
              0
          """
          if A is None or not isinstance(A, list):
              return None
          if len(A) == 1:
              return A[0]
          return max(_maxProduct(A), _maxProduct([a for a in reversed(A)]))
    
      def _maxProduct(A):
          max_val_forward = 1
          rst = -maxint
          for a in A:
              if a != 0:
                  max_val_forward *= a
                  rst = max(rst, max_val_forward)
              else:
                  rst = max(0, rst)
                  max_val_forward = 1
          return rst
    
      if __name__ == "__main__":
          doctest.testmod()
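    Since the test suite is the heart of this workflow, the docstring can keep growing even after the implementation passes. One hypothetical extra case (not among the original six) is a list of all negative numbers, where the answer 12 comes from the reversed pass; calling doctest.testmod(verbose=True) also prints every example as it runs.
          - testcase7 (a hypothetical extra case)
              >>> maxProduct([-2, -3, -4])
              12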
    
In conclusion, the most important part of TDD in data analysis is writing the test cases, which really takes a lot of training and practice.

Sunday, August 10, 2014

Translate SAS's sas7bdat format to SQLite and Pandas

Thanks to Jared Hobbs’ sas7bdat package, Python can read SAS’s data sets quickly and precisely. It would be great to have a few extension functions that enhance this package with SQLite and Pandas; a minimal sketch follows the two lists below.
The good things about transferring SAS libraries to SQLite:
  1. Size reduction:
    SAS’s sas7bdat format is verbose. So far I have successfully loaded 40GB of SAS data into SQLite with an 85% reduction in disk usage.
  2. Saving the cost of SAS/ACCESS:
    SAS/ACCESS costs around $8,000 a year for a server, while SQLite is accessible from most common software.
The good things about transferring a SAS data set to Pandas:
  1. Pandas’ powerful Excel interface:
    It writes very large Excel files quickly, as long as memory can hold the data.
  2. Validation of statistics:
    Pandas works well with statsmodels and scikit-learn, which makes it easy to validate SAS’s outputs.
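A minimal sketch of such extension functions, assuming a hypothetical input file sales.sas7bdat and that the sas7bdat and pandas packages are installed:

import sqlite3
import pandas as pd
from sas7bdat import SAS7BDAT

# Read the SAS data set into a pandas DataFrame (sales.sas7bdat is a made-up name)
with SAS7BDAT('sales.sas7bdat') as f:
    df = f.to_data_frame()

# Dump the DataFrame into SQLite to shrink the disk footprint
# (depending on the pandas version, to_sql may prefer a SQLAlchemy engine)
conn = sqlite3.connect('sales.db')
df.to_sql('sales', conn, if_exists='replace', index=False)
conn.close()

# Or push it through Pandas' Excel interface
df.to_excel('sales.xlsx', index=False)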

Tuesday, June 10, 2014

Remove tabs from SAS code files

By default, SAS records indentation with the tab character when the Tab key is pressed, which causes many problems when the code files are used in a different environment. There are actually two ways to eliminate the tab character in SAS and replace it with spaces; a small Python sketch for batch conversion follows the list.
  • Regular expression
    Press Ctrl + H → the Replace window pops up → choose Regular expression search → in the Find text box, enter \t → in the Replace box, enter multiple spaces, say four

  • Editor option
    Click Tools → Options → Enhanced Editor… → check Insert spaces for tabs → check Replace tabs with spaces on file open
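Outside the SAS editor (so not one of the two SAS-side ways above), a few lines of Python can also batch-convert existing .sas files. A minimal sketch, assuming the files sit in the current directory:

import glob

# Replace every tab with four spaces in each .sas file of the current directory
for path in glob.glob('*.sas'):
    with open(path) as f:
        code = f.read()
    with open(path, 'w') as f:
        f.write(code.replace('\t', ' ' * 4))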

Monday, May 26, 2014

Steps to deploy Flask's minitwit on Google App Engine

Flask is a light-weight web framework for Python, which is well documented and clearly written. Its GitHub repository provides a few examples, including minitwit. The minitwit website has a few basic social-network features such as following and login/logout. The demo site on GAE is http://minitwit-123.appspot.com. The GitHub repo is https://github.com/dapangmao/minitwit.
Google App Engine, or GAE, is a major public cloud service besides Amazon EC2. Among the four languages it supports (Java/Python/Go/PHP), GAE is friendly to Python users, possibly because Guido van Rossum worked there and personally created the Python datastore interface. For me, it is a good choice for a Flask app.

Step1: download GAE SDK and GAE Flask skeleton

GAE’s Python SDK tests the staging app and eventually pushes the app to the cloud.
A Flask skeleton can be downloaded from the Google Developer Console. It contains three files:
  • app.yaml: specifies the runtime and the entry point of the app
  • appengine_config.py: adds external libraries such as Flask to the system path
  • main.py: the root Python program

Step2: schema design

The database used by the original minitwit is SQLite. The schema consists of three tables: user, follower and message, which together make a normalized database. GAE has two Datastore APIs: DB and NDB. Since neither of them supports joins (in this case a one-to-many join from user to follower), I move the follower table into the user table as a repeated property, which eliminates the need for joining.
As a result, main.py has two data models: User and Message. They create and maintain two kinds (or, as we would call them elsewhere, tables) with the same names in Datastore.
class User(ndb.Model):
  username = ndb.StringProperty(required=True)
  email = ndb.StringProperty(required=True)
  pw_hash = ndb.StringProperty(required=True)
  following = ndb.IntegerProperty(repeated=True)
  start_date = ndb.DateTimeProperty(auto_now_add=True)

class Message(ndb.Model):
  author = ndb.IntegerProperty(required=True)
  text = ndb.TextProperty(required=True)
  pub_date = ndb.DateTimeProperty(auto_now_add=True)
  email = ndb.StringProperty(required=True)
  username = ndb.StringProperty(required=True)
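
To create a row, instantiate a model and call put(). A minimal sketch with made-up values:

from google.appengine.ext import ndb

# Saving an entity creates the corresponding kind on first use
u = User(username='alice', email='alice@example.com', pw_hash='some-hash')
key = u.put()                            # returns an ndb.Key for later lookups
print User.get_by_id(key.id()).username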

Step3: replace SQL statements

The next step is to replace the SQL operations in each of the routing functions with NDB’s methods. NDB’s two fundamental methods are get(), which retrieves entities from Datastore, and put(), which writes an entity back to Datastore. In short, data is created and manipulated as individual objects.
For example, if a follower needs to be added to a user, I first retrieve the user by its ID, which returns an entity with the properties username, email, pw_hash, following and start_date, where following itself is a list. Then I insert the new follower into the following list and save the entity back.
u = User.get_by_id(cid)
if u.following is None:
  u.following = [whom_id]
  u.put()
else:
  u.following.append(whom_id)
  u.put()
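Queries replace SELECT statements in a similar object-style way. A minimal sketch (not taken from the original app), assuming the Message model defined above:

# Roughly the NDB counterpart of
# SELECT * FROM message ORDER BY pub_date DESC LIMIT 30
messages = Message.query().order(-Message.pub_date).fetch(30)
for m in messages:
    print m.username, m.text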
People with experience in an ORM such as SQLAlchemy will be comfortable implementing these changes.

Step 4: testing and deployment

Without the schema file, minitwit is now a real single-file web app. It’s time to use the GAE SDK to test it locally and eventually push it to the cloud. On GAE, we can check any error or warning through the Logs tab to find bugs, or view the raw data through the Datastore Viewer tab.
In conclusion, GAE has a few advantages and disadvantages to work with Flask as a web app.
  • Pro:
    • It allows up to 25 free apps (great for exercises)
    • Use of database is free
    • Automatic memcache for high IO
  • Con:
    • The database is NoSQL, which makes the data hard to port
    • More expensive for production than EC2

Wednesday, May 21, 2014

Use recursion and gradient ascent to solve logistic regression in Python

In his book Machine Learning in Action, Peter Harrington provides a solution for parameter estimation of logistic regression. I use pandas and ggplot to build a recursive alternative. Compared with the iterative method, the recursion costs more space but may improve performance.
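In matrix notation, the update rule that grad_ascent below implements is the standard gradient ascent step for the log-likelihood, where X is the design matrix, y the labels, alpha the learning rate and X' the transpose:

w_new = w_old + alpha * X' * (y - sigmoid(X * w_old))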
# -*- coding: utf-8 -*-
"""
Use recursion and gradient ascent to solve logistic regression in Python
"""

import pandas as pd
from numpy import exp, shape, ones, arange, array, mat  # numerical helpers used below
from ggplot import *

def sigmoid(inX):
    return 1.0/(1+exp(-inX))

def grad_ascent(dataMatrix, labelMat, cycle):
    """
    A function to use gradient ascent to calculate the coefficients
    """
    if not isinstance(cycle, int) or cycle < 0:
        raise ValueError("Must be a valid value for the number of iterations")
    m, n = shape(dataMatrix)
    alpha = 0.001
    if cycle == 0:
        return ones((n, 1))
    else:
        weights = grad_ascent(dataMatrix, labelMat, cycle-1)
        h = sigmoid(dataMatrix * weights)
        errors = (labelMat - h)
        return weights + alpha * dataMatrix.transpose()* errors

def plot(vector):
    """
    A function to use ggplot to visualize the result
    """
    x = arange(-3, 3, 0.1)
    y = (-vector[0]-vector[1]*x) / vector[2]
    new = pd.DataFrame()
    new['x'] = x
    new['y'] = array(y).flatten()
    infile.classlab = infile.classlab.astype(str)
    p = ggplot(aes(x='x', y='y', colour='classlab'), data=infile) + geom_point()
    return p + geom_line()  # geom_line must be instantiated before being added

# Use pandas to manipulate data
if __name__ == '__main__':
    infile = pd.read_csv("https://raw.githubusercontent.com/pbharrin/machinelearninginaction/master/Ch05/testSet.txt", sep='\t', header=None, names=['x', 'y', 'classlab'])
    infile['one'] = 1
    mat1 = mat(infile[['one', 'x', 'y']])
    mat2 = mat(infile['classlab']).transpose()
    result1 = grad_ascent(mat1, mat2, 500)
    print plot(result1)

Wednesday, April 30, 2014

Count large chunk of data in Python

The line-by-line feature in Python allows it to count hard-disk-bound data. The most frequently used data structures in Python are the list and the dictionary. In many cases the dictionary has the advantage, since it is basically a hash table that provides O(1) operations.
However, for the task of counting values, the two options make little difference, and we can choose either for convenience. I list two examples below.

Use a dictionary as a counter

There is a question about counting strings in Excel:
Count the unique values in one column in EXCEL 2010. The worksheet has 1 million rows and 10 columns of strings or numbers.
For example,
A5389579_10
A1543848_6
A5389579_8
The part after (and including) the underscore needs to be cut off, so that A5389579_10 becomes A5389579.
Commonly, Excel on a desktop can’t handle this size of data, while Python easily handles the job.
# Load the Excel file by the xlrd package
import xlrd
book = xlrd.open_workbook("test.xlsx")
sh = book.sheet_by_index(0)
print sh.name, sh.nrows, sh.ncols
print "Cell D30 is", sh.cell_value(rowx=29, colx=3)

# Count the unique values in a dictionary
c = {}
for rx in range(sh.nrows):
    word = str(sh.row(rx)[1].value).split('_')[0]  # cut off the part after the underscore
    try:
        c[word] += 1
    except KeyError:
        c[word] = 1

print c
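
For what it’s worth, the standard library’s collections.Counter does the same bookkeeping in two lines. A minimal sketch under the same assumptions (the IDs sit in column B of test.xlsx and are cut at the underscore):

import xlrd
from collections import Counter

book = xlrd.open_workbook("test.xlsx")
sh = book.sheet_by_index(0)

# Counter accepts any iterable and tallies the unique prefixes
c = Counter(str(sh.row(rx)[1].value).split('_')[0] for rx in range(sh.nrows))
print c.most_common(10)  # the ten most frequent IDs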

Use a list as a counter

There is a question about counting emails:
A 3-column data set includes sender, receiver and timestamp. How do we calculate the time between the sender sending the email
and the receiver sending the reply?
The challenge is to scale the small sample data up to a larger size. The solution I have has a complexity of O(n log n), which is limited only by the sorting step.
raw_data = """
    SENDER|RECEIVER|TIMESTAMP
    A B 56
    A A 7
    A C 5
    C D 9
    B B 12
    B A 8
    F G 12
    B A 18
    G F 2
    A B 20
    """

# Transform the raw data to a nested list
data = raw_data.split()
data.pop(0) # Remove the Head
data = zip(data[0::3], data[1::3], map(lambda x: int(x), data[2::3]))

# Sort the nested list by the timestamp 
from operator import itemgetter
data.sort(key=itemgetter(2))
for r in data:
    print r

# Count the time difference in a list
c = []
while len(data) != 1:
    y = data.pop(0)
    for x in data:
        if x[0] == y[1] and x[1] == y[0]:
            diff = x[2] - y[2] 
            print y, x, '---->', diff
            c.append(diff)
            break # Only find the quickest time to respond
print c

P.S.

I come up with the O(n) solution below, which utilizes two hash tables to decrease the complexity.
__author__ = 'dapangmao'

def find_duration(data):
    # Construct two hash tables
    h1 = {}
    h2 = {}
    # Find the earliest sending time for each sender-receiver pair
    for x in data:
        if x[0] != x[1]:
            key = x[0] + x[1]
            if key in h1:
                h1[key] = min(h1[key], x[2])
            else:
                h1[key] = x[2]
    # Find the minimum duration between a message and its reply
    for x in data:
        key = x[1] + x[0]
        if key in h1:
            duration = x[2] - h1[key]
            if key in h2:
                h2[key] = min(h2[key], duration)
            else:
                h2[key] = duration
    return h2

if __name__ == "__main__":
    raw_data = """
        SENDER|RECEIVER|TIMESTAMP
        A B 56
        A A 7
        A C 5
        C D 9
        B B 12
        B A 8
        F G 12
        B A 18
        G F 2
        A B 20
        """

    # Transform the raw data to a nested list
    data = raw_data.split()
    data.pop(0) # Remove the Head
    data = zip(data[0::3], data[1::3], map(lambda x: int(x), data[2::3]))
    # Verify the result
    print find_duration(data)

Sunday, April 6, 2014

10 popular Linux commands for Hadoop

The Hadoop system has its own shell language, which is called FS. Compared with the common Bash shell within the Linux ecosystem, the FS shell has far fewer commands. To deal with the humongous amount of data stored distributively across the Hadoop nodes, in my practice I have 10 favorite Linux commands that facilitate my daily work.
1. sort
A good practice when running Hadoop is to always test the map/reduce programs on the local machine before releasing the time-consuming map/reduce code to the cluster environment. The sort command simulates the sort-and-shuffle step of the map/reduce process. For example, I can run the piped commands below to verify whether the Python code has any bugs (a minimal mapper/reducer pair is sketched at the end of this post).
./mapper.py | sort | ./reducer.py
2. tail
Interestingly, the FS shell at Hadoop only supports the tail command, not the head command, so I can only grab the last kilobyte of a file stored in Hadoop.
hadoop fs -tail data/web.log.9
3. sed
Since the FS shell doesn’t provide the head command, the alternative is to use the sed command, which actually has more flexible options.
hadoop fs -cat data/web.log.9 | sed '1,+5!d'
4. stat
The stat command tells me when the file was last modified.
hadoop fs -stat data/web.log.9
5. awk
The commands that the FS shell supports usually have very few options. For example, the du command under the FS shell does not support the -sh option to aggregate the disk usage of the subdirectories. In this case, I have to look to the awk command to satisfy my need.
hadoop fs -du data | awk '{sum+=$1} END {print sum}'
6. wc
One of the most important things in understanding a file located on Hadoop is to find its total number of lines.
hadoop fs -cat data/web.log.9 | wc -l
7. cut
The cut command is convenient for selecting specific columns of a file. For example, I am able to count the lines for each of the unique groups formed by character positions 5 through 14.
hadoop fs -cat data/web.log.9 | cut -c 5-14 | uniq -c
8. getmerge
The great thing about the getmerge command is that I can fetch all the results after map/reduce to the local file system as a single file.
hadoop fs -getmerge result result_merged.txt
9. grep
I can start a mapper-only job with just the grep command from the Bash shell to search for the lines that contain the keywords I am interested in.
hadoop jar $STREAMING -D mapred.reduce.tasks=0 -input data -output result -mapper "bash -c 'grep -e Texas'"
10. at and crontab
The at and crontab commands are my favorites for scheduling a job on Hadoop. For example, I can use the commands below to clean up the map/reduce results during the night.
at 0212
at > hadoop fs -rmr result
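
As a companion to command #1, here is a minimal, hypothetical word-count mapper/reducer pair for Hadoop streaming (not from the original post) that can be tested locally with the sort pipeline above.

#!/usr/bin/env python
# mapper.py: emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print '%s\t1' % word

#!/usr/bin/env python
# reducer.py: sum the counts for each word, relying on sorted input
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t')
    if word != current:
        if current is not None:
            print '%s\t%d' % (current, total)
        current, total = word, 0
    total += int(count)
if current is not None:
    print '%s\t%d' % (current, total)

Locally, cat sample.txt | ./mapper.py | sort | ./reducer.py reproduces the sort-and-shuffle step before the job ever touches the cluster.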

Thursday, March 27, 2014

SAS vs. Python for data analysis

To perform data analysis efficiently, I need a full-stack programming language rather than frequently switching from one language to another. That means the language should be able to hold a large quantity of data, manipulate data promptly and easily (e.g. if-then-else logic, iteration), connect to various data sources such as relational databases and Hadoop, apply statistical models, and report results as graphs, tables or web pages. SAS is famous for its capacity to realize such a data cycle, as long as you are willing to pay the annual license fee.
SAS’s long-standing competitor, R, keeps growing. However, in the past few years the Python community has launched a crazy movement to port R’s jewels and ideas to Python, which has resulted in a few solid applications such as pandas and ggplot. With the rapid accumulation of data-related tools in Python, I feel more comfortable working with data in Python than in R, because I have a bias that Python’s interpreter is more stable than R’s when dealing with data, and sometimes I just want to escape from R’s idiosyncratic syntax such as x<-4 or foo.bar.2000=10.

Actually there is no competition between SAS and R at all: the two dwell in parallel universes and rely on distinct ecosystems. SAS, Python, Bash and Perl process data row-wise, which means they input and output data line by line. R, Matlab, SAS/IML, Python/pandas and SQL manipulate data column-wise. Row-wise packages such as SAS are hard-disk-bound, at the cost of the low speed of the hard disk. On the contrary, column-wise packages including R are memory-bound, with the much faster speed brought by memory.
Let’s go back to the comparison between SAS and Python. For most of the parts I am familiar with in SAS, I can find the equivalent modules in Python. I created the table below to list the similar components between SAS and Python.
SAS                                     Python
DATA step                               core Python
SAS/STAT                                StatsModels
SAS/Graph                               matplotlib
SAS Statistical Graphics                ggplot
PROC SQL                                sqlite3
SAS/IML                                 NumPy
SAS Windowing Environment               Qt Console for IPython
SAS Studio                              IPython notebook
SAS In-Memory Analytics for Hadoop      Spark with Python
This week SAS announced some promising products. Interestingly, they can be traced to some of Python’s similar implementations. For example, SAS Studio, a fancy web-based IDE with code completion, opens an HTML server on the local machine and uses a browser for coding, which is amazingly similar to the IPython notebook. Another example is SAS In-Memory Analytics for Hadoop. Given that the old MapReduce path for data analysis is painfully time-consuming and complicated, aggregating memory instead of hard disk across the many nodes of a Hadoop cluster is certainly faster and more interactive. Based on the same idea, Apache Spark, which fully supports Python scripting, has just been released to CDH 5.0. It will be interesting to compare Python’s and SAS’s in-memory abilities for data analysis on Hadoop.
Before there is a new killer app for R, at least for now, Python steals R’s thunder as the open-source alternative to SAS.