Thursday, June 13, 2013

The US airports with most flight routes

The summer just begins and a lot of people start to experience flight delay. I am also interested in seeing which cites in the US are well connected by flights. The OpenFlights project provides free flight information. Then I used R to download the data and loaded them into a SQLite database, since I try to keep them as the persistent data. RSQLite facilities R to join and query tables in the database. Finally the flight routes and the airports were visualized by ggplot2 and ggmap.
RankIATA codeCityArriving flight routes
1ORDChicago494
2LAXLos Angeles438
3DENDenver401
4JFKNew York363
5ATLAtlanta339
6DFWDallas-Fort Worth281
7SFOSan Francisco259
8IAHHouston244
9MIAMiami244
10EWRNewark242
Wiki says that Atlanta is the busiest airport in the US according to total passenger boardings. However, from the number of the incoming flight routs, it only ranks 5th, following Chicago, Los Angeles, Denver and New York. Possibly Atlanta is the hub mostly for passengers to do connection. If somebody really loves air traveling, Chicago(with ORD) and New York(with both JFK and EWR) are the two most convenient cities to stay with, because they have the most options.
This post is inspired by one post on the blog Data Science and R
# Import libraries and set up directory
library(ggmap)
library(RSQLite)
setwd("C:/Google Drive/Codes")

# Read data directly from URLs
airport <- read.csv("http://openflights.svn.sourceforge.net/viewvc/openflights/openflights/data/airports.dat", header = F)
route <- read.csv("http://openflights.svn.sourceforge.net/viewvc/openflights/openflights/data/routes.dat", header = F)

# Remove the airports without IATA codes and rename the variables
airport <- airport[airport$V5!='', c('V3', 'V4', 'V5','V7','V8','V9')]
colnames(airport) <- c("City", "Country", "IATA", "lantitude", "longitude", "altitude")
route <- route[c('V3', 'V5')]
colnames(route) <- c("Departure", "Arrival")

# Store data to SQLite database
conn <- dbConnect("SQLite", dbname = "air.db")
dbSendQuery(conn, "drop table if exists airport;")
dbWriteTable(conn, "airport", airport)  
dbSendQuery(conn, "drop table if exists route;")
dbWriteTable(conn, "route", route) 
dbDisconnect(conn)

# Manipulate data in SQLite database
conn <- dbConnect("SQLite", dbname = "air.db")
sqlcmd01 <- dbSendQuery(conn, " 
  select a.type, a.city as iata, a.frequency, b.city, b.country, b.lantitude, b.longitude
  from (select  'depart' as type, departure as city, count(departure) as frequency
  from route
    group by departure
    order by frequency desc, type) as a 
  left join airport as b on a.city = b.iata
  order by frequency desc
;")
top <- combine <- fetch(sqlcmd01, n = -1)
sqlcmd02 <- dbSendQuery(conn, " 
  select route.rowid as id, route.departure as point, airport.lantitude as lantitude, airport.longitude as longitude
  from route left join airport on route.departure = airport.iata
  union 
  select route.rowid as id, route.arrival as point, airport.lantitude as lantitude, airport.longitude as longitude
  from route left join airport on route.arrival = airport.iata
  order by id
;")
combine <- fetch(sqlcmd02, n = -1)
dbDisconnect(conn)

# Draw the flight routes and the airports on Google map
ggmap(get_googlemap(center = 'us', zoom = 4, maptype = 'roadmap'), extent = 'device') +
geom_line(data = combine, aes(x = longitude, y = lantitude, group = id), size = 0.1,
          alpha = 0.05,color = 'red4') +
geom_point(data = top, aes(x = longitude, y = lantitude, size = frequency), colour = "blue", alpha = 0.3) +
scale_size(range=c(0,15))

Friday, June 7, 2013

Use Google Trends and SAS to select movies to watch


The newest success story about data science is Google search predicts box office with 94 percent accuracy. I am a frequent movie theater goer, and it will be great if we can implement Google's impressive research result.
There are quite a few offering for this summer. Now I am considering five incoming movies.
Title Date
This is the End Wednesday, June 12
World War Z Friday, June 21
Man of Steel Friday, June 14
Monsters University Friday, June 21
The Internship Friday, June 7

Google Trends reflects what keywords people are searching for, which is a reliable and free data source. Let's use SAS to do some scripting work to generate the URL query based on the get method.
data one;
    input @1 title $25.;
cards;
This is the End
World War Z
Man of Steel
Monsters University
The Internship 
;;;run;

data two(where=(missing(word)=0));
    set one nobs = nobs;
    if _n_ ne nobs then
    title1 = cat(title, "%2C");
    else title1 = title;
    do i = 1 to 10;
        word = scan(title1, i, " ");
        output;
    end;
    keep word;
run;

proc sql noprint;
    select word into: string separated by "%20"
    from two
;quit;

data three;
    length fullstring $500.;
    fullstring = cats("http://www.google.com/trends/explore?q=", "&string", '&geo=US&date=today%203-m&cmpt=q');
run;

proc print;
run;
From SAS, I print the resulting URL. Once I paste the url in a browser, the graphics clearly tells that the box office winners are going to be Man of Steel and World War Z. Finally my choice will be easier. I will surely not miss the two hottest movies.


Wednesday, February 6, 2013

Some differences of the data frames between R and Pandas

Pandas is an emerging open source framework on Python and a substitute to R. Both apply a data structure called DataFrame. Although their data frames look quite similar, there are some cautions for a R programmer who like to play Pandas. A few examples are listed in the table below.
  • Pandas mostly uses the syntax of a.b.c(), while R uses the nesting parenthesis like a(b(c())). The dot separation actually has no sense in R.
  • Python's indexes start from zero, while R's indexes begin with one. The first element in a data frame of Pandas is mydata.ix[0, 0], while the counterpart in R is mydata[1, 1].
  • Pandas has to use DataFrame's *ix property to allow two-way indexing, while R has no such limit.
PurposeRPandas
view datahead(mydata, 3)mydata.head(3)
summary statisticssummary(mydata)mydata.describe()
transpose datat(mydata)mydata.T
sort by two variablesmydata[order(A, -B)]mydata.sort(['A', 'B'], ascending=[1, 0])
select the first elementmydata[1, 1]mydata.ix[0, 0]
select a column by labelmydata['A']mydata['A']
select first 3 rowsmydata[1:3,]mydata[0:3]
select first 3 columnsmydata[, 1:3]mydata.ix[:, 0:3]
select by a variablemydata[mydata$A > 10,]mydata[mydata.A > 10]
select by all variablesmydata[mydata > 10,]mydata[mydata > 10]

Wednesday, January 16, 2013

Make all SAS tables sortable in the output HTML



Most procedures in SAS generate stylish tables in the output HTML files. An option to sort the tables without bothering SAS again will be very handy for people who are familiar with Excel.

The JavaScript libary jQuery has a number of plug-ins. The tablesorter is one of them, which realizes the column-wise sorting. We only need to add several lines of JavaScript below into the very beginning of a SAS output HTML file. Then all tables in the HTML files are able to be sorted easily by clicking on the table head. For example, the resulting HTML from some SAS codes such as proc print data = sashelp.class; run; with the jQuery plug-in will have sortable effect above.

<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.9.0/jquery.min.js"></script>
<script src="http://cdn.jsdelivr.net/tablesorter/2.0.5b/jquery.tablesorter.min.js"></script>
<script>
$(document).ready(function( ) {    
   $('.table').tablesorter({widgets: ['zebra']});
});
</script>
How to do it
First we open the HTML file by any text editor such as Windows Notepad, and the head lines of the file by the SAS command proc print data = sashelp.class; run;  will be like --


The second thing is to copy and paste the 7 lines of JavaSctipt above to the HTML file, just under the title tag. Then all is just done. All SAS tables can be sorted now as long as you have Internet connection. No server or other software is needed.

Another way to do it is to use a SAS macro. Andrew has a nice demo on his blog for implementing it.

Monday, January 14, 2013

SAS and D3.js (3): draw a map to display US cities' murder rates









Robert Allison has given an excellent map example by SAS/GRAPH about the murder rates per 100,000 people in the major US cities. To emulate this map, we should annotate the cities and use the size of the bubbles to represent the murder rates.

Comparing with the GRAPH module in SAS, the SVG based map on a webpage may have some advantages: first the maps will not lose details when zoomed out; second the JSON-formatted raw map data can be easily created to reflected the latest change; third with some more JavaScript attached the map on the HTML file can be empowered with dynamic effects.

1. Input raw data and transform SAS d to JSON 
I only choose the top 20 most violent cities, since I have to manually enter the coordinates data including longitude and latitude. Then I transform them to JSON format. To label all cities from the EXCEL file, a better way is to use R's ggmap package to automate the process of fetching the coordinates.
data top20;
input @1 City $11.    @14 State $2.    @17 Murder_rate @22 rank
@26 coordinates $50.;
cards;
New York     NY    471     1   -74.0059731,40.7143528
Chicago      IL    458     2   -87.6297982,41.8781136
Detroit      MI    363     3   -83.0457538,42.331427 
Los Angeles  CA    312     4   -118.2436849,34.0522342
Philadelphi  PA    302     5   -75.163789,39.952335 
Houston      TX    287     6   -95.3693896,29.7601927
Baltimore    MD    238     7   -76.6121893,39.2903848
New Orleans  LA    174     8   -90.0715323,29.9510658
Dallas       TX    166     9   -96.8004511,32.7801399
Washington   DC    144     10  -77.0363658,38.8951118
Saint Louis  MO    143     11  -90.296630859375, 38.74337300148123
Memphis      TN    132     12  -90.0489801, 35.1495343
Phoenix      AZ    122     13  -112.0740373, 33.4483771
Las Vegas    NV    111     14  -115.172816, 36.114646
Oakland      CA    104     15  -122.2711137, 37.8043637
Kansas City  MO    100     16  -94.5785667, 39.0997265
San Antonio  TX     99     17  -98.4936282, 29.4241219
Indianapolis IN     99     18  -86.1579557, 39.7685825
Jacksonville FL     99     19  -81.655651, 30.3321838
Cleveland    OH     83     20  -81.6954088, 41.4994954
;;;
run;

data top20_json(keep=string);
    set top20 nobs = nobs;
    length string $300.;
    _coordinates = cats('"coordinates":[', coordinates, ']},');
    _properties = cats('"properties":{"name":"', city, '","murder_rate":', Murder_rate, '}},');
    string = cats('{"type":"Feature","geometry":{"type":"Point",',_coordinates, _properties);
    if _n_ = nobs then substr(string, length(string), 1) = ' ';
run;
2. Draw the map and label the cities
D3.js provides the basic us-states.json data. The only thing that needs to do is to attache the cities' murder rate data to the HTML codes. A modern browser such as Chrome will compile the codes and generate the physical map.