Friday, February 18, 2011

Visualizing a decision tree by coding Proc Arboretum


Decision trees (tree-based or recursive partitioning) dominate the top positions of recent data mining competitions. They are as easy to implement and explain as logistic regression, but usually bring more predictive power (higher AUC). Unlike SVMs, neural networks or random forests, a decision tree is quick to train and resource-efficient, which is a real blessing for big data. No wonder regression trees and classification trees are widely used in industry: thanks to Google’s application of them in Gmail, I am seldom harassed by spam.

Documentation on Proc Arboretum is still scarce. In my experience, Proc Arboretum is robust and powerful. It accepts input variables at three measurement levels: nominal, ordinal and interval. It allows users to prune the tree interactively. It also generates a number of statistics about the partitioning criteria. And it supports an integrated training-validation-scoring flow and can even output scoring code. Overall, it satisfies my wildest dreams about decision trees. However, since it is one of the pillars of SAS Enterprise Miner, SAS Institute is probably reluctant to disclose more details of this procedure to those who have the license but would rather do the hard coding themselves. Without Enterprise Miner, SAS programmers can hardly draw a physical tree. Some resort to R instead, because R’s ‘rpart’ package is now stable enough for production use and provides convenient functions to display trees.

SAS’s plotting procedures can visualize the results of Proc Arboretum. In this example, I again used the SASHELP.CARS data set to explore whether a decision tree can recognize the origin of a car (Asia, Europe or USA). With an ancient procedure, Proc Netdraw, I built a not-so-good-looking tree. With the higher-level SG plotting procedures, I displayed some deeper information from the Proc Arboretum results, such as the significance of the variables and the prediction accuracy.

Reference: The ARBORETUM Procedure. 'www.sasenterpriseminer.com/documents/proc_arbor.pdf'.

********(1) CONSTRUCT DECISION TREE AND OUTPUT DATASETS********;
filename outcode 'h:\outcode.txt';
proc arboretum data=sashelp.cars ;
    target origin / level=nominal;
    input MSRP Cylinders Length  Wheelbase MPG_City 
    MPG_Highway Invoice Weight Horsepower/ level=interval;
    input EngineSize/level=ordinal;
    input  DriveTrain Type /level=nominal;
    code file=outcode;
    save   IMPORTANCE=imp1 MODEL=model1  NODESTATS=nodstat1  
    RULES=rul1 SEQUENCE=seq1  STATSBYNODE= statb1 SUM=sum1;
run;
quit;
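
****(1.1) OPTIONAL SKETCH: APPLY THE GENERATED SCORING CODE****;
/*NOTE: THIS SUB-STEP IS AN ILLUSTRATIVE ADDITION AND WAS NOT PART OF THE
  TESTED FLOW. IT ASSUMES THE CODE STATEMENT ABOVE WROTE DATA STEP SCORING
  CODE TO THE FILEREF OUTCODE. THE DATA SET NAME SCORED IS HYPOTHETICAL.
  THE NAMES OF THE PREDICTION VARIABLES CREATED BY THE GENERATED CODE ARE
  NOT ASSUMED HERE, SO PROC CONTENTS IS USED TO LIST THEM.*/
data scored;
    set sashelp.cars;
    /*RUN THE TREE RULES AGAINST EVERY OBSERVATION*/
    %include outcode;
run;

/*LIST THE VARIABLES IN CREATION ORDER; THE NEW ONES APPEAR LAST*/
proc contents data=scored varnum;
run;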
********END OF STEP(1)***********;

********(2) VISUALIZE DECISION TREE RESULTS************;
****(2.1) SIGNIFICANCE OF VARIABLES*****;
proc sgplot data=imp1;
    vbar name/response=importance;
run;

****(2.2) INTERACTION AMONG THE THREE MOST SIGNIFICANT VARIABLES****;
proc sgscatter data=sashelp.cars;
    plot  invoice*(wheelbase length)/group=origin;
run;

****(2.3) CONSTITUENTS OF EACH NODE****;
proc sgplot data=statb1;
    vbar node/response=STATVALUE group=CATEGORY;
run;

****(2.4) BUILD PHYSICAL TREE****;
proc sql;
    create table treedata as
    select a.parent as act1, a.node, b.NODETEXT, b.U_Origin
    from nodstat1 as a, nodstat1 as b
    where a.parent=b.node
    union
    select c.node as act1, . as node, c.nodetext, c.U_Origin
    from nodstat1 as c
;quit;

data treedata1;
    set treedata;
    if U_Origin='Asia' then _pattern=1;
    else if U_Origin='Europe' then _pattern=2;
    else  _pattern=3;
run;

pattern1  c=green; pattern2 v=s c=red;  pattern3 v=s c=blue; 
/*NOTE: USE PROC NETDRAW TO REALIZE PHYSICAL TREE*/
footnote   c=green   'Asia  '  c=red     'Europe '    c=blue    'USA'; 
proc netdraw data=treedata1 graphics; 
    actnet / activity=act1 successor=NODE id=(NODETEXT) tree compress
             rotate rotatetext font=simplex arrowhead=0 htext=6;
run;
footnote ' ';

****(2.5) SHOW ALL PARTITION STATISTICS *****;
proc transpose data=seq1 out=seq1_t(rename=(col1=value));
    var _ASSESS_  _MISC_ _MAX_  _SSE_  _ASE_;
    by _NW_  notsorted;
run; 
      
proc sgpanel data=seq1_t;
    panelby _name_/UNISCALE=column COLUMNS=4 rows=2 SPACING=5 NOVARNAME;
    step x=_NW_  y=value;
    colaxis TYPE= DISCRETE grid;
run; 

****(2.6) SHOW FINAL PREDICTION ACCURACY****;
proc sort data=sum1( drop=_total_) out=sum1_s;
    by _TARGET_;
    where _STAT_='N' AND _TARGET_ ^= 'TOTAL';
run;

proc transpose data=sum1_s out=sum1_t(rename=(col1=Number));
    var _numeric_;
    by _TARGET_;
run;

proc sgplot data=sum1_t;
    vbar _LABEL_/response=Number group=_TARGET_;
run;
********END OF STEP(2)*********;

*********END OF ALL CODING*****TESTED ON PC SAS 9.2 ***********;

2 comments:

  1. You should provide the dataset with the code.

    Replies
    1. The data set is shipped with SAS. It can be found at the HELP directory.
