INTRODUCTION

PROC SQL is the implementation of the SQL
syntax in SAS. It first appeared in SAS 6.0, and since then has been very popular
for SAS users. SAS ships with
a few sample data sets in its HELP library, and SASHELP.CLASS is one of them.
This dataset contains 5 variables including name, weight, height, sex and age
for 19 simulated teenagers, and in this paper I primarily use it for the
demonstration purpose. Here I summarize the 10 interesting tricks and tips
using PROC SQL. At the beginning, I first make a copy of SASHELP.CLASS at the
WORK library and transform the row number of the data set to a new variable

*obs*.**data**class;

set
sashelp.class;

/* Give an index for
each child*/

obs = _n_;

**run**;

##
**1. Calculate the
median of a variable**

With
the aggregating HAVING clause and some self-join techniques, PROC SQL can easily
calculate the median for a variable.

**proc**

**sql**;

select
avg(weight) as Median

from
(select e.weight

from
class e, class d

group
by e.weight

having
sum(case when e.weight =
d.weight then

**1**else**0**end)
>= abs(sum(sign(e.weight -
d.weight))));

**quit**;

**2.**

**Draw a horizontal histogram**

A histogram visualizes the distribution pattern of a
variable. PROC SQL can draw a horizontal histogram by showing the frequency
bars with a few asterisks for each level of the variable

*age*.**proc**

**sql**;

select
age, repeat('*',count(*)*

**4**) as Frequency
from
class

group
by age

order
by age;

**quit**;

**3.**

**Return the running total for a variable**

A running total is the summation of a sequence
of numbers which is updated each time with the increase of the observations. In
the example below, I calculate the running total and save them as a new
variable

*Running_total*by the SUM function and a conditional statement, which logically is similar to an example in SAS/IML[1].**proc**

**sql**;

select
name, weight,

(select
sum(a.weight) from class as

a where
a.obs <= b.obs) as Running_total

from
class as b;

**quit**;

**4.**

**Report the total number for a variable**

PROC SQL is a flexible way to find the total number for any variable by
its set operator UNION and the SUM function. In the example, the total number
of the variable

*weight*is reported at the bottom of the output table.**proc**

**sql**;

select
name, weight

from
class

union
all

select
'Total', sum(weight)

from
class;

**quit**;

**5.**

**Retrieve the metadata for a data set**

SAS stores the metadata at
its DICTIONARY
data sets. PROC SQL can visit the directory,
retrieve the column detail, and return the information to the users.

**proc**

**sql**;

select
name, type, varnum

from
sashelp.vcolumn

where
libname = 'WORK' and
memname = 'CLASS';

**quit**;

**6.**

**Rank a variable**

Besides
the designated ranking procedure PROC RANK in SAS, PROC SQL can also do some
simple ranking as well.

**proc**

**sql**;

select
name, a.weight, (select count(distinct
b.weight)

from
class b

/* Rank by the
ascending order for the weight variable*/

where
b.weight <= a.weight) as rank

from
class a;

**quit**;

**7. Simple**

**random sampling**

PROC SQL is widely used in simple random
sampling. For example, I randomly choose 8 observations by the OUTOBS option at
the PROC statement. The randomization process is realized by the RANUNI
function at the ORDER BY statement with a seed 1234.

**proc**

**sql**outobs =

**8**;

select
*

from
class

order
by ranuni(

**1234**);**quit**;

**8.**

**Replicate a data set without data**

In PROC SQL, it is a
fairly straightforward one-line statement to create a new empty data set while
keeps all the structure of the original data set.

**proc**

**sql**;

create
table class2 like
class;

**quit**;

**9.**

**Transpose data**

Usually DATA step ARRAY and PROC TRANSPOSE
allow SAS users to restructure the data set, while PROC SQL sometimes is an
alternative solution. For instance, if we need a wide-to-long operation to list
the names of the children by their gender in the CLASS date set, then PROC SQL can
fulfill the functionality through the combinations of some queries and subqueries.

**proc**

**sql**;

select
max(case when sex='F'

then
name else ' ' end)
as Female,

max(case
when sex='M'

then
name else ' ' end)
as Male

from
(select e.sex,

e.name,

(select
count(*) from class d

where
e.sex=d.sex and e.obs < d.obs) as
level

from
class e)

group
by level;

**quit**;

**10.**

**Count the missing values**

Another
advantage of PROC SQL is that its NMISS function works for both numeric and
character variables [2], which makes PROC SQL an ideal tool for missing value
detection.

**proc**

**sql**;

select
count(*) 'Total', nmiss(weight)

'Number of missing
values for weight'

from
class;

**quit**;

**CONCLUSION**

The
combination of SAS’s powerful functions and the SQL procedure will benefit SAS
users in data management and descriptive statistics.

Nice tips! Data step can do #8 (replicate dataset without data) like:

ReplyDeletedata a;

set sashelp.class;

stop;

run;

But as usual, the SQL code probably reads better. (And I say that as a datastep die-hard : )

--Q.

Q -- Thanks. Great to know it. Charlie

DeleteCool

ReplyDeleteT -- Thanks. I try to apply what I learned from you at Orlando ^-^. Charlie

DeleteHi Tricia

Deletehow we can make a SAS and SQL connection.

Thanks

Sandeep

This is awesome, thanks! I especially like the median; it seems ridiculous that you can't just use median() as an aggregate function, but I assume there's some reason for it...

ReplyDeleteCould I suggest that you update #10 with variable names and a semicolon so it runs?

proc sql;

select count(*) as total, nmiss(weight) as missing, n(weight) as nonmissing

from class;

quit;

Max -- thanks. That is very thoughtful. Charlie

Deletetrue median is definitely annoying.

DeleteSince visuals facilitate quick and easy inferences,

ReplyDeletebut precise numbers are needed for reliable inferences,

I was dismayed that the horizontal bar chart does not display the count,

only a string of asterisks that is 4 times the Count PLUS 1.

With a little help from Alexandra Riley,

I came up with a PROC SQL horizontal bar chart drawn with this:

proc sql;

select age, '|',

count(*) as Count,

repeat('*',count(*)-1) as HorizontalBar

from sashelp.class

group by age

order by Count descending;

quit;

I always like to Show Them What's Important with ranking.

Hence, the descending.

With a huge number of bars,

and in a situation where lookup by categorical key is more important,

ranking is inappropriate.

With this data set, the counts happen to be small.

In the general case, they could be large,

and the multiplier by four would cause a problem.

With very large counts,

you might have to increase the SAS linesize,

which does have a maximum of 256.

LeRoy Bessler (Le_Roy_Bessler@wi.rr.com)

Hi! About 1. Calculate the median of a variable:

ReplyDeleteIf you look at the details in the SQL code for calculation the median,

then you find that the intermediate file is of size N*N obs,

where N is the number of obs in the SAS data set.

So this is OK for very small files. But for a file with 10000 obs,

you have an intermediate file of size 100 million obs.

/ Br Anders

Anders Sköllermo Ph.D., Reuma and Neuro Data Analyst

I noticed the same thing - we tried this on one of our 'smaller' datasets (~2.9 million records), and it took forever.

DeleteExcellent solution, but maybe PROC UNIVARIATE will get you there faster on a large dataset.

Just a reminder with SQL, count(*) and count([varname]) are not the same. The first returns a count of the number of records in the dataset/table. The second returns the count of the non missing values for the variable.

ReplyDelete