The correlation coefficient allows researchers to determine if there is a
possible linear relationship between two variables measured on the same
subject (or entity). When these two variables are of a continuous nature (they
are measurements such as weight, height, length, etc.) the measure of
association most often used is Pearson’s correlation coefficient.
This association may be expressed as a number (the correlation coefficient)
that ranges from –1 to +1. The population correlation is usually denoted by
the Greek letter rho (ρ), and
the sample statistic (the sample correlation coefficient) is denoted by r.
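For reference, the sample statistic r is conventionally computed from n paired observations as sketched below. The symbols n, x_i, y_i and the sample means x-bar and y-bar are notation introduced here for illustration; the tutorial itself does not define them.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Because the numerator is scaled by the spread of both variables, r has no units and always falls between –1 and +1.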
The correlation measures how well a straight line fits through a scatter of
points when plotted on x-y axes. If the correlation is positive, it
means that when one variable increases, the other tends to increase. If the
correlation is negative, it means that when one variable increases, the other
tends to decrease. When a correlation coefficient is close to +1 (or –1), it
means that there is a strong correlation – the points are scattered along a
straight line. For example, a correlation r = 0.7 may be considered
strong. However, the closer a correlation coefficient gets to 0, the weaker
the relationship, where the cloud (scatter) of points is not close to a
straight line. For example, a correlation r = 0.1 might be considered
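weak. In scientific work, a t-test is used to determine whether the
correlation coefficient is statistically “significant” (that is, whether it
differs from 0 by more than chance would explain). This is a separate
question from whether the correlation is “strong”. The test will be
discussed later.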
Assumptions: Before using the Pearson correlation coefficient as a
measure of association, you should be aware of its assumptions and
limitations. As mentioned earlier, this correlation coefficient measures a
linear relationship. That is, it measures how closely the paired
measurements follow a straight line when plotted on
an x-y chart. Therefore, it is important that data be graphed before the
correlation is interpreted. For example, it is possible that data, when
plotted, may show a curved relationship instead of a straight line. When this
is the case, a Pearson correlation may not be the best measure of association.
There are other situations in which a correlation coefficient may appear
important but, when considered in light of a graph, is not a good measure of
the relationship. The following graphs all have a
correlation coefficient of about 0.72, yet most do not fit the assumption
of a linear relationship. To avoid misinterpreting a correlation, always
accompany the calculation with a graph.
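The point about curved relationships can be tried out with a small made-up data set. The sketch below is only an illustration: the data set name CURVED and the variables X and Y are hypothetical, and it assumes a SAS release in which PROC SGPLOT is available.

```sas
/* Hypothetical data: Y is completely determined by X, but not linearly */
data curved;
   do x = -5 to 5 by 0.5;
      y = x**2;
      output;
   end;
run;

/* Graph the data first: the scatterplot shows a clear U-shaped curve */
proc sgplot data=curved;
   scatter x=x y=y;
run;

/* Then compute the Pearson correlation for the same data */
proc corr data=curved;
   var x y;
run;
```

Because the X values are symmetric around 0, the Pearson correlation for these data is essentially 0 even though Y depends entirely on X. Only the graph reveals the relationship.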
Another assumption of correlation is that both of the variables (the
measurements) be continuous data measured on an interval/ratio scale. Data
that are not continuous, such as categorical (e.g., hair color) or binary
(e.g., gender) data, would not be acceptable. Also, each variable should be
approximately normally distributed.
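One possible way to check the normality assumption is with PROC UNIVARIATE. The sketch below assumes the MYDATA.SOMEDATA data set and the AGE, TIME1 and TIME2 variables used in the example later in this tutorial. Substitute your own data set and variables.

```sas
/* NORMAL requests tests for normality for each variable in the VAR list */
proc univariate data=mydata.somedata normal;
   var age time1 time2;
   /* Histogram for each variable with a fitted normal curve overlaid */
   histogram age time1 time2 / normal;
run;
```

A histogram that is strongly skewed, or a normality test with a very small p-value, suggests that a nonparametric measure such as the Spearman correlation may be more appropriate.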
The SAS procedure most often used to calculate correlations is PROC CORR. The
syntax for this procedure is:
PROC CORR <options>;
   <statements>;
The most commonly used option is
   DATA=datasetname
The most commonly used statements are:
   VAR variablelist;
   BY varlist;
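As a rough sketch of how the BY statement might be used, the program below calculates a separate correlation table for each level of a grouping variable. GROUP is a hypothetical variable name used only for illustration, and the data set SORTED is a temporary name chosen here. BY-group processing expects the data to be sorted by the BY variable, so the data are sorted first.

```sas
/* GROUP is a hypothetical classification variable; substitute one from your data */
proc sort data=mydata.somedata out=sorted;
   by group;
run;

/* One table of correlations is produced for each value of GROUP */
proc corr data=sorted;
   var age time1 time2;
   by group;
run;
```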
As an example, to find the correlations between variables in the SOMEDATA data
set, use the following program (PROCCORR1.SAS), which also requires the file
SOMEDATA.SAS7BDAT:
* ASSUMES YOU HAVE A SAS LIBRARY NAMED MYDATA
* THAT INCLUDES THE FILE SOMEDATA.SAS7BDAT;
ods rtf;
PROC CORR data=mydata.somedata;
   VAR AGE TIME1 TIME2;
   TITLE 'Example correlation calculations using PROC CORR';
run;
ods rtf close;
The (partial) output from this program is:
Pearson Correlation Coefficients, N = 50
Prob > |r| under H0: Rho=0

                              AGE       TIME1       TIME2
AGE                       1.00000     0.50088     0.38082
Age on Jan 1, 2000                     0.0002      0.0064

TIME1                     0.50088     1.00000     0.76396
Baseline                   0.0002                  <.0001

TIME2                     0.38082     0.76396     1.00000
6 Months                   0.0064      <.0001

The output includes descriptive statistics on each variable and a table of
Pearson Correlation Coefficients (r). For example, the correlation
between AGE and TIME1 is 0.50088, or r=0.50088. The number under each
correlation is a p-value. It tests whether r is statistically
significantly different from 0. In statistical terminology, this is a test
of the following hypotheses:
H0: rho = 0 (the null hypothesis)
Ha: rho ≠ 0 (the alternative hypothesis)
If the p-value for the test is small (usually less than 0.05), then the
conclusion is that rho is not 0, and thus the relationship is
statistically significant. A researcher will then have to make a
professional judgment to determine whether the association is important in
terms of the experiment performed.
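As a worked illustration, the p-value for a Pearson correlation is conventionally obtained from a t statistic with n – 2 degrees of freedom. The arithmetic below uses the AGE and TIME1 correlation from the output above (r = 0.50088, N = 50). The symbols t and n are notation introduced here, and the result is rounded.

```latex
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}
  = \frac{0.50088\,\sqrt{48}}{\sqrt{1-0.50088^{2}}}
  \approx \frac{3.470}{0.8655}
  \approx 4.01, \qquad \text{df} = n - 2 = 48
```

A t value of about 4 on 48 degrees of freedom is well into the tail of the t distribution, which is consistent with the small p-value (0.0002) printed under that correlation.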
Care must be taken when interpreting a statistically significant correlation.
If your sample size is small or not representative of the population from
which you sampled, you may not be able to generalize the correlation to your
intended population. Also, a cause and effect relationship cannot be
inferred except under special conditions when you have designed the study
specifically to detect those phenomena.
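One way to see how much uncertainty a given sample size leaves in r is to request confidence limits for the correlation. The sketch below uses the FISHER option of PROC CORR, which reports confidence limits based on Fisher's z transformation. The 95% level (ALPHA=0.05) and the two-sided interval are choices made here for illustration.

```sas
/* FISHER adds confidence limits for each correlation, based on Fisher's z */
proc corr data=mydata.somedata fisher(alpha=0.05 type=twosided);
   var age time1 time2;
run;
```

A wide interval is a reminder that the sample estimate may be far from the population correlation, particularly when the sample is small.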
Note – to have the program output both PEARSON and SPEARMAN (nonparametric)
correlations, use the statement:
PROC CORR data=mydata.somedata PEARSON SPEARMAN;
To observe a scatterplot for each correlation, use this slight variation on
the program (PROCCORR2.SAS). Notice the addition of the ODS GRAPHICS
statements and PLOTS=MATRIX.
ODS RTF;
ODS GRAPHICS ON;
PROC CORR DATA=MYDATA.SOMEDATA PLOTS=MATRIX;
   VAR AGE TIME1 TIME2;
   TITLE 'Example correlation calculations using PROC CORR';
RUN;
ODS RTF CLOSE;
ODS GRAPHICS OFF;
This produces the following matrix of scatterplots:
Note that in this
plot the upper and lower halves show the same pairs of variables with the
axes swapped, so you really only have to look at half of it.
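If the duplicated panels in the matrix are a distraction, one possible alternative is to request individual scatterplots for selected pairs only. The sketch below pairs AGE with TIME1 and TIME2 by using the WITH statement. The choice of pairs and the ELLIPSE=NONE setting are illustrative assumptions, not part of the original program.

```sas
ODS GRAPHICS ON;
/* PLOTS=SCATTER draws one scatterplot per VAR/WITH pair;   */
/* ELLIPSE=NONE suppresses the default ellipse on each plot */
PROC CORR DATA=MYDATA.SOMEDATA PLOTS=SCATTER(ELLIPSE=NONE);
   VAR AGE;
   WITH TIME1 TIME2;
RUN;
ODS GRAPHICS OFF;
```

With a WITH statement, PROC CORR reports only the correlations between the WITH variables and the VAR variables, so each pair appears once.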
End of tutorial
See http://www.stattutorials.com/SAS