Descriptive Statistics
in Excel
Usually when you create
or acquire a set of numbers you will want to examine the data to learn
more about its distribution and to discover information such as the
minimum and maximum values and to determine if there are outliers. This is
an important step in any analysis since it helps you understand if you
data meet assumptions required by other analyses such as ttests and
regression.
For this example, we’ll look at the data set called
EXAMPLE.XLS. The first few records are shown here:
GROUP 
AGE 
TIME1 
TIME2 
TIME3 
TIME4 
STATUS 
A 
12 
22.3 
25.3 
28.2 
30.6 
5 
A 
11 
22.8 
27.5 
33.3 
35.8 
5 
B 
12 
22.8 
30.0 
32.8 
31.0 
4 
A 
12 
18.5 
26.0 
29.0 
27.9 
5 
B 
9 
19.5 
25.0 
25.3 
26.6 
5 
B 
11 
23.5 
28.8 
34.2 
35.6 
5 
C 
8 
22.6 
26.7 
28.0 
33.4 
3 
B 
8 
21.0 
26.7 
27.5 
29.5 
5 
As an initial example, we’ll examine the variable
AGE.
1.
In Excel, select Tools/Data Analysis/Descriptive Statistics.
(If the Data Analysis option is not on your Tools menu, you must first
install it using Tools/Add ins…)
2.
Select the input range for the AGE variable. In this case it is
$B$2:$B$51.
3.
Be sure to select the check boxes Summary Statistics and Confidence
level for mean (95% is okay). The output created is shown here:
Column1 


Mean 
10.46 
Standard Error 
0.343107052 
Median 
11 
Mode 
12 
Standard
Deviation 
2.42613323 
Sample Variance 
5.886122449 
Kurtosis 
0.261061479 
Skewness 
0.511921947 
Range 
11 
Minimum 
4 
Maximum 
15 
Sum 
523 
Count 
50 
Confidence
Level(95.0%) 
0.689499422 
Information you should
notice includes:
1.
Search for outliers: Look at the Minimum and Maximum values
to see if these values fall within your expected range for these data. If
a value is unexpectedly small or large, you should examine your original
data to see if it was miscoded. If there are corrections that need to be
made, make them before continuing. If you have values that are
unexpectedly large are small, but are actual values, it may indicate that
your data are not normally distributed. In that case you may consider
using nonparametric procedures in further analyses. It may also indicate
that the mean is not the best value to report to describe the central
tendency of this data set.
2.
Symmetry: Another measure that helps you decide normality is
Skewness and Kurtosis. The Skewness measure indicates the level of
nonsymmetry. If the distribution of the data are symmetric then skewness
will be close to 0 (zero). The further from 0, the more skewed the data.
A negative value indicates a skew to the left. How do you tell if the
skewness is large enough to case concern. Excel doesn’t give you this
value, but a measure of the standard error of skewness can be calculated
as =SQRT(6/N) or =SQRT(6/50) which is 0.346. If the skewness is more than
twice this amount, then it indicates that the distribution of the data is
nonsymmetric. In this case 0.346 * 2 = 0.69. The skewness reported by
Excel is 0.512, so the data can be assumed to be fairly symmetric
(although somewhat marginally so.) However, this does NOT indicate that
the data are normally distributed.
3.
Kurtosis is a measure of the peakedness of the data. Again,
for normally distributed data the kurtosis is 0 (zero). As with skewness,
if the value of kurtosis is too big or too small, there is concern about
the normality of the distribution. In this case, a rough formula for the
standard error for kurtosis is =SQRT(24/N) = 0.692. Twice this amount is
1.39. Since the value of kurtosis falls within two standard errors (0.26)
the data may be considered to meet the criteria for normality by this
measure. These measures of skewness and kurtosis are one method of
examining the distribution of the data. However, they are not definitive
in concluding normality. You should also examine a graph (histogram) of
the data and consider performing other tests for normality such as the
ShapiroWilk or KolmogorovSmirnov test (not provided by Excel).
4.
Estimate of central tendency: For normally distributed data
the mean (arithmetic average) is the typical value to use in a
report or journal article. However, this point estimate is of limited
value without some estimate of the varibility of the data. Therefore you
should at least report three values – the mean, the standard error of the
mean, and the sample size. In this case, those values are 10.5, 0.34 and
50. (Typically report means to one more decimal place than what was
measured and the standard error to two decimal places beyond the unit
measure.) Many journals also prefer that you report a confidence limit on
the mean. This is calculated (estimated) as MEAN + or – t_{(.05,N1)}
* SEM. The value of t is reported by Excel as the confidence level =
0.6895. Thus, the lower bound for the 95 % CI is given as = 10.46 – 0.6895
= 9.77 and the upper limit – 10.46 + 0.6895 = 11.15. The median is
another measure of central tendency and is usually reported when the data
are not normally distributed. The mode, or the most frequent value,
is a third measure of central tendency. In the case of the median or mode,
the range is often given as a measure of variability, although a better
measure is the interquartile range (not reported by Excel).
5.
Measures of variability: The measures of variability
reported in Excel’s descriptive statistics include the standard error, the
standard deviation and the variance as well as the range. The first three
measures are related in the following ways:
Standard Deviation =
SQRT(Variance)
Standard Error = Standard
Deviation / SQRT(N)
6.
Coefficient of Variation: Another measure often reported is
the coefficient of variation. This measure provides a unitless measure of
the variation of the sate by translating it into a percentage of the mean
value. This measure not provided by Excel, but is easily calculated by the
formula:
CV = (Standard Deviation /
Mean) * 100
For this data, the value,
expressed as a percentage, is
CV = (0.343/10.46) * 100 =
3.279
This CV value can be used
when comparing two samples that have different means and standard
deviations. When the mean is close to 0, the CV value becomes of little
use.
7.
Visualizing your data: It is always a good idea to examine
your data visually to understand what’s going on. To produce a histogram
of the AGE variable, select Tools/Data Analysis/Histogram. Select the
range for the AGE variable, $B$2:$B$51, and check the option Chart output.
This produces the following table and chart:
4 
1 
5.571429 
0 
7.142857 
6 
8.714286 
4 
10.28571 
11 
11.85714 
8 
13.42857 
17 
More 
3 
This is not the best
looking histogram because Excel selects its own bin sizes (class
intervals) on the Xaxis, which in this case are less than desirable. You
can improve on the plot by creating a list of bin sizes such as in the
table below:
Bin Sizes 
5.00 
7.00 
9.00 
11.00 
13.00 
15.00 
Redo the histogram, this
time indicating the bin sizes as the “Bin Range.” This produces the
following histogram:
This plot is much cleaner
looking and suitable for reporting. It shows that the data are visually
not symmetric with more values appearing to the right of the mean (10.46)
than we would expect in normally distributed data.
This tutorial shows ways
for you to examine a variable for normality, symmetry, and to visually
inspect the distribution of your data. These are important beginning
exercises you should perform on your data before using it any other
analysis.
End of
tutorial
See
http://www.stattutorials.com/EXCEL