Interpreting Statistics - Relationships


If you have gathered data from a single sample on two continuous variables measured on an interval scale, then you can compute the correlation coefficient (rxy) and the regression equation. The correlation coefficient tells you how strongly the two variables are associated. The regression equation describes the best straight line that can be fit to the data; you can use the line to predict values of one variable (the dependent variable, or DV, called Y) from the other variable (the independent variable, or IV, called X).

The correlation coefficient tells how accurately we can predict one variable from another, and it ranges from -1.00 through 0.00 to 1.00. A correlation coefficient of 0.00 means that we cannot predict one variable from the other; for example, the correlation between hair color and musical ability is close to zero. As the correlation coefficient moves toward 1.00 or toward -1.00, the ability to predict one variable from the other improves, and at 1.00 (or -1.00) you can predict one variable from the other with perfect accuracy. The sign of the value describes the direction of the relationship: negative correlation coefficients are obtained when one variable increases as the other decreases, e.g., the older I get, the fewer hairs I have on my head.

As an example of correlation, consider the reliability of a test. When using correlation to measure test reliability you expect a positive correlation, because each variable is expected to change in the same manner, i.e., people who got high scores the first time should get high scores the second. Suppose we want to test the reliability of a ten-item test of knowledge about the planet Mars. We get a sample of 10 people who recently attended a lecture on the planetary system. On Monday we give them the test, and on Tuesday they take another test that is parallel to the first. We ensure that the subjects do not learn anything about Mars between testing sessions, and we do not give them any feedback about their scores on the first administration. The scores on the two administrations will not be exactly the same: people guess at answers they are not sure of, forget correct answers, or misread unclear questions, and so get slightly different scores. We will correlate the two sets of scores.

The more alike the two sets of scores are, the higher the correlation will be, and the higher the correlation, the better the reliability. For us to consider a test reliable, the correlation needs to be at least .8, and it is better if the correlation is .9 or higher. The correlation value (rxy) is computed using the sample data and formula below.
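In its standard Pearson product-moment form, the formula is

rxy = Σ(X - X̄)(Y - Ȳ) / √[ Σ(X - X̄)² · Σ(Y - Ȳ)² ]

where X̄ and Ȳ are the means of the two sets of scores.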

DATA FOR CORRELATION EXAMPLE

Subject    Test Score on Monday (X)    Test Score on Tuesday (Y)
   1                  5                            6
   2                  9                           10
   3                  8                            8
   4                  7                            6
   5                  6                            7
   6                  8                            8
   7                 10                           10
   8                  5                            6
   9                  7                            9
  10                  5                            5

 

The correlation coefficient (rxy) value of .88 means you can predict one test score from the other, and that the error of prediction is relatively small. We would conclude that this test is reliable.
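The value is easy to verify. A minimal sketch in Python, using only the standard library (the variable names are illustrative):

import math

monday = [5, 9, 8, 7, 6, 8, 10, 5, 7, 5]    # X: scores on the first administration
tuesday = [6, 10, 8, 6, 7, 8, 10, 6, 9, 5]  # Y: scores on the second administration

n = len(monday)
mean_x = sum(monday) / n
mean_y = sum(tuesday) / n

# Sum of cross-products of deviations and the two sums of squared deviations
sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(monday, tuesday))
ss_x = sum((x - mean_x) ** 2 for x in monday)
ss_y = sum((y - mean_y) ** 2 for y in tuesday)

r_xy = sp / math.sqrt(ss_x * ss_y)
print(round(r_xy, 2))  # 0.88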

The regression equation, Y' = S·X + I (where Y' is the predicted value of Y), is computed in a similar fashion to the correlation coefficient and is used to predict values of the DV (Y) from values of the IV (X). The letters S and I stand for the slope and intercept, respectively.

The slope and intercept are computed using the two equations below.
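In their standard least-squares form, the two equations are

S = Σ(X - X̄)(Y - Ȳ) / Σ(X - X̄)²

I = Ȳ - S·X̄

where X̄ and Ȳ are the means of the X and Y scores.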

You can use the regression equation to generate a straight line that shows the predicted value of Y for every value of X; this regression line is the prediction. You can compute a regression line for any two variables, but the predictions may or may not be accurate. How near the correlation is to 1 or -1 determines the strength of the relationship, and the stronger the relationship, the more accurately one variable can predict the other.
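As a sketch of how the computation goes in practice, the following Python fragment fits the line to the Monday/Tuesday scores from the reliability example and uses it to make a prediction; the slope and intercept shown in the comments follow from that example data:

monday = [5, 9, 8, 7, 6, 8, 10, 5, 7, 5]    # X
tuesday = [6, 10, 8, 6, 7, 8, 10, 6, 9, 5]  # Y

n = len(monday)
mean_x = sum(monday) / n
mean_y = sum(tuesday) / n

# Least-squares slope (S) and intercept (I)
S = (sum((x - mean_x) * (y - mean_y) for x, y in zip(monday, tuesday))
     / sum((x - mean_x) ** 2 for x in monday))
I = mean_y - S * mean_x
print(S, I)  # about 0.89 and 1.25

def predict(x):
    """Predicted Y (Tuesday score) for a given X (Monday score): Y' = S*X + I."""
    return S * x + I

print(predict(8))  # about 8.4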

 To interpret a correlation, there are two questions to be answered:

  Is the correlation coefficient significant?

   Is the correlation coefficient large enough to allow the prediction of Y from X without a large error, where error equals the difference between the actual value and the predicted value?

Question 1: It is possible to obtain a correlation between two variables by chance alone. We want to see which of two hypotheses is true:

  The null hypothesis that rxy is actually 0 even though we obtain a value larger than 0, or

  The alternative hypothesis that rxy is actually greater than 0.

Is the correlation coefficient large enough to be significant? To answer this we need to evaluate the probability associated with the correlation coefficient. Some computer programs calculate this probability for you; if you do the calculation by hand, or your program does not output a probability, you need to estimate it. Use whichever of the two methods below applies.

THE COMPUTER PROGRAM OUTPUTS A PROBABILITY - We will use .05 as the criterion probability. If the probability computed by your program is equal to or less than .05, then the probability that your correlation coefficient occurred by chance is relatively small. It is therefore reasonable to conclude that the correlation did not occur by chance and that the alternative hypothesis may be true. If your probability is larger than .05, then the probability that your correlation was obtained by chance is fairly large, and there is probably no relationship; you would then conclude that the null hypothesis (no correlation) is true.
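For example, if SciPy is available in Python, its pearsonr function returns both the correlation and the associated probability, so the .05 criterion can be applied directly. A minimal sketch using the Monday/Tuesday scores from the reliability example:

from scipy.stats import pearsonr

monday = [5, 9, 8, 7, 6, 8, 10, 5, 7, 5]
tuesday = [6, 10, 8, 6, 7, 8, 10, 6, 9, 5]

r, p = pearsonr(monday, tuesday)
if p <= .05:
    print(f"r = {r:.2f}, p = {p:.4f}: significant; the correlation is unlikely to be due to chance")
else:
    print(f"r = {r:.2f}, p = {p:.4f}: not significant; conclude the null hypothesis (no correlation)")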

NO PROBABILITY PROVIDED - Use Table 1, Critical Values of the Correlation Coefficient, to estimate the probability associated with your correlation coefficient. The degrees of freedom (df) are n-2 (where n is the number of subjects). Find the degrees of freedom in the column labeled df = n-2 (if the exact df value is not shown, either interpolate between the listed values or select the next smallest df value), then go across to find the critical value in the column labeled a = .05. Alpha (a) is the probability of obtaining a correlation coefficient as large as yours by chance alone. The sign on the correlation value (+ or -) indicates the direction of the relationship; use the absolute value of the correlation when determining significance.

If the absolute value of the correlation coefficient from the data is the same size as or larger than the critical value from Table 1, then the following statements are true: the probability that the data correlation coefficient was obtained by chance is .05 or smaller, i.e., at most one time in twenty would a correlation as large as yours occur by chance. Since this is a small probability, you can conclude that the data correlation did not occur by chance, that it indicates a significant relationship, and that the alternative hypothesis may be true.

If the data correlation coefficient is smaller than the tabled value, then the probability that the correlation was obtained by chance is larger than .05, and there is probably no relationship. You can then conclude that the null hypothesis is true.
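The same decision can be made mechanically. The sketch below hard-codes the critical values from Table 1 and applies the rule just described; the function name and the fallback to the next smallest df are simply one reasonable way to code it:

# Critical values of r at a = .05, keyed by df = n - 2 (Table 1)
CRITICAL_R = {
    1: .997, 2: .950, 3: .878, 4: .811, 5: .754, 6: .707, 7: .666, 8: .632,
    9: .602, 10: .576, 12: .532, 14: .497, 16: .468, 18: .444, 20: .423,
    25: .381, 30: .349, 35: .325, 40: .304, 45: .288, 50: .273, 60: .250,
    70: .232, 80: .217, 90: .205, 100: .195,
}

def is_significant(r, n):
    """True if |r| meets or exceeds the tabled critical value for df = n - 2."""
    df = n - 2
    if df not in CRITICAL_R:
        df = max(d for d in CRITICAL_R if d <= df)  # next smallest tabled df
    return abs(r) >= CRITICAL_R[df]

print(is_significant(.88, 10))  # True: .88 exceeds the critical value of .632 at df = 8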

Question 2: How accurate is the prediction? If the correlation is significant, we need to determine the strength of the relationship. The correlation coefficient is difficult to interpret in a straightforward manner. Recall that when we discussed measures of association, the value told us how much our error of guessing was reduced when we used age to predict opinion. We can convert rxy to an estimate of error reduction; the relationship of the correlation coefficient to error reduction is illustrated in Table 2.

Whether our obtained correlation coefficient is large enough to support our hypothesis depends on how much accuracy of prediction we desire. As can be seen in Table 2, the error of prediction is not reduced greatly until the correlation is around .7. In research with human subjects, a reduction in error of prediction of about 30% (a correlation of .7 or higher) indicates a strong relationship between Y and X. While 30% is not a large reduction, it does indicate that X is related to Y even though it is not a perfect predictor; it also means that there are other independent variables, not studied, that affect the dependent variable. If you need to predict with great accuracy, say in diagnosing a disease, even a correlation of .95 may not be high enough.

Table 1. Critical Values of the Correlation Coefficient

df = (n-2)    a = .05
     1         .997
     2         .950
     3         .878
     4         .811
     5         .754
     6         .707
     7         .666
     8         .632
     9         .602
    10         .576
    12         .532
    14         .497
    16         .468
    18         .444
    20         .423
    25         .381
    30         .349
    35         .325
    40         .304
    45         .288
    50         .273
    60         .250
    70         .232
    80         .217
    90         .205
   100         .195

 

Adapted from: Fisher & Yates, Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd Ltd.

 

Table 2. Relationship Between Correlation
Coefficient and Reduction in Error of Prediction

When rxy equals    Error of prediction is reduced approximately
     .1                               1%
     .2                               2%
     .3                               5%
     .4                               8%
     .5                              13%
     .6                              20%
     .7                              29%
     .8                              40%
     .9                              56%
     .95                             69%
     .99                             86%
     .999                            96%


The above table is based on an explanation provided by Judah Rosenblatt (2001, pp. 49-50). Any errors in application are the author's, not Rosenblatt's.
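The tabled values are consistent with taking the reduction in the standard error of prediction to be 1 - √(1 - rxy²). Assuming that is how the table was built (an assumption, not something stated above), the entries can be reproduced to within rounding with a short Python loop:

import math

def error_reduction(r):
    """Approximate proportional reduction in error of prediction: 1 - sqrt(1 - r**2)."""
    return 1 - math.sqrt(1 - r ** 2)

for r in [.1, .2, .3, .4, .5, .6, .7, .8, .9, .95, .99, .999]:
    print(f"r = {r:<5}  error reduced by about {error_reduction(r):.0%}")
# e.g., r = .5 -> 13%, r = .7 -> 29%, r = .9 -> 56%, matching Table 2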


Thus, to interpret a correlation we combine our answers to the two questions into four categories:

Non-significant correlation: the predictions are meaningless.

Significant correlation below .7: a real but weak relationship, with much error of prediction.

Significant correlation between .7 and .9: a strong relationship, with moderate error of prediction.

Significant correlation above .9: a very strong relationship, with good accuracy of prediction.

The above rules only hold when the sample size is larger than ten. If you have fewer than ten subjects it is best not to use correlation.
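Putting the two questions together, the four categories can be expressed as a small helper function. This is a sketch only: the .7 and .9 cutoffs and the fewer-than-ten-subjects caveat come from the text above, while the function name, its arguments, and the example p-value are illustrative:

def interpret_correlation(r, p, n, alpha=.05):
    """Classify a correlation using the significance test and the .7/.9 strength cutoffs."""
    if n < 10:
        return "Fewer than ten subjects: it is best not to use correlation"
    if p > alpha:
        return "Non-significant correlation: meaningless predictions"
    if abs(r) < .7:
        return "Significant but weak relationship: much error of prediction"
    if abs(r) < .9:
        return "Significant, strong relationship: moderate error of prediction"
    return "Significant, very strong relationship: good accuracy of prediction"

# p here is illustrative; use the probability your program reports
print(interpret_correlation(r=.88, p=.001, n=10))  # "Significant, strong relationship: ..."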