### Correlation

#### Introduction

Correlational analysis is a technique used to measure the association between two variables, and it is available in most statistical packages. A **correlation coefficient** (**r**) is a statistic that measures the strength of a supposed linear association between two variables. The most common is the **Pearson** correlation coefficient, although other types are available. A correlation coefficient generally ranges from -1 to +1.
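For reference, the Pearson coefficient for $n$ paired observations is usually defined as shown here. The symbols $x_i$, $y_i$, $\bar{x}$ and $\bar{y}$ (the paired values and their means) are notation introduced for this illustration and are not used elsewhere in this document.

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

The numerator is positive when the two variables tend to rise together and negative when one falls as the other rises; the denominator scales the result so that it lies between -1 and +1.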

#### Learning Outcomes

After studying this document you should be able to do the following:

- Conduct and interpret a correlation analysis using interval data.
- Conduct and interpret a correlation analysis using ordinal data.
- Conduct and interpret a correlation analysis using categorical data.

#### Scatterplot

The existence of a statistical association between two variables is most apparent in a diagram called a scatterplot. A scatterplot is simply a cloud of points, each point representing one pair of values of the two variables under investigation. The diagrams below show scatterplots of sets of data with varying degrees of linear association.

**Scatterplots of sets of data with varying degrees of linear association**


Figure 1 shows a perfect positive linear association between the two variables, and the coefficient of correlation r is +1. For Figure 2, the association is perfect and negative, and r is -1. In Figure 3, the two variables do not show any degree of linear association at all, and r = 0. The scatterplot of Figure 4 shows some degree of association between the two variables, and r is about +0.65. From the scatterplot, we can see whether there is a linear association between the two variables and make a rough guess at the value of the correlation coefficient. After looking at the scatterplot, we then confirm the association by conducting a correlation analysis. The reverse does not work: from the correlation coefficient alone, we cannot say much about the form of the association between the two variables, which is why the scatterplot should be examined first.
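The extreme cases described above can be checked numerically. The following is a minimal Python sketch, not part of the SPSS procedure; the function name `pearson_r` and the data are made up for illustration and are not the data behind Figures 1 to 4.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative data (an assumption, not taken from the figures).
x = [1, 2, 3, 4, 5]
perfect_positive = [2, 4, 6, 8, 10]  # y rises exactly in step with x
perfect_negative = [10, 8, 6, 4, 2]  # y falls exactly in step with x
u_shaped = [4, 1, 0, 1, 4]           # y depends on x, but not linearly

for label, y in [("perfect positive", perfect_positive),
                 ("perfect negative", perfect_negative),
                 ("U-shaped", u_shaped)]:
    print(label, round(pearson_r(x, y), 3))
```

The U-shaped series gives r = 0 even though y is completely determined by x, which illustrates why a coefficient near zero only rules out a linear association and why the scatterplot is worth inspecting first.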

#### How to conduct and interpret a correlation analysis using interval data

Suppose you are interested in finding out whether there is an association between people's monthly expenditure and income. To investigate this, you collected data from ten subjects, as shown in Table 1 below.

**Table 1: Set of paired data**


#### Preparing the Data set

Start SPSS, define the variable names *income* and *expend*, and use the **Define Labels** procedure to provide fuller names such as *Income / month* and *Expenditure / month*. Type in the data and save the file under a suitable name. Before conducting the correlation analysis, it is advisable to produce a scatterplot of the two variables.

To produce the scatterplot choose:

**Graphs**

**Scatter**

The **Scatterplot** selection box will be loaded on the screen as shown below, with **Simple** scatterplot selected by default. Click on **Define** to specify the axes of the plot. Enter the variable names *income* and *expend* into the **y-axis** and the **x-axis** boxes, respectively. Click on **OK**.

**The Scatterplot selection box**


The scatterplot is shown below and it seems to indicate a linear association between the two variables.

**Scatterplot Income/month against Expenditure/month**


To produce the correlation analysis choose:

**Analyze**

**Correlate**

**Bivariate**

This will open the **Bivariate Correlation** dialog box as shown below. Transfer the two variables to the **Variables** text box.

**The Bivariate Correlation dialog box**


Click on **Options** and the **Bivariate Correlation: Options** dialog box will be loaded on the screen as shown below. Click on the **Means and Standard Deviations** check box. Click on **Continue** and then **OK** to run the procedure.

**The Bivariate Correlation: Options dialog box**


Let us now look at the output listing.

#### Output Listing of Pearson Correlation Analysis

The output listing starts with the means and standard deviations of the two variables, as requested in the **Options** dialog box. This result is shown in the table below.

The next table from the output listing, shown below, gives the value of the correlation coefficient along with its p-value. The correlation coefficient is 0.803 and the p-value is 0.005. From these values, it can be concluded that the correlation is significant beyond the 1 per cent level. In other words, people with a high monthly income are also likely to have a high monthly expenditure.
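The link between the coefficient and its significance can be sketched by hand. The Python snippet below is an illustration only, assuming a two-tailed test of the null hypothesis of zero correlation; it uses the standard t statistic for a correlation coefficient with n - 2 degrees of freedom. The function name `t_statistic_for_r` is hypothetical, and the critical value is taken from standard t tables rather than computed.

```python
import math

def t_statistic_for_r(r, n):
    """t statistic for testing a zero population correlation (df = n - 2)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r = 0.803  # correlation coefficient reported in the output listing
n = 10     # number of subjects in the example
t = t_statistic_for_r(r, n)

# Two-tailed critical value of t for 8 degrees of freedom at the
# 1 per cent level, as given in standard t tables (assumed here).
T_CRITICAL_1_PERCENT_DF8 = 3.355

print(f"t = {t:.2f} on {n - 2} degrees of freedom")
print("significant at the 1 per cent level:", abs(t) > T_CRITICAL_1_PERCENT_DF8)
```

With the reported values, t comes out at about 3.81, which exceeds the critical value and so agrees with the reported p-value of 0.005.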

#### How to conduct and interpret a correlation analysis using ordinal data

The **Pearson** correlation analysis demonstrated above is only suitable for interval data. With other types of data, such as ordinal or nominal data, other methods of measuring association between variables must be used. Ordinal data are either ranks or records of ordered category membership, and nominal data are records of qualitative category membership. A brief introduction to types of data can be found under *Some Common Statistical Terms*, which is located under *Documentation* on the Content page on the left.

Suppose you are a psychology student. Twelve books dealing with the same psychological topic have just been published by 12 different authors. You and a friend were asked to rank the books according to how well the authors covered the topic. The rankings are shown in Table 2 below. Is there any association between the rankings given by the two students?

**Table 2: Ranks assigned by two students to each of twelve books**


#### Preparing the Data set

In the **Data Editor** grid of SPSS, define the two variables, *student1* and *student2*. Enter the data from Table 2 into the respective columns.

To obtain the correlation coefficient follow these instructions:

Choose

**Analyze**

**Correlate**

**Bivariate**

This will open the **Bivariate Correlation** dialog box. See diagram above. Select the **Kendall's tau-b** and the **Spearman** check boxes. Notice that by default the **Pearson** box is selected. Click on **OK** to run the procedure.

#### Output Listing of Spearman and Kendall rank correlation

The two tables from the output listing are shown below. Notice that the **Pearson** and the **Spearman** correlation coefficients are exactly the same, 0.965, and both are significant beyond the 1 per cent level. This is expected: the Spearman coefficient is the Pearson coefficient computed on ranks, and the data here are already ranks. The **Kendall** correlation coefficient is 0.848 and is also significant beyond the 1 per cent level. The difference between the **Spearman** and the **Kendall** coefficients arises because the two are defined differently, so it is not a cause for concern.

The association between the two sets of ranks is significant, indicating that the two students ranked the twelve books in a similar way. In fact, close examination of the data in Table 2 shows that the ranks assigned by the two students to any one book differ by at most a single rank.
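To see why the two rank coefficients can differ, it helps to compute both by hand. The Python sketch below uses hypothetical rankings, not the data of Table 2, constructed so that no book differs by more than one rank between the two students. The function names are illustrative, and both formulas assume there are no tied ranks.

```python
from itertools import combinations

def spearman_rho(rank_a, rank_b):
    """Spearman's rho for untied rankings, from squared rank differences."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

def kendall_tau(rank_a, rank_b):
    """Kendall's tau for untied rankings: (concordant - discordant) / pairs."""
    concordant = discordant = 0
    for (a1, b1), (a2, b2) in combinations(zip(rank_a, rank_b), 2):
        product = (a1 - a2) * (b1 - b2)
        if product > 0:
            concordant += 1   # both students order this pair the same way
        elif product < 0:
            discordant += 1   # the students order this pair oppositely
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical rankings of twelve books (an assumption, not Table 2):
# five adjacent pairs of books are swapped by the second student.
student1 = list(range(1, 13))
student2 = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 11, 12]

print("Spearman:", round(spearman_rho(student1, student2), 3))
print("Kendall: ", round(kendall_tau(student1, student2), 3))
```

Spearman's coefficient penalises the size of each rank difference, whereas Kendall's counts how many pairs of books the two students order in opposite ways, so the two values need not coincide even when both indicate strong agreement.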

#### How to conduct and interpret a correlation analysis using categorical data

Suppose that 150 students (75 boys and 75 girls) starting at a university are asked whether they prefer art or science degrees. We can hypothesise that boys will prefer science degrees and girls art degrees. There are two nominal variables here: *group* (boys or girls) and *student's choice* (art or science). The null hypothesis is that there is no association between the two variables. The table below shows the students' choices.

**Table 3: A contingency table**


Close examination of Table 3 suggests that there is an association between the two variables. The majority of the boys chose science degrees, while the majority of the girls chose art degrees.

#### Preparing the Data set

You need to define three variables here: two coding variables, *group* and *choice*, and a third variable, *count*, which holds the frequency for each combination of group and choice of degree. Note that no individual can fall into more than one combination of categories. In the *group* variable, use the code numbers *1* and *2* to represent boys and girls respectively. Similarly, in the *choice* variable, use the values *1* and *2* to represent art and science degrees respectively. Type the data into the three columns as shown below.

**Showing coding of data in Data Editor**


Before we can proceed, we need to tell SPSS that the data in the *count* column represent cell frequencies and not individual values. To do this, follow these instructions.

Choose

**Data**

**Weight Cases**

The **Weight Cases** dialog box will be loaded on the screen as shown below. Select the item **Weight cases by**. Click on the variable *count* and on the arrow (>) to transfer it into the **Frequency Variable** text box. Click on **OK**.

**The Weight Cases dialog box**

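The effect of weighting cases can be sketched outside SPSS. In the Python snippet below, each coded row stands for as many students as its count, which is what weighting by *count* achieves. The counts are hypothetical (chosen to give 75 boys and 75 girls) and are not the data of Table 3.

```python
from collections import Counter

# Hypothetical coded rows (group, choice, count); NOT the data of Table 3.
# group: 1 = boys, 2 = girls; choice: 1 = art, 2 = science.
rows = [
    (1, 1, 25),
    (1, 2, 50),
    (2, 1, 55),
    (2, 2, 20),
]

# Weighting by count: each row represents `count` identical cases.
table = Counter()
for group, choice, count in rows:
    table[(group, choice)] += count

unweighted_cases = len(rows)          # what is seen without weighting
weighted_cases = sum(table.values())  # what is seen after weighting
boys_total = sum(n for (group, _), n in table.items() if group == 1)

print("cases without weighting:", unweighted_cases)
print("cases with weighting:   ", weighted_cases)
print("boys:", boys_total)
```

Without weighting, the analysis would treat the data file as four cases rather than 150 students, and the resulting cross tabulation would be meaningless.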

To analyse the contingency table data, choose

**Analyze**

**Summarize**

**Crosstabs**

The **Crosstabs** dialog box will be loaded on the screen as shown below. Click on the variable *group* and on the top arrow (>) to transfer *group* into the **Row(s)** text box. Click on the variable *choice* and then on the middle arrow (>) to transfer *choice* into the **Column(s)** text box.

**The completed Crosstabs dialog box**


Click on **Statistics** to open the **Crosstabs: Statistics** dialog box. See diagram below. Select the **Chi-square** and **Phi and Cramer's V** check boxes. Click on **Continue** to return to the **Crosstabs** dialog box.

**The completed Crosstabs: Statistics dialog box**


Click on **Cells** at the foot of the **Crosstabs** dialog box to open the **Crosstabs: Cell Display** dialog box. See diagram below. Select the **Expected** check box. Click on **Continue** and then **OK** to run the procedure. We request the expected cell frequencies in order to check that the minimum requirement for the valid use of chi-square is met, i.e. no expected cell frequency should be less than 5.

**The Crosstabs: Cell Display dialog box**


#### Output Listing for Cross tabulation

The first table from the output listing shown below gives a summary of variables and the number of cases.

The table below shows the observed and expected frequencies as requested in the **Crosstabs: Cell Display** dialog box. Notice that none of the expected frequencies is less than 5.
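The expected frequencies follow from the row and column totals alone: each is the row total multiplied by the column total, divided by the grand total. The Python sketch below shows the calculation and the check against the minimum of 5. The observed counts are hypothetical and are not the data of Table 3; the function name `expected_frequencies` is illustrative.

```python
# Hypothetical observed counts; NOT the data of Table 3.
observed = {
    ("boys", "art"): 25, ("boys", "science"): 50,
    ("girls", "art"): 55, ("girls", "science"): 20,
}

def expected_frequencies(observed):
    """Expected count per cell = row total * column total / grand total."""
    row_totals, col_totals = {}, {}
    for (row, col), count in observed.items():
        row_totals[row] = row_totals.get(row, 0) + count
        col_totals[col] = col_totals.get(col, 0) + count
    grand_total = sum(observed.values())
    return {
        (row, col): row_totals[row] * col_totals[col] / grand_total
        for (row, col) in observed
    }

expected = expected_frequencies(observed)
for cell, value in expected.items():
    print(cell, "observed", observed[cell], "expected", value)

# Rule of thumb used in this document: no expected frequency below 5.
print("all expected frequencies at least 5:", min(expected.values()) >= 5)
```

Note that the check applies to the expected frequencies, not the observed ones; a cell may contain few observed cases and still satisfy the requirement.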

The table below gives the Chi-square statistics for the contingency table. It can be concluded that there is a significant association between the variables *group* and *choice*, as shown by the p-value (less than 0.01).

The **Phi** and **Cramer's V** coefficients (shown in the table below), both 0.401, give the strength of the association between the two variables.
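The chi-square statistic and the phi coefficient are closely related, as the following Python sketch shows. It assumes a 2 x 2 table, the Pearson chi-square without continuity correction, and hypothetical counts that are not the data of Table 3. For a 2 x 2 table, Cramer's V has the same magnitude as phi, and with one degree of freedom the p-value can be obtained from the complementary error function.

```python
import math

# Hypothetical 2 x 2 table of observed counts; NOT the data of Table 3.
# Rows: boys, girls. Columns: art, science.
observed = [[25, 50],
            [55, 20]]

def chi_square_and_phi(observed):
    """Pearson chi-square (no continuity correction) and |phi| for a 2 x 2 table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            chi_square += (obs - exp) ** 2 / exp
    # Magnitude of phi; this sketch does not recover its sign.
    phi = math.sqrt(chi_square / n)
    return chi_square, phi

chi_square, phi = chi_square_and_phi(observed)

# With 1 degree of freedom, chi-square is a squared standard normal value,
# so the upper-tail probability is erfc(sqrt(chi_square / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.2f}, phi = {phi:.3f}")
print("significant at the 1 per cent level:", p_value < 0.01)
```

Because phi is the square root of chi-square divided by the number of cases, it expresses the strength of the association on a scale that does not grow with sample size, whereas chi-square itself does.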

#### Conclusion

You should now be able to perform and interpret the results of correlational analysis using SPSS for interval, ordinal and categorical data.