Tutorial - Linear correlation

Set-up

Assumed knowledge

  • Francis Section 3.1 "Relationships Between Metric Variables"
  • Francis Section 3.2 "Relationships Between Categorical Variables"
  • Francis Section 4.3 "Recoding Variables"

Downloads

Syntax and output

SPSS syntax and output for exercises in this tutorial:

Data files

SPSS data files for exercises in this tutorial:

Executable

Executable files for exercises in this tutorial:

Introduction

Introductory quiz

Types of correlation and level of measurement

General advice

A recommended strategy for tackling correlational analyses is:

  1. Determine the level of measurement for each variable
  2. Obtain univariate descriptive statistics and graphical displays for each variable. Pay attention to:
    • mis-entered data
    • frequency, central tendency, distribution
  3. Recode as necessary
  4. Obtain a bivariate graph (e.g., clustered bar graph or scatterplot)
  5. Create tables (e.g., crosstabs with separate tables for row and column %s) and relevant correlational statistics
  6. Interpret/conclude

Types of analyses

Nominal by nominal

About

  • Explore relationship between two nominal variables
  • Data file: qfsall_2.sav
  • Statistical tasks: Frequency table, chi-square, and Phi or Cramer's V
  • Statistical technique: Crosstabs (also known as contingency tables), chi-square (χ²; as an inferential test), and Phi (φ) or Cramer's V (as correlations or effect sizes) are used for analysing the degree of relationship (or dependence) between two nominal variables.
  • Phi (φ) is used for 2 x 2, 2 x 3 or 3 x 2 tables
  • Cramer's V is used for >= 3 x 3 tables
  • These are non-parametric tests which do not rely much on assumptions about distribution. But you should check that there is a minimum expected frequency of 5 per cell. You can obtain the expected frequencies via Descriptives - Crosstabs - Cells - Expected. If the minimum expected frequency per cell is less than 5, you should recode the data into fewer categories.
  • Note that the sign (+ or -) of Phi (φ) doesn't mean much because there is no meaningful order to the way the variables are coded.
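
The expected-frequency check and the effect sizes described above can also be illustrated outside SPSS. The following is a minimal Python sketch, not the tutorial's SPSS procedure: the 2 x 2 counts are hypothetical (they are not taken from qfsall_2.sav) and the function name `crosstab_stats` is made up for illustration. It computes the expected counts, the Pearson chi-square (without continuity correction) and Cramer's V, which for a 2 x 2 table equals the absolute value of Phi.

```python
import math

def crosstab_stats(table):
    """Return expected counts, Pearson chi-square and Cramer's V for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count for a cell = row total * column total / N
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    k = min(len(row_totals), len(col_totals))
    cramers_v = math.sqrt(chi2 / (n * (k - 1)))
    return expected, chi2, cramers_v

# Hypothetical counts (rows = group, columns = yes/no)
observed = [[30, 20], [10, 40]]
expected, chi2, v = crosstab_stats(observed)
min_expected = min(min(row) for row in expected)
print(f"minimum expected count = {min_expected:.1f}")
print(f"chi-square = {chi2:.2f}, Cramer's V = {v:.2f}")
```

If the minimum expected count printed here were below 5, the advice above (recode into fewer categories) would apply.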

Detailed solution

  1. Is there an association between Gender (a1) and Belief in God (b4)?
    • Check univariate frequencies and bar chart.
      • This should indicate that Belief in God has an extra category which needs to be removed (recoded as system missing).
    • Recode Belief in God (b4) into a new variable (b4r) for which the mis-entered data (the 3s) and the "sort of" responses (the 2s) are missing data. Add variable and value labels to b4r. Check frequencies and bar chart.
    • Crosstabs (a1 by b4r)
      • Statistics (Chi square - Phi and Cramers V)
      • Cells (Expected - Row and Column %s)
      • Adding all these extra statistics and cell statistics into one analysis can make the output confusing to interpret. To break it down, the demo syntax provides several different crosstabs commands, each asking for only one additional statistic or cell statistic.
    • Bar graph - clustered (with percentage on Y axis)
    • χ² (1, 127) = .10, p = .76; φ = .03 (small); there is no evidence of a relationship
  2. Is there an association between Snoring (b2) and Smoking (b3)?
    • Univariate frequencies, bar chart (Snoring) and histogram (Smoking).
    • Recode smoking from continuous to dichotomous - add labels - check frequencies and bar chart.
    • Crosstabs, with Statistics (chi square and phi/Cramers) and Cells (Expected and Row and Column %s)
    • Bar graph - clustered (with percentage on Y axis)
    • φ = -.21 and significant, p = .004, N = 188
    • Interpretation: Smokers are almost twice as likely to snore as non-smokers, but be careful with causality - this relationship could be due to a third variable (e.g., age?)
  3. Is there an association between favourite season (a8) and favourite sense (a13)?
    • Univariate frequencies and bar charts.
    • Recode favourite sense (a8) into a8r, changing the mis-entered data (0) and other (6) to system missing - add labels - check frequencies and bar chart.
    • Try out a "stacked area graph". (Graph - Area - Stacked)
    • Crosstabs (using the same options as suggested in the previous questions)
    • χ² (1, 188) = 8.07, p = .005; Cramer's V = .23 (small-moderate effect and significant)
    • Interpretation: There is a different profile of favourite senses, depending on favourite season (e.g., about 50% of Summer and Spring people are Visual people. Winter people, in contrast, tend to prefer Taste and Smell).
  4. Is there an association between type of household and whether or not the household has chickens?
    • The datafile (chickens.sav) contains (hypothetical) data for two categorical variables: resid (urban/rural) and chickens (yes/no).
    • Univariate frequencies and bar charts.
    • χ² (1, 90) = 18.34, p < .001; φ = -.45 (moderate effect and significant)
    • Interpretation: Rural households are twice as likely to own chickens compared to urban households.
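
For a 2 x 2 table, the magnitude of Phi can be recovered from the chi-square statistic and the sample size as |φ| = sqrt(χ² / N). The short Python sketch below is an illustration only, not part of the SPSS output, and the function name `phi_magnitude` is hypothetical. It applies the formula to the figures reported for questions 1 and 4 above.

```python
import math

def phi_magnitude(chi2, n):
    """|Phi| for a 2 x 2 table from the Pearson chi-square and sample size."""
    return math.sqrt(chi2 / n)

# Values reported in the solutions above
print(round(phi_magnitude(0.10, 127), 2))   # Gender by Belief in God: 0.03
print(round(phi_magnitude(18.34, 90), 2))   # Household type by chickens: 0.45
```

The formula gives only the magnitude. The sign reported by SPSS depends on how the two variables happen to be coded.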

Dichotomous by interval/ratio

  • Point biserial correlation (rpb)
  • Data file: qfsall_2.sav
  • The point biserial correlation (rpb) is for analysing the relationship between a dichotomous and an interval/ratio variable.
  • Simply compute the product-moment correlation.
  • Interpret taking into account the direction of coding for the dichotomous scale.
  • Note that the significance test for an rpb is equivalent to an independent samples t-test.
  1. What is the relationship between Gender (a1; dichotomous) and Australianness (b12; interval)?
    • Examine univariate frequencies and bar graphs
    • Draw three different types of graphs for this bivariate relationship:
      • Scatterplot - a1 and b12
        • Double-click the chart to go into chart editor
        • Double click on a data point and change to point bins
        • Add line of best fit ("add fit line to total")
      • Bar graph - with mean of Australianness (b12) on the Y-axis and Gender (a1) on the X-axis (category axis)
      • Error-bar graph - with Australianness (b12) as the (dependent) variable and Gender (a1) as the X-axis (category axis)
    • Correlate - bivariate - a1 and b12
      • rpb = -.04, p = .62, N = 189 (very small, slightly negative and non-significant, i.e., males (coded as 0) in the sample perceive themselves as very slightly more Australian than females (coded as 1), but this result could have come about by chance)
  2. What is the relationship between Belief in God (recoded to dichotomous - i.e., b4r) and number of Countries (b8) visited?
    • Scatterplot - chart options - change to point bins and add line of best fit ("add fit line to total")
      • There may be some outliers who believe in God and who have visited a lot of countries (e.g., over 20) who are "creating" the small, non-significant correlation.
    • Correlation - bivariate
    • rpb = -.10, p = .29, N = 127 (small, negative and non-significant, i.e., people who believe in God (coded 0) have travelled to more countries than people who don't believe in God (coded 1)). Note, however, this could be due to some outliers and/or a third variable, such as age.
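
The two general points made in this section (rpb is simply a product-moment correlation with a 0/1 variable, and its significance test is equivalent to an independent samples t-test) can be checked numerically. The Python sketch below uses small hypothetical data, not values from qfsall_2.sav, and the helper names are made up for illustration. It converts r to t with t = r * sqrt((n - 2) / (1 - r²)) and compares this with a pooled-variance t-test computed directly.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pooled_t(group0, group1):
    """Independent samples t (equal variances assumed), group1 minus group0."""
    n0, n1 = len(group0), len(group1)
    m0, m1 = sum(group0) / n0, sum(group1) / n1
    ss0 = sum((v - m0) ** 2 for v in group0)
    ss1 = sum((v - m1) ** 2 for v in group1)
    pooled_var = (ss0 + ss1) / (n0 + n1 - 2)
    return (m1 - m0) / math.sqrt(pooled_var * (1 / n0 + 1 / n1))

# Hypothetical data: a group coded 0/1 and a rating-scale score
group = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
score = [7, 6, 8, 5, 7, 6, 5, 4, 6, 5]

r_pb = pearson_r(group, score)
n = len(group)
t_from_r = r_pb * math.sqrt((n - 2) / (1 - r_pb ** 2))
t_direct = pooled_t(
    [s for g, s in zip(group, score) if g == 0],
    [s for g, s in zip(group, score) if g == 1],
)
print(f"r_pb = {r_pb:.3f}, t from r = {t_from_r:.3f}, t-test = {t_direct:.3f}")
```

A negative rpb here means the group coded 1 has the lower mean, which is why the direction of coding matters when interpreting the sign.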

Interval/ratio by Interval/ratio

  • Product-moment correlation
  • Pearson's correlation (or product-moment correlation) is for analysing the linear relationship between two continuous (or near-continuous, e.g., interval data with more than ~5 categories) variables
  1. What is the relationship between Australianness (b12) and Femininity/Masculinity (b13)? (interval by interval)
    • qfsall_2.sav
    • Scatterplot for Australianness and Femininity/Masculinity
      • Change datapoints to bins and include line of best fit ("add fit line to total")
    • Correlation - bivariate
    • r is .12, p = .100, N = 185
    • This is larger than the rpb between Gender and Australianness, but is still very small and non-significant.

Correlation guess

These exercises are designed to help you to develop your ability to estimate a product-moment correlation based on a scatterplot:

  1. Correlation Explore
    • explore 20 plots with .1 increments
  2. Correlation Guess
    • guess 20 plots with .1 increments
    • try to get 25 out of 50
  3. Guessing Correlations
    • 4 plot matching to correlations exercise
    • try to average over 75%
  4. Guess the Correlation
    • single plot, guess exact correlation
    • try to get within .1

Issues to explore

Effect of outliers

  • regressp.exe
    • Download and double-click on Regress Patch to launch
    • Continue - Explore the impact of an outlier
  • Write down the correlation for the initial distribution.
  • Drag the white point and click "Recalculate". Check the new correlation...and continue this process to experimentally work out:
  1. Where would you put the white dot to maximise the correlation?
    • as far to the ends of the line of best fit as possible (can go beyond the plot)
  2. Where would you put the white dot to minimise the correlation (or make it go in the opposite direction)?
    • Place the outlier as far as possible to the ends of a line which would run perpendicular to the line of best fit, crossing at the mean for X and the mean for Y
  3. Where would you put the white dot to not change the correlation?
    • On the mean for X and Y
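
The three answers above can also be checked numerically. The following Python sketch is an illustration that is independent of regressp.exe: the eight base points and the three added points are hypothetical, chosen so that the base data have means of 4.5 on both X and Y and a positive trend.

```python
import math

def pearson_r(points):
    """Product-moment correlation for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical base data with a positive trend
base = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5), (7, 8), (8, 7)]

r_base = pearson_r(base)
r_in_line = pearson_r(base + [(20, 20)])         # far out along the trend
r_perpendicular = pearson_r(base + [(20, -11)])  # far out across the trend
r_at_mean = pearson_r(base + [(4.5, 4.5)])       # at (mean X, mean Y)

print(f"base            r = {r_base:.3f}")
print(f"in-line outlier r = {r_in_line:.3f}")
print(f"perpendicular   r = {r_perpendicular:.3f}")
print(f"point at means  r = {r_at_mean:.3f}")
```

With these points, the in-line outlier raises r, the perpendicular outlier reverses its sign, and the point at the means leaves it unchanged.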

Correlations and non-linear distributions

  • Data file: xy.sav
  • Syntax:
    corr vars = x1 y1.
    corr vars = x1 y2.
    corr vars = x1 y3.
    corr vars = x2 y4.
    GRAPH /SCATTERPLOT(BIVAR)=x1 WITH y1.
    GRAPH /SCATTERPLOT(BIVAR)=x1 WITH y2.
    GRAPH /SCATTERPLOT(BIVAR)=x1 WITH y3.
    GRAPH /SCATTERPLOT(BIVAR)=x2 WITH y4.
  1. What do the linear correlations and bivariate scatterplots indicate about the relationship between the following pairs?
    • X1 by Y1
      • r = .82 is appropriate - a strong, linear relationship
    • X1 by Y2
      • r = .82 is somewhat accurate, but really the relationship is curvilinear
    • X1 by Y3
      • r = .82 is not appropriate - really there is a perfect linear relationship plus an outlier
    • X2 by Y4
      • r = .82 is not appropriate - there is a restricted range for x2 and an outlier
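
The pattern described above (four pairs that all give r = .82 despite very different scatterplots) matches the well-known Anscombe (1973) quartet. The Python sketch below uses the published quartet values. Whether xy.sav contains exactly these values is an assumption, so treat this as an illustration of the point rather than a reproduction of the data file.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Anscombe's quartet (ASSUMPTION: xy.sav may not hold these exact values)
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x2 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

pairs = {
    "x1 by y1": (x1, y1),
    "x1 by y2": (x1, y2),
    "x1 by y3": (x1, y3),
    "x2 by y4": (x2, y4),
}
results = {label: pearson_r(x, y) for label, (x, y) in pairs.items()}
for label, r in results.items():
    print(f"{label}: r = {r:.2f}")
```

The identical coefficients are the reason for always inspecting the scatterplot before interpreting r.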

Outliers and restricted range

  • aggr.sav
  • This dataset was collected by Bernd Heubeck (Division of Psychology, ANU), comparing a sample of 89 "normal" children, aged 8 to 14, from Western Sydney with a sample of 89 "clinic" children from the same area who had been referred to a children's psychiatric clinic. The variable clinic is coded 1 for normal children and 2 for clinic children. Separate aggressiveness ratings of the child were obtained independently from fathers (faggr) and mothers (maggr). The aggressiveness rating scale can range from 0 (low) to 40 (high).
  • Reading: Restricted range - Hyperstat
  1. To what extent do mothers' and fathers' Aggressive Behaviour ratings (maggr and faggr respectively) agree with one another for the normal sample? (i.e., what is the product-moment correlation and what % of variance does one variable explain in the other variable?)
    • Select only the normal cases (Data - Select Cases - If Clin = 1 - Filtered)
    • Scatterplot maggr by faggr
      • Chart editor - add bins for datapoints
    • Correlation - bivariate
    • r = .56, r² = .31 (i.e., 31%), p < .001, N = 89
  2. What happens if you remove the outliers (children in the normal sample with maggr and faggr ratings over 20)? Why?
    • To identify the cases which are outliers you can sort the data in descending order by maggr and faggr. Then manually delete the two cases (case 4 and 15 after sorting) from the normal sample with maggr and faggr ratings over 20 and re-run the scatterplot and correlations.
    • r = .38, r² = .14, p < .001, N = 87
    • r drops from .56 because the outliers were "in-line" with the line of best fit, and were therefore inflating the correlation. Outliers can also deflate the correlation to the extent that they lie perpendicular to the line of best fit. If you don't understand this, go back to exploring the effect of an outlier in regressp.exe.
  3. Now include the rest of the sample. Examine a new scatterplot, and compute r and r². What has happened? Why?
    • To regain the outliers, close the datafile without saving, then open the datafile.
    • r = .68, r² = .46, N = 178
    • r is greater than .38 and more than three times as much variance is now explained. The reason is range restriction in the normal sample, which tends to contain only low ratings. By including the clinic sample, we now have high aggressiveness data and no range restriction.
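
Range restriction can be demonstrated with a small constructed example. The Python sketch below uses hypothetical data, not aggr.sav: y follows x with a fixed amount of scatter, and the correlation is computed once over the full range of x and once over only the low values of x.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y tracks x with a constant +/-2 scatter
x = list(range(20))
y = [xi + (2 if xi % 2 == 0 else -2) for xi in x]

r_full = pearson_r(x, y)

# Keep only the low end of the x range (analogous to a low-scoring subsample)
low = [(xi, yi) for xi, yi in zip(x, y) if xi <= 5]
r_restricted = pearson_r([p[0] for p in low], [p[1] for p in low])

print(f"full range:       r = {r_full:.2f}, r squared = {r_full ** 2:.2f}")
print(f"restricted range: r = {r_restricted:.2f}, r squared = {r_restricted ** 2:.2f}")
```

The scatter around the trend is the same in both cases. Only the spread of x differs, and that alone lowers r in the restricted subsample.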
