11.3 Test of Independence
Chi-square test for independence
The Chi-square test for independence is used to determine whether there is an association between two factors or variables. Observed data values (frequencies) are displayed in a contingency table (also called a two-way table or crosstab) that shows the frequency distributions of the variables. One variable is represented in the rows and the other in the columns.
EXAMPLE
The following table is a contingency table that shows fictional data for a study of speeding violations and drivers who use cell phones:
Cell phone user↓ | Speeding violation in the last year: YES | Speeding violation in the last year: NO | Total |
---|---|---|---|
YES | 25 | 280 | 305 |
NO | 45 | 405 | 450 |
Total | 70 | 685 | 755 |
What variable is represented in the columns? In the rows?
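SOLUTION: The columns represent whether the driver had a speeding violation in the last year (YES/NO). The rows represent whether the driver is a cell phone user (YES/NO).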
Review & Practice
Review and Practice: Two-way frequency tables @KhanAcademy
There are several practice exercises following video discussions on the linked KhanAcademy site. Be sure to practice them all as necessary.
The Chi-square test of independence is a nonparametric test based on observed frequencies. The test does not use population parameters in determining the test statistic, and it does not require the population to follow a particular distribution.
When testing for independence between two variables, we compare our observed data frequencies with the frequencies that we’d expect if the two factors were indeed independent. To find the expected frequencies/counts, we assume that the two factors are independent (this assumption is our null hypothesis). If the observed data vary significantly from what we’d expect under the null assumption of independence, then we reject the null hypothesis.
We can use the Multiplication Principle for Independent Events to compute the expected frequency for each cell of the contingency table. Using probability rules, we can derive the formula to compute the expected count: \[\text{Expected Frequency} = \frac{(\text{Row total})(\text{Column total}) }{\text{Grand total}}\]
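One way to see where this formula comes from, using notation introduced here only for illustration ([latex]R_i[/latex] for the total of row [latex]i[/latex], [latex]C_j[/latex] for the total of column [latex]j[/latex], and [latex]n[/latex] for the grand total): if the two variables are independent, the probability of landing in a cell is the product of its row probability and its column probability, each estimated from the sample totals. Multiplying that probability by [latex]n[/latex] gives the expected count:
\[E_{ij} = n\cdot\frac{R_i}{n}\cdot\frac{C_j}{n}=\frac{R_i\,C_j}{n}\]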
If we wanted to test whether speeding violations are independent of drivers’ cell phone usage, we’d start by computing the expected count for each data cell. For the sample data in the table above, the expected count for cell phone users with a speeding violation is: \[\text{Expected Frequency} = \frac{(\text{Row total})(\text{Column total}) }{\text{Grand total}}=\frac{(305)(70)}{755}=28.2781457 \] This tells us that if speeding violations are independent of drivers’ cell phone usage, then we would expect to see about [latex]28.28[/latex] cell phone users with speeding violations.
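The same calculation can be repeated for every cell of the table. The following is a minimal Python sketch, assuming the observed counts are stored as a list of rows with the totals left out. The function name `expected_counts` is chosen here for illustration and does not come from any calculator used in this course.

```python
def expected_counts(observed):
    """Expected count for each cell: (row total)(column total) / (grand total)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Rows: cell phone user YES / NO; columns: speeding violation YES / NO.
observed = [[25, 280], [45, 405]]
expected = expected_counts(observed)

for row in expected:
    print([round(e, 2) for e in row])
```

The expected counts in each row and column add up to the same totals as the observed counts, which is a quick way to check the arithmetic.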
After calculating the expected frequencies for each of the data cells in the contingency table, we can calculate the contribution to the Chi-square statistic from each data cell by dividing the square of the difference between observed and expected frequencies by the expected frequency:
Contribution to [latex]\chi^2[/latex] test statistic = [latex]\dfrac{(\text{Observed} -\text{Expected} )^2}{\text{Expected}}[/latex]
We add the contributions from all data cells in the table to arrive at the Chi-square test statistic. Note that the row and column totals are summaries, not data, so they do not contribute to the statistic.
Test statistic, [latex]\chi^2=\sum{\dfrac{(O-E)^2}{E}}[/latex]. The [latex]\chi^2[/latex] test statistic grows as the differences between observed and expected frequencies grow. We reject the null hypothesis when the test statistic is large enough to fall far out in the right tail of the chi-square curve, so the test of independence is always a right-tailed test. On the other hand, if the observed frequencies are similar to what we’d expect if the variables were independent, then each cell contributes only a small amount toward the [latex]\chi^2[/latex] test statistic and we fail to reject the null hypothesis (that the variables are independent). The degrees of freedom for the test of independence are given by:
df = (#rows – 1)(#columns – 1)
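As a rough illustration of the test statistic and degrees of freedom, the sketch below applies both formulas to the cell phone and speeding table. It assumes the counts are entered as a list of rows without totals. The function name `chi_square_statistic` is hypothetical.

```python
def chi_square_statistic(observed):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total  # expected count
            statistic += (o - e) ** 2 / e  # this cell's contribution

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df


# Rows: cell phone user YES / NO; columns: speeding violation YES / NO.
observed = [[25, 280], [45, 405]]
statistic, df = chi_square_statistic(observed)
print(round(statistic, 2), df)
```

Only the four data cells enter the sum; the totals are used to compute the expected counts but are never treated as observations.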
Requirements for Test of Independence
- The sample is a simple random sample (SRS)
- The expected count (frequency) for each cell must be at least 5 in order to use this test
- Sample data are frequencies (counts) from two categorical variables
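The expected-count requirement is the one that can be checked directly from the table. A minimal sketch, assuming the same list-of-rows layout without totals; the function name is hypothetical, and the second table is made-up data included only to show a failing case:

```python
def meets_expected_count_requirement(observed, minimum=5):
    """True if every expected count is at least `minimum`."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return all(
        r * c / grand_total >= minimum
        for r in row_totals
        for c in col_totals
    )


cell_phone_table = [[25, 280], [45, 405]]
small_table = [[2, 3], [4, 40]]  # made-up counts, for illustration only

print(meets_expected_count_requirement(cell_phone_table))
print(meets_expected_count_requirement(small_table))
```

Note that the check is on the expected counts, not the observed counts.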
Note that for the Chi-square test of independence:
- The NULL hypothesis is always the statement that the two factors/variables (whatever they are) are independent.
Ho: The two variables (factors) are independent.
- The ALTERNATIVE hypothesis always states that the two factors/variables are dependent.
Ha: The two variables (factors) are dependent.
Steps for Hypothesis Testing of Independence
STEP 1: Write the claim (or what’s being tested) in words
STEP 2: Write the opposite of the claim using mathematical symbols
STEP 3: Write the null and alternative hypothesis (Remember that Ho is the statement about independence)
STEP 4: Identify the level of significance, α. If α is not given, assume it to be 5% or 0.05. The test will always be a right-tailed test.
STEP 5: Calculate the expected frequencies (counts) for each cell in the contingency table. Check to make sure that all expected counts are at least 5.
STEP 6: Calculate the contribution from each cell to the Chi-square statistic. Add up the contributions from all cells to arrive at the Chi-square test statistic.
STEP 7: Draw the Chi-square graph for degrees of freedom, df = (#rows – 1)(#columns – 1). Plot the Chi-square test statistic on this graph. Compute the p-value (the area of the right tail from the Chi-square test statistic).
STEP 8: Make the decision by comparing p-value with α.
» If p-value < α ⇒ Reject Ho ⇒ Test results are statistically significant and the sample data provide enough evidence to support Ha
» If p-value ≥ α ⇒ Fail to reject Ho ⇒ There’s not enough evidence to support Ha
STEP 9: Interpret your decision in the context of original claim.
Note that we’ll use calculators to do most of the heavy lifting on calculations. STEPS 6 and 7 will entirely be done on a calculator. Some calculators such as SUBEDI CALC will also allow you to calculate the expected frequencies in STEP 5.
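For readers who want to see what the calculator is doing, STEPS 5 through 8 can be sketched in plain Python. This is an illustration under stated assumptions, not the method used by any particular calculator: the function names `chi2_right_tail` and `independence_test` are hypothetical, the table is entered as a list of rows without totals, and the right-tail area uses the standard closed-form expressions for whole-number degrees of freedom.

```python
import math


def chi2_right_tail(x, df):
    """Area to the right of x under the chi-square curve, for integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^k / k! for k = 0 .. df/2 - 1
        term = total = 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite correction sum
    term, total = 1.0, 0.0
    for k in range(1, (df - 1) // 2 + 1):
        if k > 1:
            term *= x / (2 * k - 1)
        total += term
    tail = math.sqrt(2 * x / math.pi) * math.exp(-half) * total
    return math.erfc(math.sqrt(half)) + tail


def independence_test(observed, alpha=0.05):
    """Run STEPS 5-8 on a table of observed counts (totals excluded)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    # STEP 5: expected counts, and the "at least 5" requirement
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    if min(min(row) for row in expected) < 5:
        raise ValueError("an expected count is below 5")

    # STEP 6: add up each cell's contribution
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

    # STEP 7: degrees of freedom and right-tail p-value
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    p_value = chi2_right_tail(statistic, df)

    # STEP 8: compare the p-value with alpha
    return {"statistic": statistic, "df": df, "p_value": p_value,
            "reject_null": p_value < alpha}


# Cell phone use (rows) by speeding violation (columns)
result = independence_test([[25, 280], [45, 405]])
print(result)
```

STEPS 1 through 4 and STEP 9 are left out on purpose, since stating the hypotheses and interpreting the decision in context cannot be automated.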
KEEP IN MIND…
When working on assignments, be on the lookout for certain keywords that point to a test of independence. Some examples:
- Is the number of hours volunteered independent of the type of volunteer?
- Is the number of hours volunteered related to the type of volunteer?
- Is the number of hours volunteered associated with the type of volunteer?
- Does the number of hours volunteered depend on the type of volunteer?
- Is there a relationship between number of hours volunteered and the type of volunteer?
Additional Resources
Review: Chi-square test for association (independence)
When completing the practice exercises you may skip the ones related to the test for homogeneity.
EXAMPLE
The human papillomavirus (HPV) is associated with the development of cervical cancer, genital warts, and some less common cancers. Johns Hopkins conducted a study to investigate the characteristics of female patients who came to their clinics between 2006 and 2008 to begin the three-shot regimen of vaccinations with the anti-HPV medication Gardasil. Conduct a Chi-square test for independence to determine if the association between race and vaccine completion is statistically significant. A contingency table summarizing the vaccine completions by race for 1152 patients is provided below.
Race↓ | Vaccine Completion: No | Vaccine Completion: Yes | Total |
---|---|---|---|
Black | 269 | 118 | 387 |
Hispanic | 28 | 18 | 46 |
White | 455 | 264 | 719 |
Total | 752 | 400 | 1152 |
SHOW SOLUTION
Chi-Square Test of Independence
STEP 1: For this test, the two factors/variables being studied are: Race and Vaccine Completion.
Claim to test: There is an association between race and vaccine completion. The opposite would be that there is no association.
Setup Null and Alternative Hypothesis
For a Chi-square test of independence, the null hypothesis is that race and vaccine completion are independent whereas the alternative hypothesis states that the two are dependent.
STEP 2: [latex]H_o:[/latex] Race and vaccine completion are independent (OR There is no association between race and vaccine completion)
STEP 3: [latex]H_a:[/latex] Race and vaccine completion are dependent (OR There is an association between race and vaccine completion)
STEP 4: Since [latex]\alpha[/latex] is not given, assume it to be 5% or 0.05.
STEP 5: Calculate Expected Frequencies. Round to three decimal places.
This result is shown on SUBEDI Calculator (see below) as well as TI-83/84 Calculators
Expected Frequencies | No | Yes |
---|---|---|
Black | [latex]\dfrac{387\cdot752}{1152}=252.625[/latex] | [latex]\dfrac{387\cdot400}{1152}=134.375[/latex] |
Hispanic | [latex]\dfrac{46\cdot752}{1152}=30.028[/latex] | [latex]\dfrac{46\cdot400}{1152}=15.972[/latex] |
White | [latex]\dfrac{719\cdot752}{1152}=469.347[/latex] | [latex]\dfrac{719\cdot400}{1152}=249.653[/latex] |
Results from STEPS 6 and 7 can be obtained using a calculator. See below.
STEP 6: Calculate the contribution from each cell to the Chi-square statistic.
[latex]\chi^2[/latex] Contribution | No | Yes |
---|---|---|
Black | [latex]\dfrac{(269-252.625)^2}{252.625}[/latex] | [latex]\dfrac{(118-134.375)^2}{134.375}[/latex] |
Hispanic | [latex]\dfrac{(28-30.028)^2}{30.028}[/latex] | [latex]\dfrac{(18-15.972)^2}{15.972}[/latex] |
White | [latex]\dfrac{(455-469.347)^2}{469.347}[/latex] | [latex]\dfrac{(264-249.653)^2}{249.653}[/latex] |
After evaluating each of the Chi-square contributions in the table cells and summing them up, we get:
[latex]\chi^2[/latex] test statistic [latex]=4.7143471148634[/latex].
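The cell-by-cell contributions can be checked with a short script. This is only a sketch for verifying the arithmetic, assuming the observed counts are typed in from the contingency table above; the variable names are chosen for illustration.

```python
# Observed counts from the example: [No, Yes] for each race
observed = {
    "Black": [269, 118],
    "Hispanic": [28, 18],
    "White": [455, 264],
}

col_totals = [sum(counts[j] for counts in observed.values()) for j in range(2)]
grand_total = sum(col_totals)

chi_square = 0.0
for race, counts in observed.items():
    row_total = sum(counts)
    for o, col_total in zip(counts, col_totals):
        e = row_total * col_total / grand_total  # expected count
        contribution = (o - e) ** 2 / e
        chi_square += contribution
        print(race, round(e, 3), round(contribution, 3))

print("chi-square:", round(chi_square, 3))
```

The script works with unrounded expected counts, so individual contributions can differ slightly in the last decimal place from values computed with the rounded table entries.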
STEP 7: Draw the Chi-square graph. Plot the Chi-Square statistic. Compute p-value.
Degrees of freedom, [latex]df=[/latex](#rows – 1)(#columns – 1). Since there are 3 rows and 2 columns, the degrees of freedom, [latex]df = (3-1)(2-1) = 2[/latex]
Click “edit graph on Desmos” below the graph to open Desmos. On the Desmos page, in entry box #3 where it shows Dfreedom = 5, use the slider or type in the value of your degrees of freedom in place of the 5. On the next line, change the value of XChiSquare from the default value to the value of your test statistic, [latex]4.714[/latex]. The p-value for our test will be shown on the next line. The p-value should be [latex]0.094687474041118[/latex] (make sure to round as needed).
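For this particular example the p-value can also be checked by hand, because with exactly 2 degrees of freedom the right-tail area under the chi-square curve simplifies to [latex]e^{-\chi^2/2}[/latex]. The sketch below relies on that special case only; it does not apply to other degrees of freedom.

```python
import math

chi_square = 4.7143471148634  # test statistic from STEP 6
df = 2

# For df = 2 only, the right-tail area has the closed form exp(-x / 2).
p_value = math.exp(-chi_square / 2)
print(round(p_value, 4))
```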
STEPS 6 and 7: Using Online Calculator (Recommended)
SUBEDI Calculator
Go to Chi-Square Tests @ rsubedi.com
Chi-Square Test Type
Sample Data
Type in data or paste clipboard content copied from spreadsheet applications such as Excel, Google Sheets, etc. If entering data manually, add rows/columns as necessary using add/remove buttons.
Row | Category 1 | Category 2 |
---|---|---|
First Row | 269 | 118 |
Second Row | 28 | 18 |
Third Row | 455 | 264 |
CALCULATE
Results show in a panel below the calculate button. Test statistic ([latex]\chi^2[/latex]), p-value, [latex]df[/latex], and expected frequencies are displayed.
LibreText Calculator
Go to: Test for Independence from the list of online calculators
Enter the following values and press Calculate.
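Row | A | B |
---|---|---|
First | 269 | 118 |
Second | 28 | 18 |
Third | 455 | 264 |
CALCULATE
The results displayed should agree with the values computed above: [latex]\chi^2\approx 4.714[/latex], [latex]df=2[/latex], and p-value [latex]\approx 0.0947[/latex].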
STEP 8: Make the decision by comparing p-value with α.
Since the p-value of [latex]0.0947[/latex] is greater than [latex]\alpha = 0.05[/latex], we fail to reject the null hypothesis.
STEP 9: Interpret your decision in the context of original claim.
The p-value indicates that the differences in the vaccination completion rates by race are not statistically significant. There is insufficient evidence to conclude that race and vaccine completion are related.
Comparison of the Chi-Square Tests
Section 11.5 of OpenStax textbook. Not officially covered but useful to know.
In our textbook, you will see the [latex]\chi^2[/latex] test statistic used in three different circumstances. The following bulleted list is a summary that will help you decide which [latex]\chi^2[/latex] test is the appropriate one to use.
- NOT COVERED IN THIS CLASS
✗ Goodness-of-Fit: Use the goodness-of-fit test to decide whether a population with an unknown distribution “fits” a known distribution. In this case there will be a single qualitative survey question or a single outcome of an experiment from a single population. Goodness-of-Fit is typically used to see if the population is uniform (all outcomes occur with equal frequency), the population is normal, or the population is the same as another population with a known distribution. The null and alternative hypotheses are:
H0: The population fits the given distribution.
Ha: The population does not fit the given distribution.
- COVERED
✓ Independence: Use the test for independence to decide whether two variables (factors) are independent or dependent. In this case there will be two qualitative survey questions or experiments and a contingency table will be constructed. The goal is to see if the two variables are unrelated (independent) or related (dependent). The null and alternative hypotheses are:
H0: The two variables (factors) are independent.
Ha: The two variables (factors) are dependent.
- NOT COVERED IN THIS CLASS
✗ Homogeneity: Use the test for homogeneity to decide if two populations with unknown distributions have the same distribution as each other. In this case there will be a single qualitative survey question or experiment given to two different populations. The null and alternative hypotheses are:
H0: The two populations follow the same distribution.
Ha: The two populations have different distributions.
Practice