Today’s post is the week 2 assignment for the Coursera class on Data Visualization Tools from Wesleyan University.
The topic is as below:
Run a Chi-Square Test of Independence. Analyze and interpret post hoc paired comparisons in instances where your original statistical test was significant, and you were examining more than two groups (i.e. more than two levels of a categorical, explanatory variable).
For this round of assignments, I’m using the same bikesharing program dataset I used for last week’s assignment. More details can be seen in last week’s assignment post.
The relationship we are going to test is between season and a categorized version of the count of casual renters. Our hypotheses are as follows:
- H0: There is no relationship between casual renters and season, i.e. the two variables are independent.
- H1: There is a significant relationship between these two variables.
Procedure for chi-square test:
- First, convert the number of casual renters to a new categorical variable (casual_rng). I have done this by grouping the count into 5 sets according to the frequency distribution and user profiles.
- If casual < 245, casual_rng = 20, referred to as “sample20”
- If casual is between 245 and 559, casual_rng = 40, referred to as “sample40”
- If casual is between 560 and 844, casual_rng = 60, referred to as “sample60”
- If casual is between 845 and 1262, casual_rng = 80, referred to as “sample80”
- If casual is 1263 or more, casual_rng = 100, referred to as “sample100”
- Find the association between casual_rng and season, where “season” is another categorical variable with 4 levels:
- We collapse this season variable to just 2 levels:
- summer and fall = “warm”
- winter and spring = “cold”
- Since there are 5 levels in casual_rng, we need to make 10 pairwise comparisons (5 × 4 / 2). So the Bonferroni-adjusted significance threshold is 0.05 / 10 = 0.005.
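The recoding steps above can be sketched in a few lines of Python. This is only an illustration and not the SAS code used for the assignment: the function names, the season labels as strings, and the sample rows are all hypothetical, and the sketch assumes a count of exactly 1263 belongs to the top group.

```python
def casual_range(casual: int) -> int:
    """Map a daily count of casual renters to its casual_rng code."""
    if casual < 245:
        return 20
    if casual <= 559:
        return 40
    if casual <= 844:
        return 60
    if casual <= 1262:
        return 80
    return 100  # assumption: 1263 and above all fall in the top group


# Collapse the 4 seasons into 2 levels (season labels as strings are an assumption).
SEASON_GROUP = {"summer": "warm", "fall": "warm", "winter": "cold", "spring": "cold"}


def season_group(season: str) -> str:
    """Return 'warm' or 'cold' for a season name, ignoring case."""
    return SEASON_GROUP[season.lower()]


# Hypothetical rows (casual count, season), not taken from the dataset.
rows = [(120, "winter"), (300, "spring"), (700, "summer"), (1000, "fall"), (1500, "summer")]
recoded = [(casual_range(count), season_group(season)) for count, season in rows]
print(recoded)
```

Each recoded pair is one observation for the contingency table of casual_rng by season group.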
The complete code is available at this link: w2-asgt-code. However, the essence of the chi-square analysis is given below:
PROC FREQ ;
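The SAS step is abbreviated here, so as a rough illustration of what a chi-square test of independence computes, here is a plain-Python sketch. It is not the code behind the results in this post: the function names are hypothetical, the table counts are made up, and the p-value helper only covers 1 or an even number of degrees of freedom (a 5 × 2 table has 4).

```python
import math


def chi_square_independence(table):
    """Return (Pearson chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def chi_square_p_value(stat, dof):
    """Upper-tail p-value, using closed forms for dof == 1 and even dof."""
    if dof == 1:
        return math.erfc(math.sqrt(stat / 2))
    if dof % 2 == 0:
        half = stat / 2
        term, total = 1.0, 1.0
        for k in range(1, dof // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    raise ValueError("this sketch only handles dof == 1 or even dof")


# Hypothetical counts: rows are cold/warm, columns are casual_rng 20..100.
toy_table = [[40, 30, 15, 10, 5],
             [10, 20, 35, 40, 45]]
stat, dof = chi_square_independence(toy_table)
print(stat, dof, chi_square_p_value(stat, dof))
```

A p-value below the chosen significance level leads to rejecting the null hypothesis of independence.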
Results & Interpretation:
The complete results are also available in this pdf file: DV_w2_chi_sq-asgt_results. Based on the data, the following conclusions can be inferred:
- The p-value in the chi-square results table for season versus count of casual renters is < 0.0001. Hence we can reject the null hypothesis and conclude that there is a significant association between these two variables, in favor of the alternate hypothesis.
- Based on subset sampling for the post-hoc tests, 6 of the 10 pairwise comparisons show significant differences, with p-values less than 0.005 (the Bonferroni-adjusted threshold).
- The “sample20” users rented mostly in the cold seasons (89%) compared to the “sample60”, “sample80” and “sample100” users, who rented mostly in the warmer seasons (67%, 74% and 80% respectively). The p-values were < 0.0001 for all three comparisons.
- Similarly, “sample40” renters also showed a preference for the colder seasons (78%), compared to the “sample60”, “sample80” and “sample100” users, who rented mostly in the warmer seasons (again 67%, 74% and 80% respectively). The p-values were < 0.0001 for all three comparisons.
- There was no significant difference between sample20 and sample40; the p-value was greater than 0.005.
- There were no significant differences among sample60, sample80 and sample100.
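The post-hoc procedure can be sketched in Python as follows. This is a hedged illustration, not the SAS code behind the results above: the function names are hypothetical, the counts are invented for the example, and it assumes the Pearson chi-square without a continuity correction on each 2 × 2 sub-table.

```python
import math
from itertools import combinations

ALPHA = 0.05  # overall significance level before the Bonferroni adjustment


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) and p-value for [[a, b], [c, d]]."""
    n = a + b + c + d
    # Shortcut formula for 2 x 2 tables; assumes no row or column total is zero.
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, math.erfc(math.sqrt(stat / 2))


def pairwise_posthoc(counts, alpha=ALPHA):
    """counts maps each level to (cold_days, warm_days).

    Returns the Bonferroni threshold and, per pair, (statistic, p-value, significant).
    """
    pairs = list(combinations(sorted(counts), 2))
    threshold = alpha / len(pairs)  # 0.05 / 10 = 0.005 for 5 levels
    results = {}
    for g1, g2 in pairs:
        stat, p = chi_square_2x2(*counts[g1], *counts[g2])
        results[(g1, g2)] = (stat, p, p < threshold)
    return threshold, results


# Hypothetical (cold, warm) day counts per casual_rng level, not the real data.
toy_counts = {20: (90, 10), 40: (80, 20), 60: (30, 70), 80: (25, 75), 100: (20, 80)}
threshold, results = pairwise_posthoc(toy_counts)
for pair, (stat, p, significant) in results.items():
    print(pair, round(stat, 2), significant)
```

Comparing each p-value to the adjusted threshold, rather than to 0.05, keeps the overall chance of a false positive across all 10 comparisons at roughly 5%.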
Thank you for taking a look at my analysis. Please feel free to add any suggestions for improvement or other feedback in the comments section.