Hello All,
Today’s post is the assignment exercise for week 3 for the Coursera class on Data Visualization Tools from Wesleyan University.
The topic is as below:
Generate a correlation coefficient.
For this round of assignments, I’m using the same bikesharing program dataset I used for weeks 1 and 2. More details can be seen in the week 1 assignment post. The relationships we are going to test are: (a) month and count of casual renters, and (b) month and count of registered renters. So our hypotheses are as follows:
Analysis 1:
- Ho = No relationship between casual renters and month of the year.
- H1 = There is a significant relation between these two variables.
Analysis 2:
- Ho = No relationship between registered renters and month of the year.
- H1 = There is a significant relation between these two variables.
Procedure for Pearson Correlation test:
- The month variable (mnth) ranges between 1 and 12, while the average number of casual renters (variable = casual) on any given day ranges between [0,900]. The average number of total renters (variable = cnt) on any given day ranges between [0,4500].
- Since cnt, casual and mnth are all quantitative variables, we use the Pearson correlation test. The complete code is available at this link: w3-pearson-correl-asgt. However, the essence of the correlation test analysis is:
proc corr;
var cnt atemp temp mnth season casual;
run;
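For readers who want to see what the correlation procedure computes, here is a minimal Python sketch of the Pearson coefficient itself. It is not the SAS code used in this post; the function name `pearson_r` and the sample values are made up for illustration and do not come from the bikesharing dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative values, NOT taken from the bikesharing dataset.
temp = [0.2, 0.3, 0.4, 0.5, 0.6]
atemp = [0.21, 0.29, 0.41, 0.5, 0.58]
r = pearson_r(temp, atemp)
print(f"r = {r:.4f}")
```

The coefficient only measures how well a straight line describes the relationship, which matters for the interpretation that follows.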
Results & Interpretation:
The complete results are also available in this pdf file: Pearson-correl-w3-results-graph. Based on the output, the following conclusions can be drawn:
- p-value < 0.0001 for the correlation between month and number of casual renters, so the relationship is statistically significant. However, the Pearson correlation coefficient r = 0.1230 indicates only a weak “linear” relationship. This is because the relation is curvilinear, as shown by the red trendline in the graph. Hence we can reject the null hypothesis and conclude there is a significant relation between month and casual renters, even though its linear component is weak. Hence H1 is accepted for analysis 1.
- Similarly, p-value < 0.0001 for the correlation between month and number of registered bike renters. Again, the relationship shows a wide bell shape, as seen in the green trendline in the graph, hence the Pearson coefficient is small (r = 0.2799). Hence we can reject the null hypothesis and conclude there is a significant relation between month and registered renters. Hence H1 is accepted for analysis 2.
- Note, if we group by year (2011 and 2012), we see that the numbers have increased year on year for almost all months, as shown in the graph below:
- Note, to check that the procedure behaves as expected for linear relationships, consider the correlation between the variables “temp” and “atemp”: r = 0.9917, p < 0.0001, indicating a very high correlation and a strong dependency. This is expected because “atemp” is derived from “temp”.
- All these correlation coefficients are also shown below in the table.
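To illustrate why a bell-shaped seasonal pattern can give a small Pearson coefficient even when month clearly matters, here is a small Python sketch. It is not part of the SAS analysis; the helper `pearson_r` and both curves are hypothetical and are not values from the bikesharing dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

months = list(range(1, 13))

# Hypothetical curve that peaks exactly mid-year and is symmetric around it.
# The rider count is fully determined by the month, yet the rising and
# falling halves cancel out in the linear correlation.
symmetric = [100 - (m - 6.5) ** 2 for m in months]
r_symmetric = pearson_r(months, symmetric)

# Same hypothetical curve with the peak moved later in the year (month 8).
# The rising side is now longer than the falling side, so a positive
# linear component appears.
skewed = [100 - (m - 8) ** 2 for m in months]
r_skewed = pearson_r(months, skewed)

print(f"symmetric hump: r = {r_symmetric:.4f}")
print(f"skewed hump:    r = {r_skewed:.4f}")
```

In this sketch the symmetric hump gives a coefficient of essentially zero, while the skewed hump gives a clearly positive one. A small positive r on a bell-shaped curve is therefore consistent with a seasonal peak that falls somewhat after mid-year, and a plot is a better guide to the strength of such a relationship than r alone.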
Thank you for taking a look at my analysis. Please feel free to add any suggestions for improvement or other feedback in the comments section.