Pearson Correlation

Hello All,

Today’s post is the assignment exercise for week 3 for the Coursera class on Data Visualization Tools from Wesleyan University.

The topic is as below:

Generate a correlation coefficient.

For this round of assignments, I’m using the same bikesharing program dataset I used for weeks 1 and 2. More details can be found in the week 1 assignment post. The relationships we are going to test are: (a) month and count of casual renters, and (b) month and count of registered renters. So our hypotheses are as follows:

Analysis 1:

  1. H0 = No relationship between casual renters and month of the year.
  2. H1 = There is a significant relation between these two variables.

 

Analysis 2:

  1. H0 = No relationship between registered renters and month of the year.
  2. H1 = There is a significant relation between these two variables.

 

Procedure for Pearson Correlation test:

  1. The month variable (mnth) ranges between 1 and 12, while the average number of casual renters (variable = casual) on any given day ranges between [0,900]. The average number of total renters (variable = cnt) on any given day ranges between [0,4500].
  2. Since cnt, casual and mnth are all quantitative variables, we use the Pearson correlation test. The complete code is available at this link: w3-pearson-correl-asgt. However, the essence of the correlation analysis is:

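/* Pearson correlations for all pairs of the listed variables (uses the most recently created data set) */
proc corr;
  var cnt atemp temp mnth season casual;
run;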
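
Proc corr prints a p-value next to each coefficient. As a rough sketch of how that p-value depends on both r and the sample size, the Python snippet below (not part of the SAS program) computes the usual t statistic for testing a zero correlation, t = r * sqrt((n - 2) / (1 - r^2)). The function name t_statistic and the sample sizes are assumptions chosen for illustration; they are not taken from the bikesharing data.

```python
import math


def t_statistic(r, n):
    """t statistic for testing H0: no linear correlation, given r and n pairs."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if abs(r) >= 1:
        raise ValueError("r must be strictly between -1 and 1")
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical sample sizes, chosen only to show the effect of n on the statistic.
for n in (30, 1000, 10000):
    print(n, round(t_statistic(0.1230, n), 2))
```

The larger the statistic, the smaller the p-value, so with many observations even a small r can be statistically significant.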

 

Results & Interpretation:

The complete results are also available in this PDF file: Pearson-correl-w3-results-graph. Based on the output, the following conclusions can be drawn:

  1. p-value < 0.0001 for the correlation between month and number of casual renters, so we can reject the null hypothesis and conclude that there is a statistically significant relation between month and casual. Hence H1 is accepted for analysis 1. However, the Pearson correlation coefficient is only r = 0.1230, indicating a weak “linear” relationship. This is because the relation is curvilinear, as shown in the graph below (red trendline).

    Month-vs-renters

  2. Similarly, p-value < 0.0001 for the correlation between month and number of registered bike renters. Again, the relationship shows a wide bell shape (green trendline in the graph above), hence the Pearson coefficient is small (r = 0.2799). We can still reject the null hypothesis and conclude that there is a statistically significant, though not linear, relation between month and registered renters. Hence H1 is accepted for analysis 2.
  3. Note that if we group by year (2011 and 2012), the numbers have increased year on year for almost all months, as shown in the graph below:

    scatter-plot-output-w3

    Increase in number of renters per year (0 = 2011 = red, 1 = 2012 = blue)

  4. Note, to check that the program works correctly for linear relationships, consider the correlation between the variables “temp” and “atemp”: r = 0.9917, p < 0.0001, indicating a very strong linear correlation. This is expected because “atemp” is derived from “temp”.
  5. All these correlation coefficients are also shown in the table below. pearson-correlation-output2
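
To illustrate why a bell-shaped pattern gives a small Pearson coefficient even when the dependence on month is clear, here is a minimal Python sketch. It is not the SAS analysis above, and the two series are made-up numbers, not the bikesharing data; pearson_r is a hypothetical helper written out from the textbook formula.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


months = list(range(1, 13))

# Made-up series for illustration only (not the bikesharing counts).
linear = [10 + 3 * m for m in months]            # rises steadily with month
bell = [100 - (m - 6.5) ** 2 for m in months]    # symmetric peak in mid-year

print("linear:", pearson_r(months, linear))
print("bell:  ", pearson_r(months, bell))
```

The bell-shaped series depends on month just as strictly as the linear one, but its rise and fall cancel out in the linear measure, which is why a scatter plot is worth checking alongside r.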

Thank you for taking a look at my analysis. Please feel free to add any suggestions for improvement or other feedback in the comments section.

 
