Pearson Correlation

Hello All,

Today’s post is the assignment exercise for week 3 for the Coursera class on Data Visualization Tools from Wesleyan University.

The topic is as below:

Generate a correlation coefficient.

For this round of assignments, I’m using the same bikesharing program dataset I used for weeks 1 and 2. More details can be seen in the week 1 assignment post. The relationships we are going to test are: (a) month and count of casual renters, and (b) month and count of registered renters. So our hypotheses are as follows:

Analysis 1:

  1. Ho = No relationship between casual renters and month of the year.
  2. H1 = There is a significant relation between these two variables.

 

Analysis 2:

  1. Ho = No relationship between registered renters and month of the year.
  2. H1 = There is a significant relation between these two variables.

 

Procedure for Pearson Correlation test:

  1. The month variable (mnth) ranges between 1 and 12, while the average number of casual renters (variable = casual) on any given day ranges over [0,900]. The average number of total renters (variable = cnt) on any given day ranges over [0,4500].
  2. Since cnt, casual and mnth are all quantitative variables, we use the Pearson correlation test. The complete code is available at this link w3-pearson-correl-asgt . However, the essence of the correlation test analysis is:

/* Pearson correlations among the listed variables */
PROC CORR;
VAR cnt atemp temp mnth season casual;
RUN;
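
For readers without SAS, here is a minimal Python sketch of what a Pearson correlation computes. It is not the code used for this assignment. The function names `pearson_r` and `t_statistic` and the toy numbers are made up for illustration and are not taken from the bike sharing data. The t statistic is the usual one for testing Ho (no linear correlation), with n - 2 degrees of freedom.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def t_statistic(r, n):
    """t statistic for Ho: no linear correlation (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Hypothetical toy values, NOT taken from the bike sharing dataset
mnth = [1, 2, 3, 4, 5, 6]
casual = [10, 14, 15, 21, 22, 30]

r = pearson_r(mnth, casual)
t = t_statistic(r, len(mnth))
print(f"r = {r:.4f}, t = {t:.2f}")
```

Because the t statistic grows with the sample size, even a small r can come with a very small p-value when there are many observations.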

 

Results & Interpretation:

The complete results are also available in this pdf file: Pearson-correl-w3-results-graph Based on the output, the following conclusions can be inferred:

  1. p-value < 0.0001 for the correlation between month and number of casual renters. However, the Pearson correlation coefficient is only r = 0.1230, indicating a weak “linear” relationship. This is because the relation is curvilinear, as shown in the graph below (red trendline). Hence we can reject the null hypothesis and conclude there is a statistically significant relation between month and casual, even though its linear component is weak. Hence H1 is accepted for analysis 1.  Month-vs-renters
  2. Similarly, p-value < 0.0001 for the correlation between month and number of registered bike renters. Again, the relationship shows a wide bell shape (green trendline in the graph above), hence the Pearson coefficient is small (r = 0.2799). Hence we can reject the null hypothesis and conclude there is a statistically significant relation between month and registered. Hence H1 is accepted for analysis 2.
  3. Note, if we group by year (2011 and 2012), we see that the numbers have increased year on year for almost all months, as shown in graph below:

    scatter-plot-output-w3

    Increase in number of renters per year (0=2011=red, 1= 2012=blue)

  4. Note, to check that the program works correctly for linear relationships, consider the correlation between the variables “temp” and “atemp”: r = 0.9917, p < 0.0001, indicating a very high correlation and a strong dependency. This is expected because “atemp” is derived from “temp”.
  5. All these correlation coefficients are also shown below in the table. pearson-correlation-output2
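
To see why a bell-shaped relation produces a small Pearson coefficient, here is a small Python sketch (again, not the SAS code used for the assignment). The two series are invented: one rises and falls symmetrically around mid-year, the other is a straight line. The helper `pearson_r` is a hypothetical name defined only for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

months = list(range(1, 13))

# Hypothetical bell-shaped pattern peaking mid-year (not real rental counts)
bell = [-(m - 6.5) ** 2 for m in months]
# Hypothetical perfectly linear pattern for comparison
linear = [3 * m + 5 for m in months]

r_bell = pearson_r(months, bell)
r_linear = pearson_r(months, linear)
print(f"bell-shaped: r = {r_bell:.4f}")
print(f"linear:      r = {r_linear:.4f}")
```

The bell-shaped series depends completely on the month, yet its linear correlation is essentially zero because the rising and falling halves cancel out. In the real data the peak is not perfectly centered, which would explain coefficients that are small but not zero.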

Thank you for taking a look at my analysis. Please feel free to add any suggestions for improvement or other feedback in the comments section.

 

Chi – Square Test

Hello All,

Today’s post is the assignment exercise for week 2 for the Coursera class on Data Visualization Tools from Wesleyan University.

The topic is as below:

Run a Chi-Square Test of Independence. Analyze and interpret post hoc paired comparisons in instances where your original statistical test was significant, and you were examining more than two groups (i.e. more than two levels of a categorical, explanatory variable).

For this round of assignments, I’m using the same bikesharing program dataset I used for last week’s assignment. More details can be seen in last week’s assignment post.

The relationship we are going to test is between seasons and a categorized version of count of casual renters. So our hypothesis is as follows:

  1. Ho = No relationship between casual renters and season, i.e. both variables are independent.
  2. H1 = There is a significant relation between these two variables.

 

Procedure for chi-square test:

  1. First, convert the number of casual renters to a new categorical variable (casual_rng). I have done this by grouping the count into 5 sets according to the frequency distribution and user profiles.
    • If casual <245 users, casual_rng = 20, referred to as “sample20”
    • If casual between 245 and 559, casual_rng = 40. “sample40”
    • If casual between 560 and 844, casual_rng = 60. “sample60”
    • If casual between 845 and 1262, casual_rng = 80. “sample80”
    • If casual ≥ 1263, casual_rng = 100. “sample100”
  2. Find association between casual_rng and season, where “season” is another categorical variable with 4 levels:
    • 1:spring,
    • 2:summer,
    • 3:fall,
    • 4:winter
  3. We collapse this season variable to just 2 levels:
    • summer and fall = “warm”
    • winter and spring = “cold”
  4. Since there are 5 levels in casual_rng, we need to make 10 pairwise comparisons (5 × 4 / 2). So the Bonferroni adjusted significance threshold for the p-value = 0.05 / 10 = 0.005.
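
The recoding and the Bonferroni arithmetic above can be sketched in Python as follows. This is an illustration only, not the SAS code used for the assignment. The function names `casual_rng` and `season_group` are made up, the overall significance level of 0.05 is an assumption (it is what 0.005 across 10 comparisons implies), and a count of exactly 1263 is assumed to fall in the top bin.

```python
from math import comb

def casual_rng(casual):
    """Map a casual renter count to one of the five bins listed above."""
    if casual < 245:
        return 20
    if casual <= 559:
        return 40
    if casual <= 844:
        return 60
    if casual <= 1262:
        return 80
    return 100  # assumption: 1263 and above go to the top bin

def season_group(season):
    """Collapse season codes: summer (2) and fall (3) are warm, the rest cold."""
    return "warm" if season in (2, 3) else "cold"

n_levels = 5
n_comparisons = comb(n_levels, 2)     # number of distinct pairs of levels
adjusted_alpha = 0.05 / n_comparisons # assumed overall alpha of 0.05

print(casual_rng(300), season_group(3))
print(n_comparisons, adjusted_alpha)
```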

 

Program Code:

The complete code is available at this link w2-asgt-code . However, the essence of the CHI-SQUARE analysis is given below:

/* Cross-tabulate season by casual_rng and request the chi-square test */
PROC FREQ;
TABLES season*casual_rng / CHISQ;
RUN;
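
For intuition about what the chi-square test of independence computes, here is a minimal Python sketch. It is not the SAS procedure itself, and the counts are invented for illustration. The function name `chi_square` is hypothetical. The statistic compares each observed cell count with the count expected if the row and column variables were independent.

```python
def chi_square(table):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts, NOT from the bike sharing data:
# rows = season group (cold, warm), columns = the five casual_rng bins
observed = [
    [40, 30, 15, 10, 5],
    [10, 20, 35, 40, 45],
]

stat, dof = chi_square(observed)
print(f"chi-square = {stat:.2f} with {dof} degrees of freedom")
```

A larger statistic for the same degrees of freedom means the observed counts sit further from what independence would predict.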

 

Results & Interpretation:

The complete results are also available in this pdf file: DV_w2_chi_sq-asgt_results. Based on the data, the following conclusions can be inferred:

  1. The p-value in the chi-square results table for season versus count of casual renters is < 0.0001. Hence we can reject the null hypothesis and conclude there is a significant relation between these two variables. Thus, the alternate hypothesis is accepted.
  2. Based on subset sampling for the post-hoc tests, we see that there are 6 pairs of groups that show significant differences, as seen by p-values less than 0.005 (based on the Bonferroni adjustment).
  3. The “sample20” users rented mostly in the cold seasons (89%) compared to the “sample60”, “sample80” and “sample100” users, who rented mostly in the warmer seasons (67%, 74% and 80% respectively). p-values < 0.0001 for all comparisons.  chi-sq-anly1-imag

    season-vs-casual-renters.jpg

    Graphical representation showing which sample groups prefer renting bikes in winter.

  4. Similarly, “sample40” renters also showed a preference for renting in the colder seasons, with 78% borrowing bikes then. Compare this to the “sample60”, “sample80” and “sample100” users, who rented mostly in the warmer seasons (again 67%, 74% and 80% respectively). p-values < 0.0001 for all comparisons. chi-sq-anly2-imag
  5. Naturally, there were no significant differences between sample20 and sample40; the p-values were greater than 0.005.
  6. There were no significant differences between sample60, sample80 and sample100.
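
The post hoc logic, where each pair of groups is tested on its own 2x2 table and compared against the adjusted threshold, can be sketched in Python as below. This is only an illustration. The counts assume 100 days per group, which is hypothetical and chosen merely to echo the percentages quoted above. The function name `chi_square_2x2` is made up, and no continuity correction is applied. For 1 degree of freedom the chi-square p-value can be obtained from the complementary error function.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Chi-square statistic and p-value for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

BONFERRONI_ALPHA = 0.05 / 10  # 10 pairwise comparisons, assumed overall alpha 0.05

# Hypothetical counts (cold, warm) for two groups at a time, 100 days per group
pairs = {
    ("sample20", "sample60"): (89, 11, 33, 67),
    ("sample60", "sample80"): (33, 67, 26, 74),
}

results = {}
for (g1, g2), counts in pairs.items():
    stat, p = chi_square_2x2(*counts)
    results[(g1, g2)] = (stat, p)
    verdict = "significant" if p < BONFERRONI_ALPHA else "not significant"
    print(f"{g1} vs {g2}: chi-square = {stat:.2f}, {verdict}")
```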

 

Thank you for taking a look at my analysis. Please feel free to add any suggestions for improvement or other feedback in the comments section.

 

ANOVA testing – Week1 Assignment

Hello All,

Today’s post is the weekly assignment for another Coursera venture: Data Visualization Tools from Wesleyan University.

The topic is as below:

Run an analysis of variance (ANOVA) with a quantitative response variable and a categorical explanatory variable. Run a post hoc test if the categorical variable has greater than two levels.

For this round of assignments, I decided to use a dataset I’ve already been playing with, rather than the ones provided by the university. (Note: the OOL dataset from the previous course has very few quantitative variables with which to do a meaningful analysis.)

Dataset details:

  • Bike sharing program data from the Univ of Porto, with 17389 instances and 16 attributes. Dataset Link is here.
  • Values for count of registered, casual and total bike rental users are provided based on month, season, weather and hour of day.

Procedure:

I have chosen to perform two sets of ANOVA analyses using SAS programming:

  1. Relationship between number of casual renters and season. (analysis 1)
    • Categorical variable = “season” with 4 levels,
      • 1:spring,
      • 2:summer,
      • 3:fall,
      • 4:winter.
    • Quantitative variable = “casual” i.e. number of unregistered members or casual renters, ranging from 2-3410.
  2. Relationship of total bike renters based on weather.  (analysis 2)
    • Categorical variable = “weathersit”, with 3 levels appearing in this analysis,
      • 1: Clear or partially cloudy, referred to henceforth as “sunny”
      • 2: Misty and cloudy, referred to as “cloudy”
      • 3: Light snow or heavy rain, referred to as “harsh”.
    • Quantitative variable = “cnt” i.e. total user count, ranging from 22-8714.

 

Program Code:

The complete code is available at this link w1-asgt-code . However, the essence of the ANOVA analysis is given below:

For analysis 1:

PROC ANOVA;
CLASS season;
MODEL casual=season;
MEANS season/duncan;
RUN;

For analysis 2:

PROC ANOVA;
CLASS weathersit;
MODEL cnt=weathersit;
MEANS weathersit/duncan;
RUN;

 

Results & Interpretation:

The complete results are also available in this pdf file: DV_w1-results (2)

Analysis for relation 1:

  1. The ANOVA revealed that significantly more casual users rented bikes during fall (Mean=1202.61) compared to winter (Mean=729.11) and spring (Mean=334.93). There was not much of a difference between fall and summer (Mean=1106.10).
  2. F = 80.80 with 3 numerator degrees of freedom, p<.0001. In this example 80.80 is the actual F value from the ANOVA table, and the p value is so small that it is reported simply as <.0001.
  3. There was no statistically significant difference between the summer and fall counts of casual users.
  4. Also, the count of casual bike renters was dramatically lower in spring and winter, which is logical since winters are harsh seasons with snow and unsuitable weather conditions.

Analysis for relation 2:

  1. The ANOVA revealed that significantly more users (both casual and registered) rented bikes during sunny (mean = 4876.78) and cloudy weather (mean = 4035.86), as compared to harsh weather (mean = 1803.28).
  2. F = 40.07 with 2 numerator degrees of freedom, p<.0001. In this example 40.07 is the actual F value from the ANOVA table, and the p value is again small enough to be listed as <.0001.
  3. The mean counts for the three weather conditions all differed significantly from one another.

Graphical interpretation:

For easier and more intuitive understanding, graphical results of the two analyses are added below:

Season-vs-Casual_users

Casual renters versus seasons (spring, summer, fall and winter)

Weather-vs-Total_users

Average number of Total renters based on weather (sunny, cloudy or harsh)

Thank you for taking a look at my analysis. Please feel free to add any suggestions for improvement or other feedback in the comments section.