Cheatsheet – Selecting Graphs for Statistical Analysis

June 23, 2016 / Anu Rajaram

One of the first steps in any statistical analysis, whether for hypothesis testing, predictive analytics or even a Kaggle competition, is to check the relationships between the variables, that is, to check whether a pattern exists.

Graphs are a fantastic visual way of identifying such relationships.

[Figure: sample graph created with Matplotlib]

However, numerous readers kept getting stuck while selecting graphs for categorical variables, and many friends asked if there was a standard rule for graph selection. With that in mind, below is a cheatsheet for graph selection covering both quantitative (numeric) and categorical (character values such as gender, disease type, etc.) variables.

No.	Axis1	Axis2	Chart type
1.	Single quant		Histogram, density plot, box plot
2.	Single categorical		Bar chart (freq/count), pie chart (freq/count/%)
3.	Categorical	Quant	Bar chart, pie chart, frequency table, line chart
4.	Quant	Quant	Scatterplot
5.	Categorical	Categorical	Stacked column chart, combination chart (typical bar chart with trendlines)
6.	2 categorical	Quant	Stacked or side-by-side bar charts, heat maps. Any basic graph, with color/shape coding for one of the categorical variables.
7.	1 categorical	2 Quant	Stacked or side-by-side bar charts, scatter plots. Any basic graph, with color/shape coding for the categorical variable.
8.	3+ variables of any type		Please check if you really need so many variables in a single graph. Side-by-side graphs may be a better option, or graphs with filters (if the programming language supports them).
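
For readers who like to keep such rules in code, the table can be sketched as a small lookup. This is only an illustration in Python: the names CHART_GUIDE, FALLBACK and suggest_charts are made up for this example, and the chart lists are copied from the table rather than from any plotting library.

```python
from collections import Counter

# Chart suggestions keyed by (number of categorical, number of quantitative) variables.
# The lists mirror the cheatsheet rows; nothing here draws a graph.
CHART_GUIDE = {
    (0, 1): ["histogram", "density plot", "box plot"],
    (1, 0): ["bar chart", "pie chart"],
    (1, 1): ["bar chart", "pie chart", "frequency table", "line chart"],
    (0, 2): ["scatterplot"],
    (2, 0): ["stacked column chart", "combination chart"],
    (2, 1): ["stacked or side-by-side bar charts", "heat map"],
    (1, 2): ["stacked or side-by-side bar charts", "scatter plot"],
}

# Last row of the cheatsheet: too many variables for one graph.
FALLBACK = ["side-by-side graphs", "graphs with filters"]


def suggest_charts(var_types):
    """Return chart suggestions for a list of 'cat' / 'quant' variable types."""
    if not var_types:
        raise ValueError("at least one variable is required")
    counts = Counter(var_types)
    unknown = set(counts) - {"cat", "quant"}
    if unknown:
        raise ValueError(f"unknown variable types: {sorted(unknown)}")
    key = (counts["cat"], counts["quant"])
    return CHART_GUIDE.get(key, FALLBACK)


if __name__ == "__main__":
    print(suggest_charts(["quant", "quant"]))
    print(suggest_charts(["cat", "cat", "quant"]))
```

Any combination that is not listed in the table falls through to the side-by-side suggestion from the last row.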

These are merely guidelines and are language-agnostic, so you may implement them in your choice of programming language (R, Python, SAS, MATLAB, etc.). However, if you prefer, code implementations in R and Python are provided in the links below:

Charts in R:
- Program1 for histograms, density plots, etc. and
- Program2 for heatmaps.
Charts in Python:
- This link contains code and images to create stunning graphs (box plots, histograms, heatmaps, bubble charts, etc.) using the Matplotlib library, like the one shown above.

Hope you find this cheatsheet useful! Feel free to share your thoughts and comments. Adieu!

25+ free datasets for Datascience projects

January 5, 2016, updated January 7, 2016 / Anu Rajaram

Here are the top 25 websites for gathering datasets to use in your data science projects in R, Python, SAS, Excel or any other programming language or statistical software. Best part: these are all free, free, free!

25 Free Datasets for DataScience & BigData Projects

Government and UN/World Bank websites:

US government database with 190k+ datasets – http://catalog.data.gov/dataset
UK government database with 25k+ datasets – https://data.gov.uk/data/search
Canada government database – http://open.canada.ca/data/en/dataset?q=education
FBI crime statistics – http://1.usa.gov/1LltHEQ
Centers for Disease Control and Prevention – http://wonder.cdc.gov/
Bureau of Labor Statistics – http://www.bls.gov/data/
NASA datasets – http://nssdc.gsfc.nasa.gov/
World Bank Data – http://datacatalog.worldbank.org/
UN database with 34 sets and 60 million records – http://data.un.org/
EU commission open data – https://open-data.europa.eu/en/data/
NIST – http://1.usa.gov/1JpmcNI
National Center for Education Statistics – http://1.usa.gov/1mAjH0A
U.S. National Epidemiological Survey on Alcohol and Related Conditions (NESARC) – dataset from survey to determine magnitude of alcohol use and psychiatric disorders in the U.S. population.
- dataset here.
- descriptive codebook here.

Academic websites:

Yelp academic data – https://www.yelp.com/academic_dataset
Univ of California, Irvine – http://archive.ics.uci.edu/ml/datasets.html
Harvard Univ – http://gis.harvard.edu/resources/data
Harvard Dataverse database – http://bit.ly/1RlXNKa
MIT: http://web.mit.edu/towtank/www/vivdr/datasets.html. Also, http://bit.ly/1IMJVri
Univ of North Carolina, adolescent health – http://www.cpc.unc.edu/projects/addhealth/data
Mars Crater Study, a global database that includes over 300,000 Mars craters 1 km or larger, provided by Wesleyan University:
- Dataset link.
- Descriptive guide.

Kaggle & Datascience resources:

A few of my favorites from the Kaggle website:
- Walmart recruiting at stores – http://bit.ly/1IMLANC
- Airbnb new user booking predictions – http://bit.ly/1N8G0QT
- US dept of education scorecard – https://www.kaggle.com/kaggle/college-scorecard
- Titanic Survival Analysis – https://www.kaggle.com/c/titanic
Databits.io – http://databits.io/challenges/opensource. My favorites among these are:
- Edx – http://bit.ly/1Pb7c2L
- Airbnb – http://bit.ly/1VBJIZF
Datasets on Climate information, human genome data, Enron email information, etc – https://www.quandl.com/search?type=free
Gapminder – http://www.gapminder.org/data/

Curated Lists:

KDnuggets provides a great list of datasets from almost every field imaginable – space, music, books, etc. May repeat some datasets from the list above.
http://www.kdnuggets.com/datasets/index.html
An eclectic mix of datasets about gun ownership, NYPD crime rates, college student study habits and caffeine concentrations in popular beverages – https://www.reddit.com/r/datasets
Data Science Central has also curated many datasets for free – http://www.datasciencecentral.com/profiles/blogs/big-data-sets-available-for-free
List of open datasets from DataFloq – https://datafloq.com/public-data/?sp=6358335213372237508418

Others:

MRI brain scan images and data – http://bit.ly/1kFfcke
Economic, education, health and other datasets from Quandl. Please note that this site also has a premium version with other datasets – https://www.quandl.com/search?type=free
Google repository of digitized books and ngram viewer – https://books.google.com/ngrams. Sample chart shown below:
Database with geographical information – http://freegisdata.rtwilson.com/
Loan information from Lending Club – https://www.lendingclub.com/info/download-data.action

Moderator variable with Chi-Square test

December 6, 2015 / Anu Rajaram

Hello All,

Today’s post is the assignment exercise for week 4 for the Coursera class on Data Visualization Tools from Wesleyan University.

The topic is as below:

Run an ANOVA, Chi-Square Test or correlation coefficient that includes a moderator.

For this round of assignments, I’m using the Outlook on Life dataset provided for the course, available here. Today I am going to test whether confidence in achieving a secure retirement (var = W1_F4_B) depends on income group (INCOME, calculated from the given var = PPINCIMP). The moderator variable is marital status (MARIT, computed from PPMARIT).

I am using the chi-square test for this assignment.

The hypothesis for this assignment is as follows:

H0 = There is no relationship between INCOME and W1_F4_B.
H1 = There is a significant relationship between the above two variables.

Procedure for Chi-Square test:

The INCOME variable has 5 levels:
- 20 => income between 0 and 19,999
- 40 => income between 20,000 and 39,999
- 60 => income between 40,000 and 59,999
- 80 => income between 60,000 and 99,999
- 100 => income greater than 99,999
The MARIT variable has 4 levels:
- 1 => married or living with partner
- 2 => widowed
- 3 => separated or divorced
- 4 => never married
W1_F4_B is modified to have only 2 levels:
- 1 => very hard or somewhat hard
- 4 => very easy or somewhat easy
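
As an illustration of the INCOME recoding above, here is a minimal Python sketch (the actual program for this post is in SAS). It assumes the input is an income amount in dollars; how PPINCIMP is coded in the raw dataset is not covered here, and the names UPPER_BOUNDS, LEVELS and income_group are made up for this example.

```python
from bisect import bisect_right

# Exclusive upper bounds of the first four income bands listed above.
UPPER_BOUNDS = [20000, 40000, 60000, 100000]
# INCOME level assigned to each band; the last level covers everything above 99,999.
LEVELS = [20, 40, 60, 80, 100]


def income_group(income):
    """Map an income amount to the 5-level INCOME variable."""
    if income < 0:
        raise ValueError("income cannot be negative")
    return LEVELS[bisect_right(UPPER_BOUNDS, income)]


if __name__ == "__main__":
    for amount in (0, 19999, 20000, 59999, 99999, 100000):
        print(amount, income_group(amount))
```

Using bisect keeps the band boundaries in one list, so changing a cutoff means editing a single number.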
The code for this program is in my GitHub SAS folder. The essence of the code is:

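/* Chi-square test of W1_F4_B against INCOME, run separately for each MARIT level */
PROC FREQ;
    TABLES W1_F4_B*INCOME / CHISQ;
    BY MARIT;  /* the data must already be sorted by MARIT */
RUN;
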
There are 5 levels in INCOME, so the post hoc tests need 10 pairwise comparisons (5 × 4 / 2). Hence the Bonferroni-adjusted p-value threshold is 0.05 / 10 = 0.005.
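
The arithmetic behind the adjusted threshold can be checked with a few lines of Python. This is only a sketch: it assumes the usual overall significance level of 0.05 (which is what 0.005 across 10 comparisons implies), and the function name bonferroni_threshold is made up for this example.

```python
from itertools import combinations

INCOME_LEVELS = [20, 40, 60, 80, 100]


def bonferroni_threshold(levels, alpha=0.05):
    """Return the pairwise comparisons and the Bonferroni-adjusted threshold."""
    pairs = list(combinations(levels, 2))  # every unordered pair of levels
    return pairs, alpha / len(pairs)


if __name__ == "__main__":
    pairs, threshold = bonferroni_threshold(INCOME_LEVELS)
    print(len(pairs), "comparisons:", pairs)
    print("adjusted threshold:", threshold)
```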
Code with moderator in the posthoc test comparisons:

/* comparison set 1 */
DATA COMPARISON1;
    SET temp_chk;
    IF INCOME=20 OR INCOME=40;  /* keep only the two income groups being compared */
RUN;
TITLE 'Comparison range 20 & 40';
PROC FREQ;
    TABLES W1_F4_B*INCOME / CHISQ;
    BY MARIT;
RUN;

Code without moderator in the posthoc test comparisons:

/* comparison set 1 */
DATA COMPARISON1;
    SET temp_chk;
    IF INCOME=20 OR INCOME=40;  /* keep only the two income groups being compared */
RUN;
TITLE 'Comparison range 20 & 40';
PROC FREQ;
    TABLES W1_F4_B*INCOME / CHISQ;
RUN;
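
To make explicit what each stratified comparison computes, here is a rough Python sketch of the Pearson chi-square statistic for one 2 x 2 table per MARIT level. It is not the SAS program used for this post, the counts are made up purely for illustration (they are not the survey results), and the function names are hypothetical. The p-value helper is valid only for 1 degree of freedom, which is the case for a 2 x 2 table.

```python
import math


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def p_value_df1(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


# MADE-UP counts for illustration only.
# Rows: W1_F4_B = 1, 4; columns: INCOME = 20, 40; one table per MARIT level.
tables_by_marit = {
    1: [[30, 20], [10, 40]],
    2: [[25, 25], [25, 25]],
}

if __name__ == "__main__":
    for marit, table in tables_by_marit.items():
        stat, df = chi_square_statistic(table)
        significant = p_value_df1(stat) < 0.005  # Bonferroni-adjusted threshold
        print(f"MARIT={marit}: chi-square={stat:.3f}, df={df}, significant={significant}")
```

Dropping the outer loop and pooling the counts across MARIT levels corresponds to the version without the moderator.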

Results & Interpretation:

The complete results are also available here: W4-INCOME-WEALTH-MODVAR-MARIT-POSTHOC-MARIT

Based on the output, the following conclusions can be inferred:

For the main chi-square test, we reject the null hypothesis only for MARIT = 1 (married or living with partner). So we accept H1, an association between wealth confidence and income, only for respondents who are married or living with a partner. For all other marital statuses, we fail to reject the null hypothesis.
The other main trend is that the majority of survey respondents show little confidence in achieving their wealth goals and a secure financial retirement. At lower incomes this is overwhelmingly so, but even at the highest income levels only about 35-40% of respondents remain positive.
[Figure: % of respondents who show high/low confidence in achieving a secure financial retirement]
Thus we see major differences in answers between the lowest and highest income groups only for marital status MARIT = 1 (married or living with a partner).
Based on the adjusted p-value threshold of 0.005, we see statistically significant differences for income groups 20 & 100, 40 & 100 and 60 & 100.
I ran the program with posthoc tests both with and without marital status as a moderator, but the trend again is seen only for MARIT = 1. If we do not use the moderator variable for the posthoc tests, we see only one extra comparison that is statistically different (income groups 20 & 80).

Thank you for taking a look at my analysis. Please feel free to add any suggestions for improvement or other feedback in the comments section.

Journey of Analytics

Deep dive into data analysis tools, theory and projects