Graphical Data Exploration

The first step in any datascience project is understanding the data at hand and identifying patterns and relationships. In today’s post we will do just that, using the “visa-salary-set”, which is a dataset containing information from US visa applications for “high-skilled” including annual wages, job title, employer name, work state, etc.

Similar data can be obtained from the Bureau of Labor Statistics or sites like myvisajobs.com. The latter is a great tool if you want to check out if your “dream” company sponsors work-visas! We will also be using numerous R-packages for advanced visualizations and interactive maps.

The full code is provided here, while the post below is simply a high-level explanation.

So let’s get started…

Step 1: Prepare the workspace

In this step, we first empty the workspace to delete unwanted variables and free up memory. Remember to load all the required library packages (sqldf to use SQL queries, ggplot and ggvis for visualizations, etc.), specify default number formatting, before loading the actual dataset.

In some cases, there may be separate training and test datasets and even a file with supplemental data (specially Kaggle competitions). The last usually is analogous to a secondary table with  foreign key. To make sense or do anything meaningful (like feature engineering) you have to connect/join it to the main table.

In our case, we only have one file “salary.csv“. I prefer using the fread() instead of read.csv due to the speed. It does not matter much in this case, but you will see a significant speed advantage for any sets with 5million+ rows. We will also read in a tiny file containing US states, union territories and geographical coordinates of their capital cities. I created this myself and find it very useful when working with maps or leaflets. Feel free to download and store for your own use as well.

saldata = fread(“salary_data.csv”)  # 167278 rows and 26 columns.

state_us = fread(“C:/anu/data analytics/us_states_abbr.csv”)  #geog info – US states

 

Step 2 : Data Preparation

I always like to start by exploring basic summary  for the dataset, followed by a quick check to see number of NAs and unique values in every column. This allows to identify which columns to ignore and which ones to keep.

summary(saldata)    # median and quantile information for all columns.

sapply(saldata, function(x) sum(is.na(x)))      # no. of NAs

sapply(saldata, function(x) length(unique(x)))   # no. of factors or unique values, useful if you have gender, state or recurring values.

High-level observations:

  • 6 columns (COLLEGE_MAJOR_REQUIRED, WORK_POSTAL_CODE, etc.) have too many missing values, so we discard them from our set.
  • Some missing values can be logically added, like EXPERIENCE_REQUIRED_Y_N should default to “y”, so we will add code to fill those missing values.
  • Date values are interpreted as “string”, so we format them using the as.Date() function to extract year, month and day of the year.

Next, we add geographical coordinates using the “state-info” dataset by using a join with name of the state. This will simply create a “generic” geographical marker for each state, located at the location of its capital (not center of state). We use this step to narrow the dataset further.

sal = sqldf(“select activesal.EMPLOYER_NAME, activesal.JOB_TITLE,
activesal.WORK_STATE2, activesal.VISA_CLASS,
activesal.PAID_WAGE_PER_YEAR,
activesal.JOB_TITLE_SUBGROUP, activesal.caseyr,
state_us.Latitude, state_us.Longitude
FROM activesal , state_us
where activesal.WORK_STATE2 = state_us.WORK_STATE2”)  ## SQL join

 

Step 3: Graphical Data Exploration:

Now we come to the exciting part, and my favorite- visual exploration! 🙂 Using the graphical library functions, we see the following patterns:

Highest paying Employers by State: (Notice the markers at the capital locations. Markers are automatically clustered for smaller states too close to each other)

Best paymasters by state, based on avg salary on work-visa applications.

Highest paying employers by state

Continue reading

July Projects – Data Visualization with R

Projects for month of July are now uploaded on Projects page. This month’s motif is advanced graphics and visualizations in R. Basic graphs are already added here.

  1. bcharts.R : Program to create horizontal bar charts and side-stacked bar charts using ggplot package.
  2. surfaceplots.R : Program to create 3D visuals of geometric and complex mathematical functions, like the one below: surface_plot
  3. explore_new.R: Graphical data exploration using the US education scorecard data from a Kaggle competition. In this program we use the leaflet package to construct an interactive US state-level map to showcase college data (admission rate, average faculty salary, median student debt, etc). The map allows user to zoom in to view details about the college. (Image below)  map_univ
  4. bubble_chart.R : Attractive bubble charts showing crime rates per district. Dataset is pulled in from the Kaggle competition for San Francisco crime data.

Feel free to fork the code to adapt for your own needs. As usual, feel free to leave your feedback in the comments section beneath.

Cheatsheet – Selecting Graphs for Statistical Analysis

One of the first steps with any statistical analysis, whether for hypothesis testing or predictive analytics or even a Kaggle competition, is checking the relationship between different variables. Checking if a pattern exists.

Graphs are a fantastic and visual way of identifying such relationships.

graph-matplotlib

MATPLOTLIB Graph

However, numerous readers kept getting stuck while selecting graphs for categorical variables and many friends asked if there was a standard rule for graph selection. With that in mind, please see below a cheatsheet for graphical selection for both quantitative (numeric) and categorical ( character -gender, disease type, etc.) variables.

 

 No.

Axis1

Axis2

Chart type

1.

Single quant

Histograms, Density plot, Box plot
2.

Single categorical

Bar chart (freq/ count), Pie chart (freq/ count/%)
2.

Categorical

Quant

Bar chart, pie chart, frequency table, line chart
3.

Quant

Quant

Scatterplot
4.

Categorical

Categorical

Stacked Column Chart, combination chart (typical bar chart with trendlines)
5.

2 categorical

Quant

Stacked or side-by-side bar charts, heat maps. Any basic graph, with Color/shape code for one of the quant variables.
6.

1 categorical

2 Quant

Stacked or side-by-side bar charts, Scatter plots. Any basic graph, with Color/shape code for one of the quant variables.
7.

3+ variables of any type

Please check if you really need so many variables in a single graph. Side-by-side graphs may be a better option, or graphs with filters (if possible based on the programming language)

These are merely guidelines and are language-agnostic, so you may choose to implement them in your choice of programming language ( R, Python, SAS, MATLAB, etc.) . However, if you prefer, code implementations in R and Python are provided in the links below:

  • Charts in R :
  • Charts in Python :
    • This link contains code and images to create stunning graphs (box plots, histograms, heatmaps, bubble charts, etc) using MATPLOTLIB library, like the one shown above.

Hope you find this cheatsheet useful! Feel free to share your thoughts and comments. Adieu!