How to raise money on Kickstarter – tutorial with EDA and predictions

February 3, 2018 / Anu Rajaram / 1 Comment

(This post is mirrored from our main blog site, blog.journeyofanalytics.com. The code can be downloaded or run “LIVE” from Kaggle using this link.)

Qs for Exploratory Analysis:

We will start our analysis with the aim of answering the following questions:

1. How many projects were successful on Kickstarter, by year and category?
2. Which sub-categories raised the most money?
3. Which countries do projects originate from?
4. How many projects exceeded their funding goal by 50% or more?
5. Did any projects reach $100,000 or more? $1,000,000 or higher?
6. What was the average amount contributed by each backer, and how does this change over time? Does this amount differ with categories?
7. What is the average funding period?

Predicting success rates:

Using the answers from the above questions, we will try to create a model that can predict which projects are most likely to be successful.

If you find this tutorial useful or interesting, then please upvote the kernel! 🙂

Step1 – Data Pre-processing

a) Let us take a look at the input dataset :

## 
Read 68.7% of 378661 rows
Read 378661 rows and 14 (of 14) columns from 0.051 GB file in 00:00:03

## 'data.frame':    378661 obs. of  14 variables:
##  $ ID              : int  1000002330 1000003930 1000004038 1000007540 1000011046 1000014025 1000023410 1000030581 1000034518 100004195 ...
##  $ name            : chr  "The Songs of Adelaide & Abullah" "Greeting From Earth: ZGAC Arts Capsule For ET" "Where is Hank?" "ToshiCapital Rekordz Needs Help to Complete Album" ...
##  $ category        : chr  "Poetry" "Narrative Film" "Narrative Film" "Music" ...
##  $ main_category   : chr  "Publishing" "Film & Video" "Film & Video" "Music" ...
##  $ currency        : chr  "GBP" "USD" "USD" "USD" ...
##  $ deadline        : chr  "2015-10-09" "2017-11-01" "2013-02-26" "2012-04-16" ...
##  $ goal            : num  1000 30000 45000 5000 19500 50000 1000 25000 125000 65000 ...
##  $ launched        : chr  "2015-08-11 12:12:28" "2017-09-02 04:43:57" "2013-01-12 00:20:50" "2012-03-17 03:24:11" ...
##  $ pledged         : num  0 2421 220 1 1283 ...
##  $ state           : chr  "failed" "failed" "failed" "failed" ...
##  $ backers         : int  0 15 3 1 14 224 16 40 58 43 ...
##  $ country         : chr  "GB" "US" "US" "US" ...
##  $ usd.pledged     : num  0 100 220 1 1283 ...
##  $ usd_pledged_real: num  0 2421 220 1 1283 ...

The projects are divided into main and sub-categories. The pledged amount (“pledged”, in the project’s own currency) has an equivalent value converted to USD, called “usd_pledged_real”. However, the goal amount does not have this conversion. So for now, we will use the amounts as is.

We can see how many people are backing each individual project using the column, “backers”.

b) Now let us look at the first 6 records:

The name doesn’t really indicate any specific pattern although it might be interesting to see if longer names have better success rates. Not pursuing that angle at this time, though.

##           ID                                                       name
## 1 1000002330                            The Songs of Adelaide & Abullah
## 2 1000003930              Greeting From Earth: ZGAC Arts Capsule For ET
## 3 1000004038                                             Where is Hank?
## 4 1000007540          ToshiCapital Rekordz Needs Help to Complete Album
## 5 1000011046 Community Film Project: The Art of Neighborhood Filmmaking
## 6 1000014025                                       Monarch Espresso Bar
##         category main_category currency   deadline  goal
## 1         Poetry    Publishing      GBP 2015-10-09  1000
## 2 Narrative Film  Film & Video      USD 2017-11-01 30000
## 3 Narrative Film  Film & Video      USD 2013-02-26 45000
## 4          Music         Music      USD 2012-04-16  5000
## 5   Film & Video  Film & Video      USD 2015-08-29 19500
## 6    Restaurants          Food      USD 2016-04-01 50000
##              launched pledged      state backers country usd.pledged
## 1 2015-08-11 12:12:28       0     failed       0      GB           0
## 2 2017-09-02 04:43:57    2421     failed      15      US         100
## 3 2013-01-12 00:20:50     220     failed       3      US         220
## 4 2012-03-17 03:24:11       1     failed       1      US           1
## 5 2015-07-04 08:35:03    1283   canceled      14      US        1283
## 6 2016-02-26 13:38:27   52375 successful     224      US       52375
##   usd_pledged_real
## 1                0
## 2             2421
## 3              220
## 4                1
## 5             1283
## 6            52375

c) Looking for missing values:

Hurrah, a really clean dataset: the only NAs are the 3,797 in “usd.pledged”, and no project name is missing. 🙂

# Check for NAs:
sapply(ksdf, function(x) sum(is.na(x)))

##               ID             name         category    main_category 
##                0                0                0                0 
##         currency         deadline             goal         launched 
##                0                0                0                0 
##          pledged            state          backers          country 
##                0                0                0                0 
##      usd.pledged usd_pledged_real 
##             3797                0

# Check for records with a missing name:
nrow(subset(ksdf, is.na(ksdf$name)))

## [1] 0

d) Date Formatting and splitting:

We have two dates in our dataset: the launch date (“launched”) and the deadline date (“deadline”). We convert them from strings to date format.
We also split these dates into their respective year and month columns, so that we can plot variations over time.
So we will now have 4 new columns: launch_year, launch_month, deadline_year and deadline_month.
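A minimal sketch of this step, using a tiny stand-in dataframe (ksdf_demo, a hypothetical name) so that the block runs on its own. The four new column names follow the text above; the exact calls used in the original kernel may differ.

```r
# Stand-in for ksdf: the two date columns as strings (values from the str() output above)
ksdf_demo <- data.frame(
  launched = c("2015-08-11 12:12:28", "2017-09-02 04:43:57"),
  deadline = c("2015-10-09", "2017-11-01"),
  stringsAsFactors = FALSE
)

# Convert strings to Date; for "launched" keep only the first 10 characters (the date part)
ksdf_demo$launched <- as.Date(substr(ksdf_demo$launched, 1, 10))
ksdf_demo$deadline <- as.Date(ksdf_demo$deadline)

# Split each date into year and month columns
ksdf_demo$launch_year    <- as.integer(format(ksdf_demo$launched, "%Y"))
ksdf_demo$launch_month   <- as.integer(format(ksdf_demo$launched, "%m"))
ksdf_demo$deadline_year  <- as.integer(format(ksdf_demo$deadline, "%Y"))
ksdf_demo$deadline_month <- as.integer(format(ksdf_demo$deadline, "%m"))

str(ksdf_demo)
```

Storing the year and month as integers keeps them easy to sort and to filter numerically later on.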

Exploratory analysis:

a) How many projects are successful?

prop.table(table(ksdf$state))*100

## 
##   canceled     failed       live successful  suspended  undefined 
## 10.2410864 52.2153060  0.7391836 35.3762336  0.4875073  0.9406831

We see that “failed” and “successful” are the two main categories, comprising ~88% of our dataset.
Sadly we do not know why some projects are marked “undefined” or “canceled”.
“live” projects are those where the deadlines have not yet passed, although a few among them have already achieved their goal.
Surprisingly, some “canceled” projects had also met their goals (pledged >= goal).
Since these other categories are a very small portion of the dataset, we will subset and only consider records with status “failed” or “successful” for the rest of the analysis.
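The subsetting described above could look roughly like this. It is a sketch on a small stand-in dataframe (ksdf_demo is a hypothetical name), not necessarily the exact code from the kernel.

```r
# Stand-in for ksdf with a mix of project states
ksdf_demo <- data.frame(
  ID    = 1:6,
  state = c("failed", "successful", "canceled", "failed", "live", "successful"),
  stringsAsFactors = FALSE
)

# Keep only the two outcomes we want to analyse
ksdf_demo <- subset(ksdf_demo, state %in% c("failed", "successful"))

# Share of each remaining state, in percent
prop.table(table(ksdf_demo$state)) * 100
```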

b) How many countries have projects on kickstarter?

## 
##     AT     AU     BE     CA     CH     DE     DK     ES     FR     GB 
##    485   6616    523  12370    652   3436    926   1873   2520  29454 
##     HK     IE     IT     JP     LU     MX  N,0""     NL     NO     NZ 
##    477    683   2369     23     57   1411    210   2411    582   1274 
##     SE     SG     US 
##   1509    454 261360

We see projects are overwhelmingly from the US. Some records have the country code N,0"", so we mark them as unknown.
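One way to do this relabelling is sketched below on a stand-in dataframe. The replacement label "Unknown" and the name ksdf_demo are assumptions made for illustration.

```r
# Stand-in for ksdf with two malformed country codes
ksdf_demo <- data.frame(
  country = c("US", "GB", 'N,0""', "CA", 'N,0""'),
  stringsAsFactors = FALSE
)

# Replace the malformed code with an explicit label
ksdf_demo$country[ksdf_demo$country == 'N,0""'] <- "Unknown"

table(ksdf_demo$country)
```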

c) Number of projects launched per year:

## 
##  2009  2010  2011  2012  2013  2014  2015  2016  2017 
##  1179  9577 24049 38480 41101 59306 65272 49292 43419

Looks like some records have dates like 1970, which does not look right. So we discard any records with a launch / deadline year before 2009.
Plotting the counts per year on a graph:
From the graph below, it looks like the count of projects peaked in 2015, then went down. However, this should NOT be taken as an indicator of success rates.
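As a rough sketch, the year filter and the per-year counts can be written as follows. The stand-in dataframe ksdf_demo is hypothetical; launch_year and deadline_year are the columns created in the date-splitting step.

```r
# Stand-in for ksdf, including one record with a bogus 1970 launch year
ksdf_demo <- data.frame(
  launch_year   = c(1970L, 2009L, 2015L, 2015L, 2017L),
  deadline_year = c(2010L, 2009L, 2015L, 2016L, 2017L)
)

# Discard records launched or ending before 2009
ksdf_demo <- subset(ksdf_demo, launch_year >= 2009 & deadline_year >= 2009)

# Number of projects launched per year
year_counts <- table(ksdf_demo$launch_year)
year_counts
```

The resulting table can be passed straight to a plotting function such as barplot() to get the counts-per-year graph.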

Drilling down a bit more to see count of projects by main_category.

Over the years, the largest numbers of projects have been in these categories:

1. Film & Video
2. Music
3. Publishing

d) Number of projects by sub-category: (Top 20 only)

The Top 5 sub-categories are:

1. Product Design
2. Documentary
3. Music
4. Tabletop Games (interesting!!!)
5. Shorts (really?!)

Let us now see the “Status” of projects for these Top 5 sub-categories:
From the graph below, we see that for the sub-categories “Shorts” and “Tabletop Games” there are more successful projects than failed ones.

e) Backers by category and sub-category:

Since there are a lot of sub-categories, let us explore the sub-categories under the main theme “Design”

Product design is not just the sub-category with the highest count of projects, but also the category with the highest success ratio.

f) Add a flag to see how many projects met or exceeded their goal.

ksdf$goal_flag <- ifelse(ksdf$pledged >= ksdf$goal, 1, 0)
prop.table(table(ksdf$goal_flag))*100

## 
##        0        1 
## 59.61197 40.38803

So ~40% of projects reached or surpassed their goal, which matches the share of successful projects once we keep only “failed” and “successful” records.
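The same pattern extends to question 4 from the list at the top (projects that exceeded their goal by 50% or more). The sketch below uses a stand-in dataframe; the column name over_50_flag and the name ksdf_demo are made up for illustration.

```r
# Stand-in for ksdf with goal and pledged amounts
ksdf_demo <- data.frame(
  goal    = c(1000, 5000, 50000, 2000),
  pledged = c(1600, 4000, 52375, 2000)
)

# 1 if the project met or exceeded its goal, else 0 (same rule as goal_flag above)
ksdf_demo$goal_flag <- ifelse(ksdf_demo$pledged >= ksdf_demo$goal, 1, 0)

# 1 if the project raised at least 150% of its goal, else 0
ksdf_demo$over_50_flag <- ifelse(ksdf_demo$pledged >= 1.5 * ksdf_demo$goal, 1, 0)

# Share of projects in each group, in percent
prop.table(table(ksdf_demo$over_50_flag)) * 100
```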

g) Calculate average contribution per backer:

From the mean, median and max values we quickly see that the median amount contributed by each backer is only ~$42, whereas the mean (~$73) is higher due to the extreme positive values. The max average amount per backer for a single project is $50,000.
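The “contrib” column is not defined in the snippets shown here. A plausible sketch is the pledged amount divided by the number of backers. Setting projects with zero backers to 0 is an assumption (it is consistent with the minimum of 0.00 in the summary, but the kernel may handle it differently), and ksdf_demo is a hypothetical stand-in.

```r
# Stand-in for ksdf (values from the first records shown earlier)
ksdf_demo <- data.frame(
  pledged = c(0, 2421, 220, 52375),
  backers = c(0L, 15L, 3L, 224L)
)

# Average contribution per backer; projects with no backers get 0 instead of NaN (0/0)
ksdf_demo$contrib <- ifelse(ksdf_demo$backers > 0,
                            ksdf_demo$pledged / ksdf_demo$backers,
                            0)

summary(ksdf_demo$contrib)
```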

summary(ksdf$contrib)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00    16.00    41.78    73.35    78.00 50000.00

hist(ksdf$contrib, main = "Histogram of average contribution per backer")

h) Calculate reach_ratio

The amount per backer is a good start, but what if the goal amount itself is only $1000? Then an average contribution per backer of $50 implies we only need 20 backers.
So to better understand the probability of a project’s success, we create a derived metric called “reach_ratio”.
This expresses the average contribution per backer as a percentage of the goal amount.

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      0.00      0.16      0.75      6.17      2.16 166666.67

We see the median reach_ratio is <1%. Only at the third quartile do we even cross 2%!
Clearly most projects have a very low reach ratio. We could subset for “successful” projects only and check if the reach_ratio is higher.
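The formula for reach_ratio is not shown above. Since the summary is read as a percentage, a reasonable guess is 100 * contrib / goal, which is what this sketch assumes. It reuses the $1000 goal and $50 contribution example from the text; the second row and the name backers_needed are illustrative.

```r
# Worked example: first row is the $1000 goal / $50 per backer case from the text
demo <- data.frame(
  goal    = c(1000, 30000),
  contrib = c(50, 161.4)
)

# Average contribution as a percentage of the goal (assumed definition)
demo$reach_ratio <- 100 * demo$contrib / demo$goal

# Backers needed to hit the goal at that average contribution
demo$backers_needed <- ceiling(demo$goal / demo$contrib)

demo
```

Under this definition, a reach_ratio of 5 means each backer covers 5% of the goal, so 20 such backers are enough.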

i) Number of days to achieve goal:

Most projects run for about a month, as seen from the graph below.
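The funding period can be derived from the two date columns. The sketch below uses the launch and deadline dates of the first three records; the variable name duration_days is an assumption.

```r
# Launch and deadline dates of the first three records in the dataset
launched <- as.Date(c("2015-08-11", "2017-09-02", "2013-01-12"))
deadline <- as.Date(c("2015-10-09", "2017-11-01", "2013-02-26"))

# Length of the funding period in days
duration_days <- as.numeric(difftime(deadline, launched, units = "days"))

duration_days
summary(duration_days)
```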

Predictive Analytics:

We will apply a very simple decision tree algorithm to our dataset.
Since we do not have a separate “test” set, we will split the input dataframe into 2 parts (70/30 split).
We will use the smaller set to test the accuracy of our algorithm.

ksdf$status = ifelse(ksdf$state == 'failed', 0, 1)

## 70% of the sample size
smp_size <- floor(0.7 * nrow(ksdf))

## set the seed to make your partition reproducible
set.seed(486)
train_ind <- sample(seq_len(nrow(ksdf)), size = smp_size)

train <- ksdf[train_ind, ]
test <- ksdf[-train_ind, ]

library(tree)
tree1 <- tree(status ~ goal + reach_ratio + category + backers + country + launch_year , data = train)

## Warning in tree(status ~ goal + reach_ratio + category + backers + country
## + : NAs introduced by coercion

summary(tree1)

## 
## Regression tree:
## tree(formula = status ~ goal + reach_ratio + category + backers + 
##     country + launch_year, data = train)
## Variables actually used in tree construction:
## [1] "backers"     "reach_ratio"
## Number of terminal nodes:  9 
## Residual mean deviance:  0.02429 = 5640 / 232200 
## Distribution of residuals:
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.9816000 -0.0006945 -0.0006945  0.0000000  0.0410400  0.9993000

Taking a peek at the decision tree rules:

plot(tree1)
text(tree1 ,pretty =0)

[Figure: Kickstarter success decision tree]

tree1

## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 232172 55900.00 0.4039000  
##    2) backers < 17.5 121911  9326.00 0.0834600  
##      4) reach_ratio < 5.88118 107986    74.95 0.0006945 *
##      5) reach_ratio > 5.88118 13925  2774.00 0.7253000  
##       10) backers < 5.5 4723  1031.00 0.3218000  
##         20) reach_ratio < 19.9667 2792     0.00 0.0000000 *
##         21) reach_ratio > 19.9667 1931   323.50 0.7872000 *
##       11) backers > 5.5 9202   580.00 0.9324000 *
##    3) backers > 17.5 110261 20210.00 0.7583000  
##      6) reach_ratio < 0.79672 40852 10190.00 0.5217000  
##       12) backers < 128.5 16858    39.91 0.0023730 *
##       13) backers > 128.5 23994  2413.00 0.8866000 *
##      7) reach_ratio > 0.79672 69409  6383.00 0.8975000  
##       14) backers < 35.5 20535  3836.00 0.7514000  
##         28) reach_ratio < 2.85458 4816     0.00 0.0000000 *
##         29) reach_ratio > 2.85458 15719   284.60 0.9816000 *
##       15) backers > 35.5 48874  1924.00 0.9590000 *

Thus we see that “backers” and “reach_ratio” are the only variables the tree actually uses.

Re-applying the tree rules to the training set itself, we can validate our model:

Predt <- predict(tree1, train)

validf <- data.frame( kickstarter_id = train$ID, orig_status = train$status, new_status = Predt)
validf$new = ifelse(validf$new_status < 0.5, 0, 1)

# contingency Tables:
table(validf$orig_status, validf$new)

##    
##          0      1
##   0 132337   6051
##   1    115  93669

# Area under the curve
library(pROC)
auc(validf$orig_status, validf$new)

## Area under the curve: 0.9775

From the above tables, we see that the error rate is ~3% (6,166 of 232,172 records misclassified) and the area under the curve is ~0.98.
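Both figures can be recomputed by hand from the contingency table, which is a handy sanity check. The sketch below uses only base R; the variable names are illustrative.

```r
# Contingency table from the training-set validation above
# (rows = original status, columns = predicted status)
conf_mat <- matrix(c(132337, 115, 6051, 93669), nrow = 2,
                   dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Error rate: share of records off the diagonal
error_rate <- 1 - sum(diag(conf_mat)) / sum(conf_mat)

# Per-class accuracy
specificity <- conf_mat["0", "0"] / sum(conf_mat["0", ])  # failed projects predicted as failed
sensitivity <- conf_mat["1", "1"] / sum(conf_mat["1", ])  # successful projects predicted as successful

# With hard 0/1 predictions the ROC curve has a single operating point,
# so its area equals the average of sensitivity and specificity
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(error_rate = error_rate, balanced_accuracy = balanced_accuracy), 4)
```

Because auc() was given the thresholded 0/1 predictions rather than the raw tree scores, the reported area is the same quantity as the balanced accuracy computed here.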

Finally applying the tree rules to the test set, we get the following stats:

Pred1 <- predict(tree1, test)

On the test set, the error rate is still ~3% and the area under the curve is still >= 97%.

Conclusion:

Thus in this tutorial, we explored the factors that contribute to a project’s success. Main theme and sub-category were important, but the number of backers and “reach_ratio” were found to be most critical.
If a founder wanted to gauge their probability of success, they could measure their “reach_ratio” halfway to the deadline, or perhaps when 25% of the timeline is complete. If the numbers are low, it means they need to double down and use promotions/social media marketing to get more backers and funding.

If you liked this tutorial, feel free to fork the script. And don’t forget to upvote the kernel! 🙂

Who wants to work at Google?

January 21, 2018 / Anu Rajaram / 1 Comment

In this tutorial, we will explore the open roles at Google, and try to see what common attributes Google is looking for, in future employees.

This dataset is a compilation of job descriptions for 1200+ open roles at Google offices across the world. It is available for download from the Kaggle website, and contains text information about the job location, title, department, minimum and preferred qualifications, and responsibilities of each position. You can download the dataset here, and run the code on the Kaggle site itself here.

Using this dataset we will try to answer the following questions:

Where are the open roles?
Which departments have the most openings?
What are the minimum and preferred educational qualifications needed to get hired at Google?
How much experience is needed?
What categories of roles are the most in demand?

Step1 – Data Preparation and Cleaning:

The data is all in free-form text, so we do need to do a fair amount of cleanup to remove non-alphanumeric characters. Some of the job locations have special characters too, so we remove those using basic string manipulation functions. Once we read in the file, this is the snapshot of the resulting dataframe:
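A small helper along these lines could handle the cleanup. This is a sketch: the function name clean_text and the set of characters it keeps are assumptions, not the kernel’s actual rules.

```r
# Replace unwanted characters with spaces, then tidy up the whitespace
clean_text <- function(x) {
  x <- gsub("[^A-Za-z0-9 ,.&+-]", " ", x)  # keep letters, digits and a few punctuation marks
  x <- gsub("\\s+", " ", x)                # collapse runs of whitespace into one space
  trimws(x)                                # drop leading and trailing spaces
}

clean_text("  BA/BS degree;  5+ years'  experience\n")
```

Replacing with a space rather than deleting avoids gluing words together, for example “BA/BS” becomes “BA BS” instead of “BABS”.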

Step 2 – Analysis:

Now we will use R programming to identify patterns in the data that help us answer the questions of interest.

a) Job Categories:

First let us look at which departments have the most open roles. Surprisingly, there are more roles open for the “Marketing and Communications” and “Sales & Account Management” categories than for the traditional technical business units (like Software Engineering or Networking).

b) Full-time versus internships:

Let us see how many roles are full-time and how many are for students. As expected, only ~13% of roles are for students, i.e. internships. The majority are full-time positions.

c) Technical Roles:

Since Google is a predominantly technical company, let us see how many positions need technical skills, irrespective of the business unit (job category).

a) Roles related to “Google Cloud”:

To check this, we investigate how many roles have the phrase either in the job title or the responsibilities. As shown in the graph below, ~20% of the roles are related to Cloud infrastructure, clearly showing that Google is making Cloud services a high priority.
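The phrase search can be sketched with grepl(). The dataframe jobs_demo and its column names (title, responsibilities) are hypothetical stand-ins for the real dataset.

```r
# Stand-in for the job listings
jobs_demo <- data.frame(
  title = c("Cloud Solutions Architect", "Marketing Manager", "Technical Program Manager"),
  responsibilities = c("Design customer deployments.",
                       "Plan campaigns.",
                       "Drive Google Cloud launches."),
  stringsAsFactors = FALSE
)

# TRUE if "cloud" appears in either the title or the responsibilities (case-insensitive)
is_cloud <- grepl("cloud", jobs_demo$title, ignore.case = TRUE) |
  grepl("cloud", jobs_demo$responsibilities, ignore.case = TRUE)

# Share of Cloud-related roles, in percent
mean(is_cloud) * 100
```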

b) Senior Roles and skills :

A quick word search also reveals how many senior roles (roles that require 10+ years of experience) use the word “strategy” in their list of requirements, under either qualifications or responsibilities. Word association analysis can also show this (not shown here).

Educational Qualifications:

Here we are basically parsing the “min_qual” and “pref_qual” columns to see the qualifications needed for the role. If we only take the minimum qualifications into consideration, we see that 80% of the roles explicitly ask for a bachelor’s degree. Fewer than 5% of roles ask for a master’s or PhD.

However, when we consider the “preferred” qualifications, the share of roles asking for a master’s or PhD increases to a whopping ~25%. Thus, a fourth of all roles would be more suited to candidates with master’s degrees and above.

Google Engineers:

Google is famous for hiring engineers for all types of roles. So we will read the job qualification requirements to identify what percentage of roles requires a technical degree or degree in Engineering.
As seen from the data, 35% specifically ask for an Engineering or computer science degree, including roles in marketing and non-engineering departments.

Years of Experience:

We see that 30% of the roles require at least 5 years, while 35% of roles need even more experience.
So if you did not get hired at Google after graduation, no worries. You have a better chance after gaining solid experience at other companies.
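One way to pull the required years of experience out of the free-form qualification text is a regular expression. The sketch below is an assumption about the approach: the function name extract_years and the sample strings are made up for illustration.

```r
# Return the first "<number> years" or "<number>+ years" found in each string, as an integer
extract_years <- function(x) {
  hit <- regexpr("[0-9]+\\+? years", x)   # position of the first match, -1 if none
  out <- rep(NA_integer_, length(x))
  matched <- as.vector(hit) > 0
  # regmatches() returns only the matched strings, in order
  out[matched] <- as.integer(gsub("[^0-9]", "", regmatches(x, hit)))
  out
}

min_qual_demo <- c(
  "BA/BS degree. 5 years of experience in sales.",
  "10+ years of experience in cloud strategy.",
  "Currently enrolled in a degree program."
)

extract_years(min_qual_demo)
```

Roles with no stated number come back as NA, so they can be counted separately instead of being treated as zero years.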

Role Locations:

The dataset does not have the geographical coordinates for mapping. However, this is easily overcome by using the geocode() function and the amazing rworldmap package. We are only plotting the locations, not the number of roles at each one, even though some places have more roles than others. We see open roles in all parts of the world. However, the most positions are in the US, followed by the UK, and then Europe as a whole.

Responsibilities – Word Cloud:

Let us create a word cloud to see what skills are most needed for the Cloud engineering roles. We see that words like “partner”, “custom solutions”, “cloud”, “strategy” and “experience” are more frequent than any specific technical skills. This shows that the Google Cloud roles are best filled by senior resources, where leadership and business skills become more significant than expertise in a specific technology.

Conclusion:

So who has the best chance of getting hired at Google?

For most of the roles (from this dataset), a candidate with the following traits has the best chance of getting hired:

5+ years of experience.
Engineering or Computer Science bachelor’s degree.
Master’s degree or higher.
Working in the US.

The code for this script and graphs are available here on the Kaggle website. If you liked it, don’t forget to upvote the script. 🙂 And don’t forget to share!

Next Steps:

You can tweak the code to perform the same analysis, but on a subset of data. For example, only roles in a specific department, location (HQ in California) or Google Cloud related roles.

Thanks and happy coding!

(Please note that this post has been reposted from the main blog site at http://blog.journeyofanalytics.com/ )

Zillow Rent Analysis

August 19, 2017 / Anu Rajaram / 2 Comments

Hello Readers,

This is a notification post – Did you realize our website has moved? The blog is live at New JA Blog under the domain http://www.journeyofanalytics.com . You can read about the rent analysis post here.

If you received this post AND an email from anu_analytics, then please disregard this post.

If you received this post update from WordPress, but did NOT receive an email from anu_analytics (via MailChimp email) then please send us an email at anuprv@journeyofanalytics.com . The email from the main site was sent out 4 hours ago. Alternatively, you can sign up using this contact form.

(Email screenshot below)

JourneyofAnalytics Newsletter

Again, the latest blogposts are available at blog.journeyofanalytics.com and all code/project files are available under the Projects page.

See you at our new site. Happy Coding!

Journey of Analytics

Deep dive into data analysis tools, theory and projects
