My Titanic journey!

 

Today’s post is an overview of my experiments with the Titanic Kaggle competition. My goal with this project was mainly to identify which machine learning algorithms work best with categorical response variables.

The dataset for this competition is freely available on the Kaggle website (link here), and my R code is available in my GitHub repository.

The project, along with some other R programs, has also been newly added to my Projects page.

Becoming a Data Detective:

Exploring the data, these were my observations from the training set (a rough sketch of this exploration follows the list):

  • Women had a much better chance of survival than men.
  • However, senior women in passenger class (pclass) 2 had low odds of survival.
  • Being a female child in pclass 3 with 3+ siblings worsened survival odds. (I refuse to believe in misogynist parents, so possibly these girls were left behind in the pandemonium.)
  • Among men, male children (<18 years) in pclass 1 and 2 had better odds of survival.
  • However, being a male child in pclass 2 with no parents or siblings aboard (Parch = 0, SibSp = 0) dramatically reduced survival odds.
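
For the curious, the quick tabulations behind these observations look something like the sketch below. This is not the exact exploration code from my repo, just a minimal version assuming the standard Kaggle train.csv columns (Survived, Sex, Pclass, Age, SibSp, Parch):

```r
# A rough sketch of the tabulations behind the observations above,
# assuming the standard Kaggle train.csv columns.
train <- read.csv("train.csv", stringsAsFactors = FALSE)

# Survival rate by gender: women fared far better than men
aggregate(Survived ~ Sex, data = train, FUN = mean)

# Survival rate by gender and passenger class
aggregate(Survived ~ Sex + Pclass, data = train, FUN = mean)

# Girls (<18) in pclass 3 with 3 or more siblings/spouses aboard
girls_p3 <- subset(train, Sex == "female" & Age < 18 & Pclass == 3 & SibSp >= 3)
mean(girls_p3$Survived)

# Boys (<18) in pclass 2 travelling alone (Parch = 0, SibSp = 0)
boys_p2 <- subset(train, Sex == "male" & Age < 18 & Pclass == 2 &
                    Parch == 0 & SibSp == 0)
mean(boys_p2$Survived)
```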

 

Kaggle Submissions and their Comparisons:

For easy understanding, I have tabulated my results; you can look up the appropriate programs to view the code.

| No. | Algorithm | Program Name | Kaggle Score | Remarks |
|-----|-----------|--------------|--------------|---------|
| 1 | Survival based on gender classification | Gender.R | 0.76555 | Baseline program, used for comparing all other approaches. |
| 2 | Logic rules based on broad observations from the training set | Logic_rules.R | 0.77033 | Better than simple gender classification. |
| 3 | Naïve Bayes classifier | Nb_titanic.R | 0.74641 | Classic case of overfitting! Performed worse than even the simple gender classification! |
| 4 | Neural network | Nnet_titanic.R | 0.77033 | Classification rules were eerily similar to my own logic rules, so I was delighted! |
| 5 | Random forest | Rf_titanic.R | 0.77512 | I love random forest because it is resistant to over-fitting and the depth of tree splits can be controlled. Plus, tree-type classification rules are easily applied to real-world scenarios. Best of all, it is easy to explain even to non-technical folks! |
| 6 | Decision tree | Tree.R | 0.78947 | Highest score. Sometimes simple works best! Note: playing with derived variables did NOT boost scores further. |
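
For the curious, here is roughly what the random-forest approach looks like. This is a minimal sketch rather than the exact code from Rf_titanic.R, and it assumes the standard Kaggle train.csv/test.csv files with only basic cleanup:

```r
# A minimal sketch of the random-forest approach (not the exact code from
# Rf_titanic.R), assuming the standard Kaggle train.csv/test.csv files.
library(randomForest)

train <- read.csv("train.csv", stringsAsFactors = FALSE)
test  <- read.csv("test.csv",  stringsAsFactors = FALSE)

# Minimal cleanup: make Sex a factor and fill missing Age/Fare with the training median
train$Sex <- as.factor(train$Sex)
test$Sex  <- factor(test$Sex, levels = levels(train$Sex))
train$Age[is.na(train$Age)] <- median(train$Age,  na.rm = TRUE)
test$Age[is.na(test$Age)]   <- median(train$Age,  na.rm = TRUE)
test$Fare[is.na(test$Fare)] <- median(train$Fare, na.rm = TRUE)

# Fit a random forest on a handful of obvious predictors; ntree sets the
# ensemble size and nodesize (minimum terminal node size) reins in tree depth
set.seed(42)
rf_fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare,
                       data = train, ntree = 500, nodesize = 5, importance = TRUE)

# Which variables matter most?
varImpPlot(rf_fit)

# Build the Kaggle submission file
submission <- data.frame(PassengerId = test$PassengerId,
                         Survived    = predict(rf_fit, newdata = test))
write.csv(submission, "rf_submission.csv", row.names = FALSE)
```

Swapping the random forest for a single decision tree (as in Tree.R) follows the same pattern: fit on train.csv, predict on test.csv, write the submission file.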

Conclusion:

It is well-known that Kaggle competitions are a great way to apply machine learning concepts and algorithms. However, this particular Titanic dataset taught me a few interesting lessons:

  1. Data exploration is very important. As seen from the gender baseline score, we can make ~76% of predictions correctly simply by classifying according to gender and ignoring everything else (see the sketch after this list).
  2. Complex, new-fangled algorithms don’t always work better, as seen from the dismal score with the Naïve Bayes model.
  3. Test, test, test your hypothesis.
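
To make point 1 concrete, the entire gender-only baseline (the idea behind Gender.R) boils down to a few lines. This is a sketch of the idea assuming the standard test.csv, not necessarily the exact contents of Gender.R:

```r
# Gender-only baseline: predict survival for every woman, death for every man
test <- read.csv("test.csv", stringsAsFactors = FALSE)

submission <- data.frame(PassengerId = test$PassengerId,
                         Survived    = ifelse(test$Sex == "female", 1, 0))
write.csv(submission, "gender_submission.csv", row.names = FALSE)
```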

 

So, what was your experience with the Kaggle Titanic competition? Did you use a totally different algorithm to help boost your scores?


Please share your feedback and opinions. Thanks!
