Today’s post is an overview of my experiments with the Titanic Kaggle competition. My goal with this project was mainly to identify which machine learning algorithms work best with categorical response variables.
The project and some other R programs have also been added to my Projects page.
Becoming a Data Detective:
While exploring the training set, I made these observations:
- Women had a much better chance of survival than men.
- However, senior women in passenger class (pclass) 2 had low odds of survival.
- Female children in pclass 3 with 3+ siblings had a worse survival rate. (I refuse to believe in misogynist parents, so possibly these girls were left behind in the pandemonium.)
- Among men, male children (<18 years) in pclass 1 and 2 had better odds of survival.
- However, being a male child in pclass 2 with no parents or no siblings (Parch = 0, SibSp = 0) dramatically reduces survival rate.
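Observations like these come from comparing survival rates across groups. Here is a minimal sketch in base R of how such rates can be computed. The data frame is a made-up toy stand-in for the training set (the rows are invented for illustration), and only the column names `Survived`, `Sex`, `Pclass` and `Age` follow the Kaggle file; the `Child` flag is my own helper column.

```r
# Toy stand-in for the Kaggle training set; the rows are made up for illustration.
train <- data.frame(
  Survived = c(1, 1, 0, 1, 0, 0, 1, 0, 0, 1),
  Sex      = c("female", "female", "female", "female", "male",
               "male", "male", "male", "male", "male"),
  Pclass   = c(1, 2, 3, 3, 1, 2, 1, 3, 3, 2),
  Age      = c(29, 35, 8, 22, 45, 30, 10, 25, 16, 4)
)

# Flag children using the same cutoff as the observations above (< 18 years).
train$Child <- train$Age < 18

# The mean of a 0/1 column is the survival rate within each group.
rate_by_sex   <- aggregate(Survived ~ Sex, data = train, FUN = mean)
rate_by_group <- aggregate(Survived ~ Sex + Pclass + Child, data = train, FUN = mean)

print(rate_by_sex)
print(rate_by_group)
```

On the real training set, some ages are missing, so the `Child` flag would need a decision on how to treat those rows before the grouped rates mean much.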
Kaggle Submissions and their Comparisons:
For easy understanding, I have tabulated my results; you can look up the corresponding programs to view the code.
|No.|Algorithm|Program Name|Kaggle Score|Remarks|
|---|---|---|---|---|
|1|Survival based on gender classification|Gender.R|0.76555|Baseline program, for comparing all other submissions.|
|2|Logic rules based on broad observations from the training set|Logic_rules.R|0.77033|Better than simple gender classification.|
|3|Naïve Bayes|Nb_titanic.R|0.74641|Looks like a classic case of overfitting! Performed worse than even the simple gender classification!|
|4|Neural network|Nnet_titanic.R|0.77033|Classification rules were eerily similar to my own logic rules, so I was delighted!|
|5|Random forest|Rf_titanic.R|0.77512|I love random forest because it is fairly resistant to overfitting and the depth of tree splits can be controlled. Plus, tree-type classification rules are easily applied to real-world scenarios. Best of all, it is easy to explain even to non-technical folks!|
|6|Decision tree|Tree.R|0.78947|Highest score. Sometimes simple works best! Note, playing with derived variables did NOT boost scores further.|
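To make the first two rows of the table concrete, here is a rough base R sketch of a gender-only baseline and a few hand-written rules layered on top of it. This is not the code from Gender.R or Logic_rules.R; the function names `predict_gender` and `predict_rules`, the specific rules, and the toy passengers are all assumptions made up for illustration, loosely based on the observations listed earlier.

```r
# Toy passengers; values are made up and only illustrate the mechanics.
passengers <- data.frame(
  Sex    = c("female", "female", "male", "male", "male", "female"),
  Pclass = c(1, 3, 1, 2, 3, 3),
  Age    = c(30, 9, 40, 12, 25, 50),
  SibSp  = c(0, 4, 1, 1, 0, 0)
)

# Baseline: predict survival (1) for every woman, death (0) for every man.
predict_gender <- function(df) {
  as.integer(df$Sex == "female")
}

# Hypothetical hand-written rules that override the baseline in two cases.
# which() drops rows where the condition is NA (e.g. a missing age).
predict_rules <- function(df) {
  pred <- predict_gender(df)
  # Girls in third class with 3+ siblings: predict death.
  girls_large_family <- df$Sex == "female" & df$Age < 18 &
    df$Pclass == 3 & df$SibSp >= 3
  pred[which(girls_large_family)] <- 0L
  # Boys in first or second class: predict survival.
  boys_upper_class <- df$Sex == "male" & df$Age < 18 & df$Pclass %in% c(1, 2)
  pred[which(boys_upper_class)] <- 1L
  pred
}

passengers$Gender_pred <- predict_gender(passengers)
passengers$Rules_pred  <- predict_rules(passengers)
print(passengers)
```

The two prediction columns differ only for the passengers the extra rules target, which is why a rule-based submission can only move the score a little away from the gender baseline.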
It is well known that Kaggle competitions are a great way to apply machine learning concepts and algorithms. However, this particular Titanic dataset taught me a few interesting lessons:
- Data exploration is very important. As seen by the gender prediction score, we can make ~76% correct predictions simply by classifying according to gender and ignoring everything else.
- Complex, new-fangled algorithms don’t always work better: the Naïve Bayes model scored below even the simple gender baseline.
- Test, test, test your hypothesis.
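One cheap way to test a hypothesis before spending a Kaggle submission is to hold back part of the labelled data and score predictions against it locally. The sketch below uses base R on synthetic data, not the real Titanic set: the survival probabilities (0.75 for women, 0.20 for men), the 30% holdout share, and the `accuracy` helper are all assumptions chosen for illustration.

```r
set.seed(42)  # fixed seed so the split is reproducible

# Synthetic data, NOT the real Titanic set: survival depends mostly on sex.
n <- 200
sex <- sample(c("female", "male"), n, replace = TRUE)
p_survive <- ifelse(sex == "female", 0.75, 0.20)
toy <- data.frame(Sex = sex, Survived = rbinom(n, size = 1, prob = p_survive))

# Hold out 30% of the rows to play the role of Kaggle's hidden test set.
holdout_idx <- sample(seq_len(n), size = round(0.3 * n))
holdout <- toy[holdout_idx, ]

# Accuracy = share of predictions that match the true labels.
accuracy <- function(predicted, actual) mean(predicted == actual)

# Two simple hypotheses to compare on the held-out rows.
gender_pred   <- as.integer(holdout$Sex == "female")
all_died_pred <- rep(0L, nrow(holdout))

cat("Gender rule accuracy:     ", accuracy(gender_pred, holdout$Survived), "\n")
cat("Predict-all-died accuracy:", accuracy(all_died_pred, holdout$Survived), "\n")
```

The same `accuracy` helper can score any vector of predictions, so a model's local score can be compared against the gender baseline before it goes to the leaderboard.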
So, what was your experience with Kaggle Titanic? Did you use a totally different algorithm to help boost your scores?