In this article I want to walk you through a practice exercise in R using the famous Titanic dataset. It is a well-known exercise with plenty of documentation online, but I wanted to show you how I approached it in what I think is a simple way: building two predictive models to estimate whether a given passenger would have survived on the Titanic.
Before showing you the videos where I explain the exercise in detail, I want to outline the steps I followed to build these machine learning models. The first thing you need to do to follow along is download the Titanic dataset. It is available from several repositories; to make things easier, I have left you the link to the dataset that I use in this exercise.
Once the CSV file is downloaded to our computer, we choose a working folder where we will do the programming with RStudio. Remember that R is a programming language focused above all on mathematics and statistics, two fundamental pillars of machine learning.
With RStudio open and our workspace set to the folder containing the CSV file, we can start with the first video, which I have titled «Analysis and Data Cleaning».
In this first video, we basically prepare the data for the predictive models we will build afterwards. To do this we must handle the null values we find and convert the qualitative data into quantitative data with some data cleaning techniques.
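As a rough sketch of that first inspection step, the snippet below loads a CSV and counts the missing values in each column. So that it runs anywhere, it writes a tiny made-up file to a temporary path instead of using the real download; the column names and values are only assumptions that mimic the Titanic data.

```r
# Stand-in for the downloaded file: a tiny CSV written to a temporary path.
# The rows are invented for illustration only.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Survived,Sex,Age,Embarked",
  "0,male,22,S",
  "1,female,,C",
  "1,female,26,",
  "0,male,35,S"
), csv_path)

# Read the file, treating empty fields as missing values (NA).
titanic <- read.csv(csv_path, na.strings = c("", "NA"), stringsAsFactors = FALSE)

# Look at the column types and count the missing values per column.
str(titanic)
missing_per_column <- colSums(is.na(titanic))
print(missing_per_column)
```

With the real file you would pass its path to read.csv() instead of csv_path; the count of missing values per column then tells you which variables need cleaning.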
Video 1 - Prepare the Titanic Data
If you followed the first video without problems, you now have a dataset ready for a prediction model. That video showed several techniques for cleaning the data correctly: deleting columns (explanatory variables) that would have little relevance in our model, and filling the null values of an explanatory variable with its mean value so that we avoid deleting those rows.
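To make these two techniques concrete, here is a minimal sketch on a toy data frame. The values are made up, and treating Name as the low-relevance column is just an assumption for the example, not necessarily the choice made in the video.

```r
# Toy data frame standing in for a few Titanic columns (values are made up).
passengers <- data.frame(
  Name = c("A", "B", "C", "D"),
  Age  = c(22, NA, 38, 30),
  Fare = c(7.25, 71.28, 8.05, 13.00),
  stringsAsFactors = FALSE
)

# Drop a column that we assume adds little to the model.
passengers$Name <- NULL

# Replace the missing ages with the mean of the observed ages,
# so the row is kept instead of being deleted.
mean_age <- mean(passengers$Age, na.rm = TRUE)
passengers$Age[is.na(passengers$Age)] <- mean_age

print(passengers)
```

The na.rm = TRUE argument matters here: without it, mean() returns NA as soon as a single value is missing.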
Another technique we saw was the creation of dummy variables: a qualitative variable, such as the sex of the passenger or the port of embarkation, is converted into Boolean (0/1) values so that it can enter the model as a number.
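One possible way to build such dummy variables by hand in base R is sketched below. The data is invented, and the new column names (Sex_male, Embarked_C and so on) are hypothetical names chosen for this example.

```r
# Toy data with two qualitative variables (values are made up).
passengers <- data.frame(
  Sex      = c("male", "female", "female", "male"),
  Embarked = c("S", "C", "Q", "S"),
  stringsAsFactors = FALSE
)

# A two-level variable needs only one 0/1 column.
passengers$Sex_male <- as.integer(passengers$Sex == "male")

# A three-level variable gets one 0/1 column per level.
for (port in c("C", "Q", "S")) {
  column_name <- paste0("Embarked_", port)
  passengers[[column_name]] <- as.integer(passengers$Embarked == port)
}

# Remove the original text columns once they are encoded.
passengers$Sex <- NULL
passengers$Embarked <- NULL

print(passengers)
```

When the model includes an intercept, one of the columns of each variable is usually left out, because it can be deduced from the others.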
Now, in this second video, we explain the logistic regression model, which helps us classify an input into one of two options, 1 or 0 (the output).
This model suits our scenario well, since the value we want to predict is the survival of a passenger: 1 if they survived, 0 if they died. The prediction is obtained by passing a weighted sum of our explanatory variables (age, fare, port of embarkation, siblings...) through a sigmoid function, which we explain in the following video.
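To give an idea of what the sigmoid does, the sketch below computes p = 1 / (1 + e^(-z)), where z is a weighted sum of the explanatory variables. The coefficients are invented for illustration; they are not the ones fitted in the video.

```r
# Sigmoid function: squeezes any real number into the interval (0, 1).
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients, chosen only for this example.
intercept <- 1.0
coef_age  <- -0.03
coef_male <- -2.5

# A 30-year-old male passenger (Sex_male = 1).
z <- intercept + coef_age * 30 + coef_male * 1
p <- sigmoid(z)

# Classify as survivor (1) when the probability exceeds 0.5.
predicted <- as.integer(p > 0.5)

print(p)
print(predicted)
```

In R, a model of this kind is typically fitted with glm() and family = binomial, which estimates the coefficients from the data instead of setting them by hand.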
Video 2 - Logistic Regression Model
Now that we have our first prediction model and have checked its effectiveness, which is over 90%, we can build a second model and compare the two.
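One common way to measure the effectiveness of a classifier is its accuracy, the share of passengers it labels correctly. The sketch below computes it on made-up vectors; I am assuming accuracy as the measure here, and the numbers are not the results from the video.

```r
# Made-up true outcomes and model predictions for ten passengers.
actual    <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 0, 0, 0, 1, 0)

# Confusion matrix: rows are the real outcomes, columns the predictions.
confusion <- table(Actual = actual, Predicted = predicted)
print(confusion)

# Accuracy: proportion of predictions that match the real outcome.
accuracy <- mean(predicted == actual)
print(accuracy)
```

The same calculation applied to both models, ideally on passengers that were not used to fit them, gives a fair basis for comparing the two.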
In this third video we show a model based on a decision tree. It is one of the most widely used models and probably one of the easiest to understand without going into much mathematical detail.
A decision tree in machine learning is a tree structure similar to a flowchart, where each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome. To evaluate these decision rules, it uses various measures, the best known being the Gini index and information gain, which is based on entropy.
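As an illustration of these two measures, the sketch below computes the Gini index, the entropy, and the information gain of a split on a toy sample. The helper functions gini, entropy and information_gain are written for this example only, and the data is made up.

```r
# Gini index of a set of labels: 0 means the node is pure.
gini <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)
  1 - sum(p^2)
}

# Entropy (in bits) of a set of labels: 0 means the node is pure.
entropy <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain: entropy before the split minus the
# weighted average entropy of the groups after the split.
information_gain <- function(labels, split) {
  weights <- table(split) / length(split)
  child_entropy <- tapply(labels, split, entropy)
  entropy(labels) -
    sum(as.numeric(weights) * as.numeric(child_entropy[names(weights)]))
}

# Toy sample: survival and sex of eight passengers (made up).
survived <- c(1, 1, 1, 0, 0, 0, 0, 1)
sex <- c("female", "female", "female", "male", "male", "male", "male", "male")

print(gini(survived))
print(entropy(survived))
print(information_gain(survived, sex))
```

A tree-building algorithm evaluates candidate rules in this way and keeps the one that leaves the resulting groups as pure as possible.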
Video 3 - Decision tree
Finally, we can conclude that our decision tree model is slightly better than our model based on logistic regression. This concludes our R practice with the Titanic dataset, a not very complicated way to enter the world of machine learning.