We are going to use the Kaggle Titanic competition dataset, in which people compete to make the best predictions. Two datasets are given: train.csv and test.csv.
We will import the sklearn.linear_model module, which contains the linear models we will use to make predictions.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
%matplotlib inline
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.head()
We must use both categorical and numeric parameters to make our estimates.
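To see which columns are categorical, which are numeric, and where values are missing, we can inspect the frame first:

train.info()  # column dtypes and non-null counts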
test.head()
As you can see, the test set does not include the Survived status; that is what we will predict.
I will pick out some columns and manipulate them to make them ready for analysis. Note that all parameters must be numeric or boolean. We also drop one of the dummy variables for each category. Check out: Perfect Multicollinearity.
def to_x_y(df_original, test=False):
    train = df_original.copy()
    cabin = ~train['Cabin'].isnull()  # boolean: does the passenger have a cabin number?
    cabin.name = 'Cabin'
    if test == False:
        y = train['Survived'].copy()
    else:
        y = False  # the test set has no labels
    x = pd.concat([
        pd.get_dummies(train['Sex']),       # we need to drop one of them
        cabin,
        pd.get_dummies(train['Embarked']),  # we need to drop one of them
        pd.get_dummies(train['Pclass']),    # we need to drop one of them
        train['Age'],
        train['SibSp'],
        train['Parch'],
        train['Fare']
    ], axis=1).copy()
    del x['female']  # reference category for Sex
    del x['C']       # reference category for Embarked
    del x[1]         # reference category for Pclass
    return x, y
x, y = to_x_y(train)
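As an aside, pandas can drop the reference category for you: pd.get_dummies accepts drop_first=True, which sidesteps perfect multicollinearity without the manual deletes above. A minimal sketch:

pd.get_dummies(train['Embarked'], drop_first=True)  # drops the first category (here 'C'), matching the manual del above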
The gender submission data set contains the results (y), and the test data contains the x parameters, so we would need to merge them to mirror the structure of the train dataset.
x_test, y_test = to_x_y(test, test=True)
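For reference, a minimal sketch of that merge, assuming Kaggle's sample file is named gender_submission.csv (note its Survived column is the gender-based baseline submission, not ground truth):

gender = pd.read_csv('gender_submission.csv')
y_test = test.merge(gender, on='PassengerId')['Survived']  # baseline labels, not true outcomes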
x = x.fillna(x.median())
x_test = x_test.fillna(x_test.median())
I filled the NaN values with the median. You can use another method to fill them; you could even estimate the missing values from the other x parameters.
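For example, a hedged sketch using sklearn's IterativeImputer (an experimental API) to estimate missing values from the other columns; you would run this in place of the median fill above:

from sklearn.experimental import enable_iterative_imputer  # required to expose IterativeImputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(random_state=0)  # models each column as a function of the others
x_imputed = pd.DataFrame(imputer.fit_transform(x), columns=x.columns, index=x.index)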
Preparing the data is the most critical part of econometric estimation.
You can find the documentation for the linear models in the sklearn library.
Okay... We will be using several models to estimate Titanic survival. Let's begin with Logistic Regression...
log_fit = linear_model.LogisticRegression(max_iter = 300000)
log_fit.fit(x,y)
log_fit.score(x,y)
The first results are not bad at all. Now we can make predictions with our model:
log_fit.predict(x_test)
I will not include these predictions in the notebook.
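If you want to submit to Kaggle, a minimal sketch of the expected submission file (two columns, PassengerId and Survived):

submission = pd.DataFrame({
    'PassengerId': test['PassengerId'],
    'Survived': log_fit.predict(x_test).astype(int)
})
submission.to_csv('submission.csv', index=False)  # Kaggle expects no index column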
ln_fit = linear_model.LinearRegression()
ln_fit.fit(x,y)
ln_fit.score(x,y)
This score is obviously very bad... Note, though, that for regressors .score returns R-squared rather than classification accuracy, so it is not directly comparable to the classifier scores above. Let's try another.
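To get a comparable number, a quick sketch: threshold the continuous predictions at 0.5 and compute accuracy:

from sklearn.metrics import accuracy_score
accuracy_score(y, (ln_fit.predict(x) > 0.5).astype(int))  # turn regression output into 0/1 labels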
LAR = linear_model.Lars()
LAR.fit(x,y)
LAR.score(x,y)
Similar results to Linear Regression... Let's try the next one.
LASSO = linear_model.Lasso(alpha = 0.1)
LASSO.fit(x,y)
LASSO.score(x,y)
Somehow, LASSO diverges; it looks like I couldn't optimize the parameters. It gives the best results when alpha = 0, but then it is no different from OLS.
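Rather than guessing alpha by hand, a hedged sketch using LassoCV to pick it by cross-validation:

lasso_cv = linear_model.LassoCV(cv=5)  # searches a path of alphas with 5-fold CV
lasso_cv.fit(x, y)
lasso_cv.alpha_, lasso_cv.score(x, y)  # chosen alpha and in-sample R-squared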
ridge = linear_model.RidgeClassifier(max_iter = 300000)
ridge.fit(x,y)
ridge.score(x,y)
Ridge Classifier works almost as well as Logistic Regression.
PAC = linear_model.PassiveAggressiveClassifier(max_iter=100000)
PAC.fit(x,y)
PAC.score(x,y)
Not as good as Ridge and Logistic.
SGD = linear_model.SGDClassifier(max_iter = 300000)
SGD.fit(x,y)
SGD.score(x,y)
Note that some methods contain stochastic processes, so each time you fit the model the results change. From our results, Logistic Regression, Ridge Classifier, and SGD Classifier give the best results. By optimizing their parameters we can get better results. But keep in mind that filling NaN values with median values might be a bad idea.
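Finally, a hedged sketch on those two points: fixing random_state makes the stochastic fits reproducible, and cross-validation gives a less optimistic score than scoring on the training data itself:

from sklearn.model_selection import cross_val_score
sgd = linear_model.SGDClassifier(max_iter = 300000, random_state = 42)  # fixed seed for reproducibility
cross_val_score(sgd, x, y, cv = 5).mean()  # mean accuracy over 5 folds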