Forecasting with Linear Models¶

DOWNLOAD THE NOTEBOOK

We are going to use Kaggle Titanic competition dataset. In which people are competing to make the BEST estimates. Two datasets are given:

train.csv: It contains the attributes (X) of people and Survival status (Y).
test.csv: It contains only attributes (X). You have to make predictions $\hat{Y}$ to score your ML.

We will import sklearn.linear_model library which contains linear models to make predictions.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
%matplotlib inline

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

train.head()

We must use categorical and numeric parameters to make estimations.

Pclass: Is a categorical variable. We will convert it into dummies.
Name: We cannot use name for predictions.
Sex: It is a categorical variable we can use it by converting to dummy.
Age: Age is a numerical variable.
SibSp: # of siblings / spouses aboard the Titanic
Parch: # of parents / children aboard the Titanic
Ticket: Ticket number is difficult to use.
Fare: Passenger fare.
Cabin: It is a categorical variable.
Embarked: Port of Embarkation is a categorical variable.

test.head()

As you see we do not have Survival status. We will make predictions.

I will specifically pickup some columns and manipulate them to make ready for analysis. Note that all paramaters must be numeric or boolean. We also drop one of the dummy variables for each category. Check out: Perfect Multicollinearity.

def to_x_y(df_original, test = False):
    train = df_original.copy()
    cabin = ~train['Cabin'].isnull()
    cabin.name = 'Cabin'
    if test == False:
        y = train['Survived'].copy()
    else:
        y = False
    x = pd.concat([
        pd.get_dummies(train['Sex']),# we need to drop one of them
        cabin,
        pd.get_dummies(train['Embarked']), # we need to drop one of them
        pd.get_dummies(train['Pclass']), # we need to drop one of them
        train['Age'],
        train['SibSp'],
        train['Parch'],
        train['Fare']
    ]
        , axis=1).copy()
    del x['female'] # reference category for Sex
    del x['C'] #reference category for Embarked
    del x[1] # reference category for Pclass
    return x, y

x, y = to_x_y(train)

Gender data set contains the results(y) and test data contains the x paramaters. So we need to merge those to be like train dataset.

x_test, y_test = to_x_y(train, test=True)

x = x.fillna(x.median())

x_test = x.fillna(x_test.median())

I filled the NaN values with the median. You can use another method to fill NaN values. Even, you can make estimations about NaN values within the x paramaters to fill.

Preparing data for estimation is the most critical part of econometric estimation.

Models¶

Find the documentations for linear models in the sklearn library.

Logistic Regression¶

Okay... We will be using several models to estimate Titanic Survival. Let's begin with the Logistic Regression...

log_fit = linear_model.LogisticRegression(max_iter = 300000)

log_fit.fit(x,y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=300000, multi_class='ovr',
          n_jobs=1, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

log_fit.score(x,y)

0.81369248035914699

The first results are not bad as well. Now we can make predictions with our model:

log_fit.predict(x_test)

array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0,
       0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0,
       1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0,
       1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0,
       1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0,
       1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0], dtype=int64)

I will not include the latter predictions to the notebook.

Linear Regression¶

ln_fit = linear_model.LinearRegression()

ln_fit.fit(x,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

ln_fit.score(x,y)

0.40543897139713581

This score is obviously is very bad... Let's try another.

Least Angle Regression model a.k.a. LAR¶

LAR = linear_model.Lars()

LAR.fit(x,y)

Lars(copy_X=True, eps=2.2204460492503131e-16, fit_intercept=True,
   fit_path=True, n_nonzero_coefs=500, normalize=True, positive=False,
   precompute='auto', verbose=False)

LAR.score(x,y)

0.40543897139713581

Similar results with the Linear regression... Let's try next.

LASSO¶

LASSO = linear_model.Lasso(alpha = 0.1)

LASSO.fit(x,y)

Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

LASSO.score(x,y)

0.13091591689496429

Somehow, LASSO diverges. It looks like I couldn't optimize the parameters. Gives the best results when alpha = 0 but it is no different from OLS.

Ridge Classifier¶

rige = linear_model.RidgeClassifier(max_iter = 300000)

rige.fit(x,y)

RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
        max_iter=300000, normalize=False, random_state=None, solver='auto',
        tol=0.001)

rige.score(x,y)

0.81144781144781142

Ridge Classifier works almost as well as Logistic Regression.

PassiveAggressiveClassifier¶

PAC = linear_model.PassiveAggressiveClassifier(max_iter=100000)

PAC.fit(x,y)

PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,
              fit_intercept=True, loss='hinge', max_iter=100000,
              n_iter=None, n_jobs=1, random_state=None, shuffle=True,
              tol=None, verbose=0, warm_start=False)

PAC.score(x,y)

0.71717171717171713

Not as good as Ridge and Logistic.

SGDClassifier¶

SGD = linear_model.SGDClassifier(max_iter = 300000)

SGD.fit(x,y)

SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=300000, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False)

SGD.score(x,y)

0.79012345679012341

Conclusion Remarks¶

Note that, some methods contains stochastic processes, so each time you make fit the model, results change. From our results Logistic Regressin, Ridge Classifer and SGD classifer give the best results. Optimizing their paramaters we can get better results. But keep in mind that:

There is no best model
All models have pros and cons
Before model selection you need to clear and setup your data as we did. For instance filling NaN values with mediam values might be a bad idea.
After simple comparison of results choose one or more models to optimize.

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

	PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	892	3	Kelly, Mr. James	male	34.5	0	0	330911	7.8292	NaN	Q
1	893	3	Wilkes, Mrs. James (Ellen Needs)	female	47.0	1	0	363272	7.0000	NaN	S
2	894	2	Myles, Mr. Thomas Francis	male	62.0	0	0	240276	9.6875	NaN	Q
3	895	3	Wirz, Mr. Albert	male	27.0	0	0	315154	8.6625	NaN	S
4	896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	female	22.0	1	1	3101298	12.2875	NaN	S