Forecasting with Linear Models

DOWNLOAD THE NOTEBOOK

We are going to use Kaggle Titanic competition dataset. In which people are competing to make the BEST estimates. Two datasets are given:

  • train.csv: It contains the attributes (X) of people and Survival status (Y).
  • test.csv: It contains only attributes (X). You have to make predictions $\hat{Y}$ to score your ML.

We will import sklearn.linear_model library which contains linear models to make predictions.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
%matplotlib inline
In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
In [3]:
train.head()
Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

We must use categorical and numeric parameters to make estimations.

  • Pclass: Is a categorical variable. We will convert it into dummies.
  • Name: We cannot use name for predictions.
  • Sex: It is a categorical variable we can use it by converting to dummy.
  • Age: Age is a numerical variable.
  • SibSp: # of siblings / spouses aboard the Titanic
  • Parch: # of parents / children aboard the Titanic
  • Ticket: Ticket number is difficult to use.
  • Fare: Passenger fare.
  • Cabin: It is a categorical variable.
  • Embarked: Port of Embarkation is a categorical variable.
In [4]:
test.head()
Out[4]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S

As you see we do not have Survival status. We will make predictions.

I will specifically pickup some columns and manipulate them to make ready for analysis. Note that all paramaters must be numeric or boolean. We also drop one of the dummy variables for each category. Check out: Perfect Multicollinearity.

In [5]:
def to_x_y(df_original, test = False):
    train = df_original.copy()
    cabin = ~train['Cabin'].isnull()
    cabin.name = 'Cabin'
    if test == False:
        y = train['Survived'].copy()
    else:
        y = False
    x = pd.concat([
        pd.get_dummies(train['Sex']),# we need to drop one of them
        cabin,
        pd.get_dummies(train['Embarked']), # we need to drop one of them
        pd.get_dummies(train['Pclass']), # we need to drop one of them
        train['Age'],
        train['SibSp'],
        train['Parch'],
        train['Fare']
    ]
        , axis=1).copy()
    del x['female'] # reference category for Sex
    del x['C'] #reference category for Embarked
    del x[1] # reference category for Pclass
    return x, y
In [6]:
x, y = to_x_y(train)

Gender data set contains the results(y) and test data contains the x paramaters. So we need to merge those to be like train dataset.

In [7]:
x_test, y_test = to_x_y(train, test=True)
In [8]:
x = x.fillna(x.median())
In [9]:
x_test = x.fillna(x_test.median())

I filled the NaN values with the median. You can use another method to fill NaN values. Even, you can make estimations about NaN values within the x paramaters to fill.


Preparing data for estimation is the most critical part of econometric estimation.

Models

Find the documentations for linear models in the sklearn library.

Logistic Regression

Okay... We will be using several models to estimate Titanic Survival. Let's begin with the Logistic Regression...

In [10]:
log_fit = linear_model.LogisticRegression(max_iter = 300000)
In [11]:
log_fit.fit(x,y)
Out[11]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=300000, multi_class='ovr',
          n_jobs=1, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)
In [12]:
log_fit.score(x,y)
Out[12]:
0.81369248035914699

The first results are not bad as well. Now we can make predictions with our model:

In [13]:
log_fit.predict(x_test)
Out[13]:
array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0,
       0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0,
       1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0,
       1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0,
       1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0,
       1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0], dtype=int64)

I will not include the latter predictions to the notebook.

Linear Regression

In [14]:
ln_fit = linear_model.LinearRegression()
In [15]:
ln_fit.fit(x,y)
Out[15]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [16]:
ln_fit.score(x,y)
Out[16]:
0.40543897139713581

This score is obviously is very bad... Let's try another.

Least Angle Regression model a.k.a. LAR

In [17]:
LAR = linear_model.Lars()
In [18]:
LAR.fit(x,y)
Out[18]:
Lars(copy_X=True, eps=2.2204460492503131e-16, fit_intercept=True,
   fit_path=True, n_nonzero_coefs=500, normalize=True, positive=False,
   precompute='auto', verbose=False)
In [19]:
LAR.score(x,y)
Out[19]:
0.40543897139713581

Similar results with the Linear regression... Let's try next.

LASSO

In [20]:
LASSO = linear_model.Lasso(alpha = 0.1)
In [21]:
LASSO.fit(x,y)
Out[21]:
Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)
In [22]:
LASSO.score(x,y)
Out[22]:
0.13091591689496429

Somehow, LASSO diverges. It looks like I couldn't optimize the parameters. Gives the best results when alpha = 0 but it is no different from OLS.

Ridge Classifier

In [23]:
rige = linear_model.RidgeClassifier(max_iter = 300000)
In [24]:
rige.fit(x,y)
Out[24]:
RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
        max_iter=300000, normalize=False, random_state=None, solver='auto',
        tol=0.001)
In [25]:
rige.score(x,y)
Out[25]:
0.81144781144781142

Ridge Classifier works almost as well as Logistic Regression.

PassiveAggressiveClassifier

In [26]:
PAC = linear_model.PassiveAggressiveClassifier(max_iter=100000)
In [27]:
PAC.fit(x,y)
Out[27]:
PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,
              fit_intercept=True, loss='hinge', max_iter=100000,
              n_iter=None, n_jobs=1, random_state=None, shuffle=True,
              tol=None, verbose=0, warm_start=False)
In [28]:
PAC.score(x,y)
Out[28]:
0.71717171717171713

Not as good as Ridge and Logistic.

SGDClassifier

In [29]:
SGD = linear_model.SGDClassifier(max_iter = 300000)
In [30]:
SGD.fit(x,y)
Out[30]:
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=300000, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False)
In [31]:
SGD.score(x,y)
Out[31]:
0.79012345679012341

Conclusion Remarks

Note that, some methods contains stochastic processes, so each time you make fit the model, results change. From our results Logistic Regressin, Ridge Classifer and SGD classifer give the best results. Optimizing their paramaters we can get better results. But keep in mind that:

  • There is no best model
  • All models have pros and cons
  • Before model selection you need to clear and setup your data as we did. For instance filling NaN values with mediam values might be a bad idea.
  • After simple comparison of results choose one or more models to optimize.