{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Forecasting with Linear Models"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[DOWNLOAD THE NOTEBOOK](notebooks/titanic.ipynb)\n",
"\n",
"We are going to use the [Kaggle Titanic](https://www.kaggle.com/c/titanic) competition dataset, on which people compete to make the best survival predictions. Two datasets are given:\n",
"- [train.csv](assets/titanic/train.csv): contains the attributes (X) of the passengers and their survival status (Y).\n",
"- [test.csv](assets/titanic/test.csv): contains only the attributes (X). You have to make predictions $\\hat{Y}$, which Kaggle then scores.\n",
"\n",
"We will import the `sklearn.linear_model` module, which contains the linear models we use to make predictions."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from sklearn import linear_model\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"train = pd.read_csv('train.csv')\n",
"test = pd.read_csv('test.csv')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>PassengerId</th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Ticket</th><th>Fare</th><th>Cabin</th><th>Embarked</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>A/5 21171</td><td>7.2500</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>1</th><td>2</td><td>1</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>PC 17599</td><td>71.2833</td><td>C85</td><td>C</td></tr>\n",
"    <tr><th>2</th><td>3</td><td>1</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>STON/O2. 3101282</td><td>7.9250</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>3</th><td>4</td><td>1</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>113803</td><td>53.1000</td><td>C123</td><td>S</td></tr>\n",
"    <tr><th>4</th><td>5</td><td>0</td><td>3</td><td>Allen, Mr. William Henry</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>373450</td><td>8.0500</td><td>NaN</td><td>S</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
"3 0 113803 53.1000 C123 S \n",
"4 0 373450 8.0500 NaN S "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use both `categorical` and `numeric` variables to make estimations:\n",
"- Pclass: Ticket class, a categorical variable. We will convert it into dummies.\n",
"- Name: We cannot use the name directly for predictions.\n",
"- Sex: A categorical variable; we can use it by converting it to a dummy.\n",
"- Age: A numerical variable.\n",
"- SibSp: Number of siblings / spouses aboard the Titanic.\n",
"- Parch: Number of parents / children aboard the Titanic.\n",
"- Ticket: The ticket number is difficult to use.\n",
"- Fare: Passenger fare, a numerical variable.\n",
"- Cabin: A categorical variable. Here we only use whether a cabin is recorded.\n",
"- Embarked: Port of embarkation, a categorical variable."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>PassengerId</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Ticket</th><th>Fare</th><th>Cabin</th><th>Embarked</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>892</td><td>3</td><td>Kelly, Mr. James</td><td>male</td><td>34.5</td><td>0</td><td>0</td><td>330911</td><td>7.8292</td><td>NaN</td><td>Q</td></tr>\n",
"    <tr><th>1</th><td>893</td><td>3</td><td>Wilkes, Mrs. James (Ellen Needs)</td><td>female</td><td>47.0</td><td>1</td><td>0</td><td>363272</td><td>7.0000</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>2</th><td>894</td><td>2</td><td>Myles, Mr. Thomas Francis</td><td>male</td><td>62.0</td><td>0</td><td>0</td><td>240276</td><td>9.6875</td><td>NaN</td><td>Q</td></tr>\n",
"    <tr><th>3</th><td>895</td><td>3</td><td>Wirz, Mr. Albert</td><td>male</td><td>27.0</td><td>0</td><td>0</td><td>315154</td><td>8.6625</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>4</th><td>896</td><td>3</td><td>Hirvonen, Mrs. Alexander (Helga E Lindqvist)</td><td>female</td><td>22.0</td><td>1</td><td>1</td><td>3101298</td><td>12.2875</td><td>NaN</td><td>S</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Pclass Name Sex \\\n",
"0 892 3 Kelly, Mr. James male \n",
"1 893 3 Wilkes, Mrs. James (Ellen Needs) female \n",
"2 894 2 Myles, Mr. Thomas Francis male \n",
"3 895 3 Wirz, Mr. Albert male \n",
"4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female \n",
"\n",
" Age SibSp Parch Ticket Fare Cabin Embarked \n",
"0 34.5 0 0 330911 7.8292 NaN Q \n",
"1 47.0 1 0 363272 7.0000 NaN S \n",
"2 62.0 0 0 240276 9.6875 NaN Q \n",
"3 27.0 0 0 315154 8.6625 NaN S \n",
"4 22.0 1 1 3101298 12.2875 NaN S "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, the test set has no `Survived` column. We will predict it."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I will pick specific columns and transform them to make them ready for analysis. Note that all parameters must be numeric or boolean. We also drop one of the dummy variables for each category: the full set of dummies for a category always sums to one, which duplicates the intercept. Check out: [Perfect Multicollinearity](https://www.google.com.tr/search?q=perfect+multicollinearity&oq=perfect+mult&aqs=chrome.1.69i57j0l5.2646j0j7&sourceid=chrome&ie=UTF-8)."
]
},
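{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the dummy-variable issue, on a tiny made-up frame rather than the Titanic data (the names `toy`, `full` and `reduced` are hypothetical). The full set of dummies for a category sums to one in every row, which is why one level is dropped. `drop_first=True` is one pandas option for this; it drops the first level of each category, which may differ from a reference category picked by hand.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Tiny made-up sample, only for illustration\n",
"toy = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male'],\n",
"                    'Embarked': ['S', 'C', 'Q', 'S']})\n",
"\n",
"# One 0/1 column per level of Sex\n",
"full = pd.get_dummies(toy['Sex'], dtype=int)\n",
"# Every row has exactly one active dummy, so the columns sum to 1\n",
"row_sums = full.sum(axis=1)\n",
"\n",
"# drop_first=True removes the first level of each encoded column\n",
"reduced = pd.get_dummies(toy, columns=['Sex', 'Embarked'],\n",
"                         drop_first=True, dtype=int)\n",
"print(list(reduced.columns))\n",
"```"
]
},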
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def to_x_y(df_original, test=False):\n",
"    # Build the feature matrix x; y is None for the unlabeled test set\n",
"    df = df_original.copy()\n",
"    # Cabin becomes a boolean flag: True if a cabin is recorded\n",
"    cabin = df['Cabin'].notnull()\n",
"    cabin.name = 'Cabin'\n",
"    y = None if test else df['Survived'].copy()\n",
"    x = pd.concat([\n",
"        pd.get_dummies(df['Sex']),       # female, male\n",
"        cabin,\n",
"        pd.get_dummies(df['Embarked']),  # C, Q, S\n",
"        pd.get_dummies(df['Pclass']),    # 1, 2, 3\n",
"        df['Age'],\n",
"        df['SibSp'],\n",
"        df['Parch'],\n",
"        df['Fare'],\n",
"    ], axis=1)\n",
"    # Drop one dummy per category to avoid perfect multicollinearity\n",
"    del x['female']  # reference category for Sex\n",
"    del x['C']       # reference category for Embarked\n",
"    del x[1]         # reference category for Pclass\n",
"    return x, y"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"x, y = to_x_y(train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The gender data set contains the results (y) and the test data contains the x parameters, so the two would have to be merged to look like the train dataset. Below we only prepare the x parameters of the test data."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"x_test, y_test = to_x_y(test, test=True)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"x = x.fillna(x.median())"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"x_test = x_test.fillna(x_test.median())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I filled the NaN values with the median of each column. You can use another method to fill NaN values. You can even estimate the missing values from the other `x` parameters."
]
},
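{
"cell_type": "markdown",
"metadata": {},
"source": [
"One common variation, sketched here on made-up numbers, is to compute the medians on the training features and reuse them for the test features, so that both sets are filled with the same values. The frame names below are hypothetical and are not the `x` and `x_test` of this notebook.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Made-up values, only for illustration\n",
"x_train_toy = pd.DataFrame({'Age': [22.0, np.nan, 35.0],\n",
"                            'Fare': [7.25, 8.05, 53.1]})\n",
"x_test_toy = pd.DataFrame({'Age': [np.nan, 47.0],\n",
"                           'Fare': [np.nan, 7.0]})\n",
"\n",
"# Medians are computed on the training features only\n",
"train_medians = x_train_toy.median()\n",
"\n",
"# fillna with a Series fills each column with its matching median\n",
"x_train_filled = x_train_toy.fillna(train_medians)\n",
"x_test_filled = x_test_toy.fillna(train_medians)\n",
"print(x_test_filled)\n",
"```"
]
},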
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"*Preparing data for estimation is the most critical part of econometric estimation.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Models"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Find the documentation for [linear models](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model) in the `sklearn` library."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Logistic Regression"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Okay... We will be using several models to estimate Titanic Survival. Let's begin with the Logistic Regression..."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"log_fit = linear_model.LogisticRegression(max_iter = 300000)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
" intercept_scaling=1, max_iter=300000, multi_class='ovr',\n",
" n_jobs=1, penalty='l2', random_state=None, solver='liblinear',\n",
" tol=0.0001, verbose=0, warm_start=False)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"log_fit.fit(x,y)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.81369248035914699"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"log_fit.score(x,y)"
]
},
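{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first results are not bad. Keep in mind that `score` here is the mean accuracy on the training data itself, so it is an optimistic estimate. Now we can make predictions with our model:"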
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1,\n",
" 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0,\n",
" 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1,\n",
" 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,\n",
" 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,\n",
" 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,\n",
" 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,\n",
" 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0,\n",
" 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0,\n",
" 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0,\n",
" 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0,\n",
" 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1,\n",
" 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0,\n",
" 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,\n",
" 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0,\n",
" 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0,\n",
" 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1,\n",
" 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0,\n",
" 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,\n",
" 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,\n",
" 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0,\n",
" 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0,\n",
" 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1,\n",
" 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0,\n",
" 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0,\n",
" 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,\n",
" 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1,\n",
" 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0,\n",
" 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,\n",
" 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,\n",
" 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,\n",
" 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0], dtype=int64)"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"log_fit.predict(x_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I will not include the predictions of the later models in the notebook."
]
},
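{
"cell_type": "markdown",
"metadata": {},
"source": [
"The accuracy above is computed on the same rows the model was fitted on. One way to get a less optimistic estimate is to hold out part of the labeled data. This is only a sketch on synthetic data: `x_toy` and `y_toy` are made-up stand-ins for the prepared features and labels, and the split size is an arbitrary choice.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Synthetic stand-in for the prepared features and labels\n",
"rng = np.random.RandomState(0)\n",
"x_toy = rng.normal(size=(200, 3))\n",
"y_toy = (x_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)\n",
"\n",
"# Keep 25% of the labeled rows aside for validation\n",
"x_fit, x_val, y_fit, y_val = train_test_split(\n",
"    x_toy, y_toy, test_size=0.25, random_state=0)\n",
"\n",
"model = LogisticRegression(max_iter=1000)\n",
"model.fit(x_fit, y_fit)\n",
"\n",
"train_acc = model.score(x_fit, y_fit)  # accuracy on rows the model has seen\n",
"val_acc = model.score(x_val, y_val)    # accuracy on held-out rows\n",
"print(train_acc, val_acc)\n",
"```"
]
},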
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Linear Regression"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"ln_fit = linear_model.LinearRegression()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ln_fit.fit(x,y)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.40543897139713581"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ln_fit.score(x,y)"
]
},
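{
"cell_type": "markdown",
"metadata": {},
"source": [
"This score looks very bad, but it is not comparable with the previous one: for regressors, `score` returns the coefficient of determination $R^2$, not the classification accuracy. Let's try another."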
]
},
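{
"cell_type": "markdown",
"metadata": {},
"source": [
"To compare a regression model with the classifiers, its continuous predictions can be turned into 0/1 labels first. A rough sketch on synthetic data (`x_toy` and `y_toy` are made up, and the 0.5 cut-off is an assumption, not something tuned here):\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"# Synthetic stand-in for the prepared features and labels\n",
"rng = np.random.RandomState(0)\n",
"x_toy = rng.normal(size=(200, 3))\n",
"y_toy = (x_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)\n",
"\n",
"reg = LinearRegression().fit(x_toy, y_toy)\n",
"\n",
"r2 = reg.score(x_toy, y_toy)  # coefficient of determination\n",
"# Turn the continuous predictions into 0/1 labels with a 0.5 cut-off\n",
"labels = (reg.predict(x_toy) >= 0.5).astype(int)\n",
"acc = accuracy_score(y_toy, labels)  # share of correct labels\n",
"print(r2, acc)\n",
"```\n",
"\n",
"The two numbers measure different things, so a low $R^2$ does not by itself mean a low accuracy."
]
},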
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Least Angle Regression model a.k.a. LAR"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"LAR = linear_model.Lars()"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Lars(copy_X=True, eps=2.2204460492503131e-16, fit_intercept=True,\n",
" fit_path=True, n_nonzero_coefs=500, normalize=True, positive=False,\n",
" precompute='auto', verbose=False)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"LAR.fit(x,y)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.40543897139713581"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"LAR.score(x,y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"LAR gives the same score as Linear Regression... Let's try the next one."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## LASSO"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"LASSO = linear_model.Lasso(alpha = 0.1)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,\n",
" normalize=False, positive=False, precompute=False, random_state=None,\n",
" selection='cyclic', tol=0.0001, warm_start=False)"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"LASSO.fit(x,y)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.13091591689496429"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"LASSO.score(x,y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"LASSO does much worse here. With `alpha = 0.1` the L1 penalty shrinks the coefficients strongly, and I couldn't find a better value: the training score is highest when `alpha = 0`, but then the model is no different from OLS."
]
},
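{
"cell_type": "markdown",
"metadata": {},
"source": [
"The effect of `alpha` can be checked with a small sweep. The sketch below uses synthetic data (`x_toy` and `y_toy` are made up) and an arbitrary grid of values; it records the training $R^2$ and the number of non-zero coefficients for each `alpha`.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.linear_model import Lasso\n",
"\n",
"# Synthetic stand-in: only the first feature carries signal\n",
"rng = np.random.RandomState(0)\n",
"x_toy = rng.normal(size=(200, 5))\n",
"y_toy = (x_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(float)\n",
"\n",
"scores = {}\n",
"nonzero = {}\n",
"for alpha in [0.001, 0.01, 0.1, 1.0]:\n",
"    model = Lasso(alpha=alpha).fit(x_toy, y_toy)\n",
"    scores[alpha] = model.score(x_toy, y_toy)       # training R^2\n",
"    nonzero[alpha] = int(np.sum(model.coef_ != 0))  # surviving coefficients\n",
"    print(alpha, round(scores[alpha], 3), nonzero[alpha])\n",
"```\n",
"\n",
"On the training data a larger penalty can only lower the score, so a sweep like this is more useful when the score is computed on held-out rows."
]
},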
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ridge Classifier"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"rige = linear_model.RidgeClassifier(max_iter = 300000)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,\n",
" max_iter=300000, normalize=False, random_state=None, solver='auto',\n",
" tol=0.001)"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rige.fit(x,y)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.81144781144781142"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rige.score(x,y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ridge Classifier works almost as well as Logistic Regression."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## PassiveAggressiveClassifier"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"PAC = linear_model.PassiveAggressiveClassifier(max_iter=100000)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,\n",
" fit_intercept=True, loss='hinge', max_iter=100000,\n",
" n_iter=None, n_jobs=1, random_state=None, shuffle=True,\n",
" tol=None, verbose=0, warm_start=False)"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"PAC.fit(x,y)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.71717171717171713"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"PAC.score(x,y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Not as good as Ridge and Logistic."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## SGDClassifier"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"SGD = linear_model.SGDClassifier(max_iter = 300000)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,\n",
" eta0=0.0, fit_intercept=True, l1_ratio=0.15,\n",
" learning_rate='optimal', loss='hinge', max_iter=300000, n_iter=None,\n",
" n_jobs=1, penalty='l2', power_t=0.5, random_state=None,\n",
" shuffle=True, tol=None, verbose=0, warm_start=False)"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"SGD.fit(x,y)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.79012345679012341"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"SGD.score(x,y)"
]
},
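{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fits based on stochastic gradient descent shuffle the rows, so repeated runs can give slightly different scores. A small sketch on synthetic data (`x_toy` and `y_toy` are made up, and the `max_iter` and `tol` values are arbitrary) of fixing `random_state` so that a run can be repeated:\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.linear_model import SGDClassifier\n",
"\n",
"# Synthetic stand-in for the prepared features and labels\n",
"rng = np.random.RandomState(0)\n",
"x_toy = rng.normal(size=(200, 3))\n",
"y_toy = (x_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)\n",
"\n",
"# Same seed and same data for both fits\n",
"first = SGDClassifier(max_iter=1000, tol=1e-3, random_state=0).fit(x_toy, y_toy)\n",
"second = SGDClassifier(max_iter=1000, tol=1e-3, random_state=0).fit(x_toy, y_toy)\n",
"\n",
"# With a fixed seed the learned coefficients should coincide\n",
"same_coefficients = np.allclose(first.coef_, second.coef_)\n",
"print(same_coefficients, first.score(x_toy, y_toy))\n",
"```"
]
},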
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Concluding Remarks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that some methods involve stochastic processes, so the results change each time you fit the model. From our results, Logistic Regression, Ridge Classifier and SGD Classifier give the best results. By optimizing their parameters we can get better results. But keep in mind that:\n",
"\n",
"- There is no best model.\n",
"- All models have pros and cons.\n",
"- Before model selection you need to clean and set up your data as we did. For instance, filling `NaN` values with median values might be a bad idea.\n",
"- After a simple comparison of the results, choose one or more models to optimize."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}