The framework of this course and some of the examples are taken from Quantopian. The Quantopian lectures were written for a different Python version than the one we use in this course, so there are some syntax differences.
So far we have learnt the basic concepts of these libraries in the NumPy Introduction course. Now we will introduce another major library used for statistics, big data, and similar work. We will use pandas together with numpy:
import numpy as np
import pandas as pd
We also add a magic command so that matplotlib plots are displayed inline in the notebook. We will introduce the matplotlib library itself in the Data Visualization course later.
%matplotlib inline
Let's use the np.random.normal function to create a random dataset containing 100 values, with mean $0$ and standard deviation $0.5$.
dist = np.random.normal(0, 0.5, (100,))
print(dist.shape)
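As a side note (not part of the original course): random draws differ on every run, so the numbers and plots below will not match yours exactly. If you want reproducible results, you can seed the generator first:

```python
import numpy as np

np.random.seed(42)  # fix the seed so the draws are reproducible
a = np.random.normal(0, 0.5, (100,))

np.random.seed(42)  # re-seeding produces the identical sample again
b = np.random.normal(0, 0.5, (100,))

print(np.array_equal(a, b))  # → True
```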
Using the pandas library we can convert the dist array into a Series:
s = pd.Series(dist)
s.head() # First 5 elements
s.head(10) # First 10 elements
s.tail() # Last 5 elements
s.tail(10) # Last 10 elements
s.plot()
This is our Series data. Unlike plain arrays, Series come with extra functionality, which makes them more convenient for data analysis, although arrays and Series have different uses.
print(s.name)
The series does not have a name yet. Let's define one:
s.name = "Stock returns"
print(s.name)
print(s.index)
Assume this is the return data of a stock. Then the index should be dates, at business-day frequency:
new_index = pd.date_range("2016-01-01", periods=len(s), freq="B")
Here 2016-01-01 is the start date, periods=len(s) sets the number of entries, and freq="B" stands for business days (weekdays only).
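As a quick check (a minimal sketch, not from the original notebook), freq="B" skips weekends. 2016-01-01 was a Friday, so the index jumps straight to the following Monday:

```python
import pandas as pd

# 2016-01-01 is a Friday; with freq="B" the next entries jump to Monday
idx = pd.date_range("2016-01-01", periods=5, freq="B")
print(idx.strftime("%Y-%m-%d").tolist())
# → ['2016-01-01', '2016-01-04', '2016-01-05', '2016-01-06', '2016-01-07']
```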
s.index = new_index
We change the series index to the new_index created above. Let's plot it:
s.plot(title=s.name)
Well... this is a more proper plot compared to the first one. Pay attention to the plotting options used.
Assume these values are percentages, like 1%. What happened by the last day? We need to transform them as $s \times 0.01 + 1$ to turn each return into a growth factor. After that, we take the cumulative product:
s_cum = (s*0.01 + 1).cumprod()
s_cum.plot(title=s_cum.name)
It looks like within 100 business days our stock price changed by this many percent:
print((s_cum.iloc[-1] - 1)*100)
You can access elements by position with iloc[start:end:step]. By default step=1.
print("First element of the series: ", s.iloc[0])
print("Last element of the series: ", s.iloc[len(s)-1])
print("(or) Last element of the series: ", s.iloc[-1])
You can make slicing:
print(s.iloc[:10])
print(s.iloc[:100:10])
You can access series elements by their index label using loc:
print(s.loc['2016-01-01'])
print(s.loc['2016-01-01':'2016-01-10']) # Between dates
Which dates have negative returns?
print((s < 0).head())
How do we access those values?
print(s[s < 0].head())
print(s[s > 1])
The output above contains only the values greater than 1 — just a few of them, if any.
print(s[(s > 1) | (s < -1)])
Or simply:
print(s[np.abs(s) > 1])
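Boolean masks can also be combined with & (and) and inverted with ~ (not); here is a small sketch on a toy series (not part of the original data):

```python
import pandas as pd

t = pd.Series([-2.0, -0.5, 0.3, 1.5, 2.5])

# values between -1 and 1 (note the parentheses around each condition)
print(t[(t > -1) & (t < 1)].tolist())     # → [-0.5, 0.3]

# ~ inverts a mask: everything NOT between -1 and 1
print(t[~((t > -1) & (t < 1))].tolist())  # → [-2.0, 1.5, 2.5]
```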
print(s.mean()) # mean
print(s.std()) # standard deviation
print(s.max()) # max value
print(s.min()) # min value
print(s.median()) # median value
You can get all of these statistics at once with the describe function:
s.describe()
print(s.abs().head()) # Returns absolute values
new_series = pd.Series([1,5,np.nan,2,6,8,2,np.nan])
new_series.plot()
We can drop null values:
new_series = new_series.dropna()
new_series.plot()
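Dropping is not the only option; fillna can replace missing values instead. A small sketch on a fresh toy series (an aside, not from the original notebook):

```python
import numpy as np
import pandas as pd

t = pd.Series([1, 5, np.nan, 2, 6, 8, 2, np.nan])

print(t.fillna(0).tolist())  # replace each NaN with a constant
# → [1.0, 5.0, 0.0, 2.0, 6.0, 8.0, 2.0, 0.0]

print(t.ffill().tolist())    # carry the previous value forward
# → [1.0, 5.0, 5.0, 2.0, 6.0, 8.0, 2.0, 2.0]
```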
We can carry out many of the Series operations within a pandas.DataFrame as well:
dict_data = {
'a' : [1, 2, 3, 4, 5],
'b' : ['L', 'K', 'J', 'M', 'Z'],
'c' : np.random.normal(0, 1, 5)
}
print(dict_data)
df = pd.DataFrame(dict_data, index=pd.date_range('2016-01-01', periods=5))
print(df)
What happened here is that we defined three columns named a, b, and c. Each could have been converted to a pandas.Series separately; instead we created a DataFrame to combine all attributes. Each column of the DataFrame is a pandas.Series.
print(type(df['a']))
print(type(df['b']))
print(type(df['c']))
But together:
print(type(df))
You can define a DataFrame in many ways.
You do not necessarily need to type the name of a column; you can also use the column's position. The first column is:
df.iloc[:,0].head()
s_1 = pd.Series([2, 4, 6, 8, 10], name='Evens')
s_2 = pd.Series([1, 3, 5, 7, 9], name="Odds")
numbers = pd.concat([s_1, s_2], axis=1)
print(numbers)
Column names:
print(numbers.columns)
Change column names:
numbers.columns = ['Shmevens', 'Shmodds']
print(numbers)
Most of the attributes of a pandas.Series are also valid for DataFrames:
print(numbers.index)
numbers.index = pd.date_range("2016-01-01", periods=len(numbers))
print(numbers)
print(numbers.values)
Let's create a random DataFrame, similar to what we did with the series:
dist = np.random.normal(0, 0.5, (100,10))
df = pd.DataFrame(dist)
df.head()
df.index = pd.date_range("2016-01-01", periods=len(df.index), freq='B')
df.columns = ['Stck {}'.format(str(i)) for i in range(1,11)]
df.head()
Now we have the returns of 10 stocks within one DataFrame. Let's compute the cumulative product of the daily growth factors, as before:
df_cum = df*0.01 + 1
df_cum = df_cum.cumprod()
df_cum.head()
df_cum.plot(figsize = (10,5))
As you can see, by the last day some stocks made a profit and some made a loss. Locating values works the same as with Series. The last day's cumulative values are:
df_cum.iloc[-1]
BUT this is a two-dimensional shape, so you need two parameters to locate a specific value:
df_cum.iloc[-1,2]
For example, we may only care about which stocks made a profit:
df_cum.iloc[-1]>1
You can add a new column simply by assigning to a new column name:
df['new'] = 1
df.head()
Deleting a column is very easy:
del df['new']
Sorting by columns:
df.sort_values(by='Stck 2').head()
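sort_values also supports descending order and multiple sort keys; a small sketch on a toy frame with hypothetical columns (not the stock data above):

```python
import pandas as pd

df2 = pd.DataFrame({'x': [3, 1, 2, 1], 'y': [9, 7, 8, 6]})

# descending order by one column
print(df2.sort_values(by='x', ascending=False)['x'].tolist())  # → [3, 2, 1, 1]

# sort by 'x' first, break ties with 'y'
print(df2.sort_values(by=['x', 'y'])['y'].tolist())            # → [6, 7, 8, 9]
```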
Now we will learn how to read files and save a dataframe to a file. pandas can read most common file formats and databases via the pd.read_* family of functions:
movies = pd.read_csv('movies.csv')
Be careful! Your notebook file and the movies.csv file must be in the same folder, or you must specify the file's full path.
movies.head()
movies.tail()
You can download this dataset (movies.csv) from MovieLens.
We will do some operations with this dataset. For example, the title contains the year of the movie. It would look nicer if we separate it out into a new column. We will use the Series.str operations; see the documentation for details.
You can do this operation in many different ways. The first way is to use the str.extract method to find the years and define a new column:
movies['year'] = movies['title'].str.extract(r'(\d{4})')
movies.tail()
We use a RegEx expression: (\d{4}). The parentheses represent a group, so we could create two or more different columns by extracting with two or more groups. \d represents a digit, and \d{4} means 4 consecutive digits. You should check the regular-expression documentation for further use.
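For instance (a toy illustration, not the movies data), two parenthesised groups produce two columns at once; here the groups are named via the (?P<...>) syntax:

```python
import pandas as pd

t = pd.Series(['Toy Story (1995)', 'Heat (1995)'])

# two groups → a DataFrame with two columns, named via (?P<...>)
parts = t.str.extract(r'(?P<name>.+) \((?P<year>\d{4})\)')
print(parts['name'].tolist())  # → ['Toy Story', 'Heat']
print(parts['year'].tolist())  # → ['1995', '1995']
```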
Let's see how many NaN
years we have:
movies[movies['year'].isnull()]
Only four of them do not have the year. If you were doing serious data analysis you could fill these values manually by searching the internet. For now, we will drop them using the dropna() function:
movies = movies.dropna()
Be careful with the expression above: we redefined movies after dropping the null values (movies = movies.dropna()). Let's check the dataframe again:
movies.head()
We can remove the years from the title using the str.replace function. Note that in recent pandas versions you must pass regex=True for the pattern to be treated as a regular expression:
movies['title'] = movies['title'].str.replace(r'\(\d{4}\)', '', regex=True)
movies.head()
We will create new columns with boolean values for the genres. This will take a few steps, so follow carefully:
gen = movies['genres'].str.split('|')
gen_list = []
for i in gen:
    for j in i:
        gen_list.append(j)
Find the unique genres:
gen_list = set(gen_list) # set() function automatically removes the duplicates.
gen_list = list(gen_list)
print(len(gen_list))
gen_list=sorted(gen_list)
gen_list
The first value after sorting is the "(no genres listed)" placeholder, which is not a real genre, so we drop it using the pop() function:
gen_list.pop(0)
It looks like we have 20 different genres. Let's define a new column for each of them:
for i in gen_list:
    movies[i] = movies['genres'].str.contains(str(i))
movies.head()
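As an aside, pandas can build such boolean genre columns in a single step with str.get_dummies; a sketch on a toy frame (hypothetical values, not the real dataset):

```python
import pandas as pd

toy = pd.DataFrame({'genres': ['Action|Comedy', 'Comedy', 'Drama']})

# one column per genre, 1 where the genre appears in the row, 0 otherwise
dummies = toy['genres'].str.get_dummies('|')
print(dummies.columns.tolist())    # → ['Action', 'Comedy', 'Drama']
print(dummies['Comedy'].tolist())  # → [1, 1, 0]
```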
Now we can drop the genres column:
del movies['genres']
movies.head()
movies[gen_list].sum().sort_values()
movies[gen_list].sum().plot(kind = 'bar')
movies['year'] = pd.to_numeric(movies['year'])
movies['year'].plot()
Oops... It looks like we have some nonsensical years in our data. Let's trim out the years later than 2018 and earlier than 1800:
movies = movies[((movies['year'] < 2018) & (movies['year'] > 1800))]
Well, we redefined movies excluding those years:
movies['year'].plot()
Pandas allows you to save your DataFrame in different types of files: csv, xlsx, html, latex, stata, sql, xarray, json, and so on.
movies.to_csv('movies_new.csv')
After this operation, a file named movies_new.csv will be created. You can read it back with the command we used earlier.
In order to save an Excel file you need to create an ExcelWriter object first:
ex = pd.ExcelWriter('movies.xlsx')
movies.to_excel(ex, 'Sheet 1')
ex.close()  # newer pandas versions use close() instead of the removed save()
Now you have an Excel file named movies.xlsx with the dataframe saved to Sheet 1.
import sqlite3
conn = sqlite3.connect('movies.db')
movies.to_sql('Movies', conn)
conn.close()
Now we have saved our dataframe to an SQLite database.
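To read the table back you can use pd.read_sql; a self-contained sketch using an in-memory database and a tiny hypothetical frame:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')  # throwaway in-memory database
pd.DataFrame({'title': ['Heat'], 'year': [1995]}).to_sql('Movies', conn, index=False)

# run any SQL query against the saved table
back = pd.read_sql('SELECT * FROM Movies', conn)
print(back['title'].tolist())  # → ['Heat']
conn.close()
```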