Data Visualization

We covered some basic plotting notions in the pandas course. Here we will cover matplotlib.pyplot for data visualization in Python.

Let's start with importing necessary libraries:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [2]:
x = np.arange(0, 5, 0.1)
y = np.sin(x)
plt.plot(x, y)
Out[2]:
[<matplotlib.lines.Line2D at 0x1aae44a9748>]

It looks like nothing appeared but a line of text. You have to tell matplotlib to show the plot:

In [3]:
plt.show()

If you don't want to write this every time you plot, you can use this command:

In [4]:
%matplotlib inline
From now on, plots will be shown without the `plt.show()` command.
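
The text shown in Out[2] is the return value of `plt.plot`: a list of Line2D objects, one per plotted curve. A minimal sketch of keeping that object around and changing it afterwards (the variable names `lines` and `line` are illustrative, not part of the original notebook):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 5, 0.1)
y = np.sin(x)

# plt.plot returns a list with one Line2D per curve; take the single entry
lines = plt.plot(x, y)
line = lines[0]

# The Line2D object can be inspected or modified after the call
line.set_linewidth(2.0)
print(type(line).__name__, len(lines))
```

Ending the last line of a notebook cell with a semicolon is a common way to hide this text output.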
In [5]:
X = np.linspace(-np.pi, np.pi, 256, endpoint=True) # generating 256 linear points between -pi and +pi
C, S = np.cos(X), np.sin(X)
plt.plot(X, C)
plt.plot(X, S)
Out[5]:
[<matplotlib.lines.Line2D at 0x1aae45640f0>]

We actually combined the plots of the sin() and cos() functions in a single figure.

This plot uses the default settings. You can change most features of the plot; see the matplotlib.pyplot documentation.

For example you can change the size of the plot:

In [6]:
plt.figure(figsize=(8,6), dpi=80)
plt.plot(X, C)
plt.plot(X, S)
Out[6]:
[<matplotlib.lines.Line2D at 0x1aae4728a58>]
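
As a rough check of what `figsize` and `dpi` mean together: `figsize` is given in inches and `dpi` in dots per inch, so their product is the figure size in pixels. The sketch below assumes a standard (non high-DPI) backend; the names `width_px` and `height_px` are illustrative.

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 6), dpi=80)

# pixels = inches * dots per inch
width_px, height_px = fig.get_size_inches() * fig.dpi
print(width_px, height_px)

plt.close(fig)
```

With these values the figure should come out at about 640 by 480 pixels.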

You can change the colors:

In [7]:
plt.figure(figsize=(8,6), dpi=80)
plt.plot(X, C, color="blue", linewidth=1.0, linestyle="-")
plt.plot(X, S, color="green", linewidth=3.0, linestyle=":")
Out[7]:
[<matplotlib.lines.Line2D at 0x1aae4771c88>]
In [8]:
plt.figure(figsize=(8,6), dpi=80)
plt.plot(X, C, color="blue", linewidth=1.0, linestyle="-")
plt.plot(X, S, color="green", linewidth=3.0, linestyle=":")
# Set x limits
plt.xlim(-4.0,4.0)
# Set x ticks
plt.xticks(np.linspace(-4,4,9,endpoint=True))
# Set y limits
plt.ylim(-1.0,1.0)
# Set y ticks
plt.yticks(np.linspace(-1,1,20,endpoint=True))
Out[8]:
([<matplotlib.axis.YTick at 0x1aae47ddbe0>,
  <matplotlib.axis.YTick at 0x1aae47dde80>,
  <matplotlib.axis.YTick at 0x1aae4810d68>,
  <matplotlib.axis.YTick at 0x1aae47ea0f0>,
  <matplotlib.axis.YTick at 0x1aae481d4a8>,
  <matplotlib.axis.YTick at 0x1aae48282b0>,
  <matplotlib.axis.YTick at 0x1aae4828940>,
  <matplotlib.axis.YTick at 0x1aae4828fd0>,
  <matplotlib.axis.YTick at 0x1aae482f6a0>,
  <matplotlib.axis.YTick at 0x1aae482fd30>,
  <matplotlib.axis.YTick at 0x1aae4835400>,
  <matplotlib.axis.YTick at 0x1aae4835a20>,
  <matplotlib.axis.YTick at 0x1aae483b0f0>,
  <matplotlib.axis.YTick at 0x1aae483b780>,
  <matplotlib.axis.YTick at 0x1aae483be10>,
  <matplotlib.axis.YTick at 0x1aae48414e0>,
  <matplotlib.axis.YTick at 0x1aae4841b70>,
  <matplotlib.axis.YTick at 0x1aae4847240>,
  <matplotlib.axis.YTick at 0x1aae48478d0>,
  <matplotlib.axis.YTick at 0x1aae4847f60>],
 <a list of 20 Text yticklabel objects>)
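
The tick positions come straight from `np.linspace`, whose spacing is (stop - start) / (num - 1). A small sketch of what that means for the tick values used here; the variable names are illustrative:

```python
import numpy as np

# 9 points from -4 to 4: spacing is 8 / 8 = 1, so the ticks are whole numbers
x_ticks = np.linspace(-4, 4, 9, endpoint=True)
print(x_ticks)

# 20 points from -1 to 1: spacing is 2 / 19, which gives awkward tick values
y_ticks_20 = np.linspace(-1, 1, 20)
print(y_ticks_20[1] - y_ticks_20[0])

# 21 points would give a spacing of 2 / 20, about 0.1
y_ticks_21 = np.linspace(-1, 1, 21)
print(y_ticks_21[1] - y_ticks_21[0])
```

If round tick labels matter, choosing `num` so that `num - 1` divides the range evenly is one simple option.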

Let's add a grid:

In [9]:
plt.figure(figsize=(8,6), dpi=80)
plt.plot(X, C, color="blue", linewidth=3.0, linestyle="-")
plt.plot(X, S, color="green", linewidth=3.0, linestyle="-")
# Set x limits
plt.xlim(-4.0,4.0)
# Set x ticks
plt.xticks(np.linspace(-4,4,9,endpoint=True))
# Set y limits
plt.ylim(-1.0,1.0)
# Set y ticks
plt.yticks(np.linspace(-1,1,20,endpoint=True))
plt.grid(color='red')

We can label the ticks:

In [10]:
plt.figure(figsize=(8,6), dpi=80)
plt.plot(X, C, color="blue", linewidth=3.0, linestyle="-")
plt.plot(X, S, color="green", linewidth=3.0, linestyle="-")
# Set x limits
plt.xlim(-4.0,4.0)
plt.grid(color='grey')
# Set x ticks and label them with LaTeX strings
plt.xticks([-np.pi, -np.pi/2, 0, np.pi/2, np.pi],
       [r'$-\pi$', r'$-\pi/2$', r'$0$', r'$+\pi/2$', r'$+\pi$'])

# Set y ticks and their labels
plt.yticks([-1, 0, +1],
       [r'$-1$', r'$0$', r'$+1$'])
Out[10]:
([<matplotlib.axis.YTick at 0x1aae494a630>,
  <matplotlib.axis.YTick at 0x1aae48ba8d0>,
  <matplotlib.axis.YTick at 0x1aae4969cc0>],
 <a list of 3 Text yticklabel objects>)

Within the $ signs you can use LaTeX math notation, which matplotlib renders with its built-in math text engine.
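
The same `$...$` strings work in other text elements too, not only tick labels. A minimal sketch, assuming a title, axis label and legend entry are wanted (the plotted function and all label texts are made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.linspace(-np.pi, np.pi, 256)

plt.figure(figsize=(8, 6), dpi=80)
# Raw strings (r'...') keep the backslashes intact for the math renderer
plt.plot(X, (1 - np.cos(2 * X)) / 2, label=r'$\frac{1 - \cos(2x)}{2}$')
plt.xlabel(r'$x$ (radians)')
plt.title(r'$y = \frac{1 - \cos(2x)}{2}$')
plt.legend(loc='upper left', frameon=False)
```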

We can also save the plot as an image file:

In [11]:
plt.figure(figsize=(8,6), dpi=80)
plt.plot(X, C, color="blue", linewidth=3.0, linestyle="-")
plt.plot(X, S, color="green", linewidth=3.0, linestyle="-")
# Set x limits
plt.xlim(-4.0,4.0)
# Set x ticks
plt.xticks(np.linspace(-4,4,9,endpoint=True))
# Set y limits
plt.ylim(-1.0,1.0)
# Set y ticks
plt.yticks(np.linspace(-1,1,10,endpoint=True))
plt.grid(color='black')
plt.savefig('fig.png')
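
`plt.savefig` also accepts options that control the saved file. The sketch below is one possible variant, not the notebook's own cell: it writes to a temporary directory (a hypothetical location chosen so nothing is overwritten) and uses the `dpi` and `bbox_inches` parameters.

```python
import os
import tempfile

import numpy as np
import matplotlib.pyplot as plt

X = np.linspace(-np.pi, np.pi, 256)

plt.figure(figsize=(8, 6), dpi=80)
plt.plot(X, np.cos(X))

# Hypothetical output path inside a fresh temporary directory
out_path = os.path.join(tempfile.mkdtemp(), 'fig.png')

# dpi sets the resolution of the saved file;
# bbox_inches='tight' trims the empty margin around the axes
plt.savefig(out_path, dpi=150, bbox_inches='tight')
print(os.path.exists(out_path))
```

The file format is inferred from the extension, so changing `.png` to `.pdf` or `.svg` should produce a vector file instead.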

Add a legend:

In [12]:
plt.figure(figsize=(8,6), dpi=80)
plt.plot(X, C, color="blue", linewidth=3.0, linestyle="-", label = 'cos()')
plt.plot(X, S, color="green", linewidth=3.0, linestyle="-", label='sin()')
plt.xlim(-4.0,4.0)
plt.xticks(np.linspace(-4,4,9,endpoint=True))
plt.yticks(np.linspace(-1,1,10,endpoint=True))
plt.legend(loc='upper left', frameon=False)
Out[12]:
<matplotlib.legend.Legend at 0x1aae4a37da0>

Annotate some points:

In [13]:
plt.figure(figsize=(8,6), dpi=80)
plt.plot(X, C, color="blue", linewidth=3.0, linestyle="-", label = 'cos()')
plt.plot(X, S, color="green", linewidth=3.0, linestyle="-", label='sin()')
plt.xlim(-4.0,4.0)
plt.xticks(np.linspace(-4,4,9,endpoint=True))
plt.yticks(np.linspace(-1,1,10,endpoint=True))
plt.legend(loc='upper left', frameon=False)
# The text is passed positionally: the keyword was renamed from s to text in matplotlib 3.3
plt.annotate('sin(0) = 0', xy = (0,0),
            xytext=(+10, -30), textcoords='offset points', fontsize=12,
            arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2"))
Out[13]:
Text(10,-30,'sin(0) = 0')

Explanation of the annotate() parameters:

  • s: the annotation text (this parameter is called text in matplotlib 3.3 and later).
  • xy: a tuple that gives the coordinates of the point being annotated.
  • xytext: coordinates of the text. In this case they are relative to the annotated point.
  • textcoords: the coordinate system of xytext. Two common options: (1) 'offset points' - offset (in points) from the xy value; (2) 'offset pixels' - offset (in pixels) from the xy value.
  • arrowprops: a dictionary of properties describing the arrow that connects the text to the point.

For now you do not need to learn the details. It is enough to know what you can do while plotting data.
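
Putting these parameters together, here is a small sketch that annotates the maximum of the cosine curve instead. The point, the text and the offsets are chosen only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.linspace(-np.pi, np.pi, 256)

plt.figure(figsize=(8, 6), dpi=80)
plt.plot(X, np.cos(X), label='cos()')

# xy is the annotated point in data coordinates; xytext is read as an
# offset from that point because textcoords is 'offset points'
ann = plt.annotate('cos(0) = 1', xy=(0, 1),
                   xytext=(+30, -40), textcoords='offset points', fontsize=12,
                   arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=.2'))
print(ann.get_text(), ann.xy)
```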

Movies Database

Do you remember the dataset we manipulated in the pandas course? We saved it as a CSV file named movies_new.csv. Download the dataset and import it.

In [14]:
mv = pd.read_csv('movies_new.csv',encoding='latin1')
mv.head()
Out[14]:
Unnamed: 0 movieId title year Action Adventure Animation Children Comedy Crime ... Film-Noir Horror IMAX Musical Mystery Romance Sci-Fi Thriller War Western
0 0 1 Toy Story 1995 False True True True True False ... False False False False False False False False False False
1 1 2 Jumanji 1995 False True False True False False ... False False False False False False False False False False
2 2 3 Grumpier Old Men 1995 False False False False True False ... False False False False False True False False False False
3 3 4 Waiting to Exhale 1995 False False False False True False ... False False False False False True False False False False
4 4 5 Father of the Bride Part II 1995 False False False False True False ... False False False False False False False False False False

5 rows × 23 columns

Let's find out the categories:

In [15]:
cat_names = mv.columns
cat_names = cat_names[4:]
print(cat_names, '\n',len(cat_names))
Index(['Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime',
       'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'IMAX',
       'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War',
       'Western'],
      dtype='object') 
 19

We have 19 categories.

In [16]:
cat = mv[cat_names].sum().sort_values()
cat
Out[16]:
Film-Noir       133
IMAX            153
Western         168
War             367
Musical         393
Animation       446
Documentary     494
Mystery         542
Children        583
Fantasy         652
Sci-Fi          788
Horror          875
Crime          1099
Adventure      1115
Action         1543
Romance        1544
Thriller       1726
Comedy         3313
Drama          4359
dtype: int64
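
The counts above rely on the fact that summing a boolean column counts its True values. A toy sketch of the same idea on a hypothetical three-movie table (the titles and values are invented):

```python
import pandas as pd

# Hypothetical data: one row per movie, one boolean column per genre
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'Comedy': [True, False, True],
    'Drama': [True, True, True],
})

# True counts as 1 and False as 0, so sum() gives the number of movies per genre
counts = toy[['Comedy', 'Drama']].sum().sort_values()
print(counts)
```

Because one movie can belong to several genres, these counts can add up to more than the number of rows.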

Let's make a bar chart:

In [17]:
plt.figure(figsize=(12,5), dpi=80)
plt.bar(cat.index, cat.values, color='darkred')
plt.xticks(rotation=90, fontsize=12)
plt.xlabel('Categories',fontsize=15,color='darkblue')
plt.ylabel('Counts',fontsize=15,color='darkblue')
Out[17]:
Text(0,0.5,'Counts')

Maybe a pie chart works better? Also see this example

In [18]:
plt.figure(figsize=(10,10), dpi=80)
exp= [0] * 16 + [0.1] + [0] * 2
plt.pie(cat, explode = exp, labels=cat.index,shadow=True)
Out[18]:
([<matplotlib.patches.Wedge at 0x1aae5d79da0>,
  <matplotlib.patches.Wedge at 0x1aae5d82550>,
  <matplotlib.patches.Wedge at 0x1aae5d82cf8>,
  <matplotlib.patches.Wedge at 0x1aae5d8f550>,
  <matplotlib.patches.Wedge at 0x1aae5d8fd68>,
  <matplotlib.patches.Wedge at 0x1aae5d975c0>,
  <matplotlib.patches.Wedge at 0x1aae5d97dd8>,
  <matplotlib.patches.Wedge at 0x1aae5da1630>,
  <matplotlib.patches.Wedge at 0x1aae5da1e48>,
  <matplotlib.patches.Wedge at 0x1aae5daa6a0>,
  <matplotlib.patches.Wedge at 0x1aae5d60e80>,
  <matplotlib.patches.Wedge at 0x1aae5db46d8>,
  <matplotlib.patches.Wedge at 0x1aae5db4ef0>,
  <matplotlib.patches.Wedge at 0x1aae5dbc748>,
  <matplotlib.patches.Wedge at 0x1aae5dbcf60>,
  <matplotlib.patches.Wedge at 0x1aae5dc47b8>,
  <matplotlib.patches.Wedge at 0x1aae5dc4fd0>,
  <matplotlib.patches.Wedge at 0x1aae5dd1828>,
  <matplotlib.patches.Wedge at 0x1aae5ddb080>],
 [Text(1.09977,0.0226473,'Film-Noir'),
  Text(1.09769,0.0713027,'IMAX'),
  Text(1.09279,0.125741,'Western'),
  Text(1.07864,0.215716,'War'),
  Text(1.04586,0.340842,'Musical'),
  Text(0.992906,0.473433,'Animation'),
  Text(0.913758,0.61241,'Documentary'),
  Text(0.80423,0.750476,'Mystery'),
  Text(0.662018,0.878483,'Children'),
  Text(0.483018,0.988278,'Fantasy'),
  Text(0.25257,1.07061,'Sci-Fi'),
  Text(-0.0283506,1.09963,'Horror'),
  Text(-0.357877,1.04016,'Crime'),
  Text(-0.686637,0.859377,'Adventure'),
  Text(-0.97305,0.513004,'Action'),
  Text(-1.09997,0.00800369,'Romance'),
  Text(-1.0537,-0.574213,'Thriller'),
  Text(-0.316384,-1.05352,'Comedy'),
  Text(0.858899,-0.687236,'Drama')])
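
In the cell above, the `explode` list pulls out the wedge at position 16, which is Thriller in the sorted counts. One alternative, sketched here with a handful of the counts from Out[16], is to build the list from the label itself so it does not depend on the position (`cat_small` and `highlight` are illustrative names):

```python
import pandas as pd
import matplotlib.pyplot as plt

# A few of the genre counts from Out[16], in ascending order
cat_small = pd.Series({'Action': 1543, 'Romance': 1544, 'Thriller': 1726,
                       'Comedy': 3313, 'Drama': 4359})

# Offset only the wedge whose label matches; all others stay in place
highlight = 'Thriller'
explode = [0.1 if name == highlight else 0 for name in cat_small.index]

plt.figure(figsize=(6, 6), dpi=80)
plt.pie(cat_small, explode=explode, labels=cat_small.index)
```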

For a clearer visualization we can cut off some of the smaller categories:

In [19]:
pop_cat = cat[-12:]
In [20]:
plt.figure(figsize=(12,12), dpi=80)
plt.pie(pop_cat, labels=pop_cat.index,shadow=True)
Out[20]:
([<matplotlib.patches.Wedge at 0x1aae5e1d8d0>,
  <matplotlib.patches.Wedge at 0x1aae5e290b8>,
  <matplotlib.patches.Wedge at 0x1aae5e298d0>,
  <matplotlib.patches.Wedge at 0x1aae5e33128>,
  <matplotlib.patches.Wedge at 0x1aae5e33940>,
  <matplotlib.patches.Wedge at 0x1aae5e3c198>,
  <matplotlib.patches.Wedge at 0x1aae5e3c9b0>,
  <matplotlib.patches.Wedge at 0x1aae5e44208>,
  <matplotlib.patches.Wedge at 0x1aae5e44a20>,
  <matplotlib.patches.Wedge at 0x1aae5e50278>,
  <matplotlib.patches.Wedge at 0x1aae5e0b198>,
  <matplotlib.patches.Wedge at 0x1aae5e572b0>],
 [Text(1.09516,0.103108,'Mystery'),
  Text(1.05447,0.313195,'Children'),
  Text(0.963959,0.529889,'Fantasy'),
  Text(0.803345,0.751423,'Sci-Fi'),
  Text(0.556805,0.948666,'Horror'),
  Text(0.206524,1.08044,'Crime'),
  Text(-0.212695,1.07924,'Adventure'),
  Text(-0.670022,0.872394,'Action'),
  Text(-1.02104,0.409241,'Romance'),
  Text(-1.0812,-0.202499,'Thriller'),
  Text(-0.539794,-0.958448,'Comedy'),
  Text(0.80113,-0.753785,'Drama')])
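
Dropping the small categories changes what the whole pie stands for. A different option, shown here only as a sketch on a shortened copy of the counts, is to merge them into a single slice; the label 'Other' and the cut-off of three categories are arbitrary choices.

```python
import pandas as pd

# Shortened copy of the sorted counts from Out[16]
cat_small = pd.Series({'Film-Noir': 133, 'IMAX': 153, 'Western': 168,
                       'Thriller': 1726, 'Comedy': 3313, 'Drama': 4359})

n_keep = 3
top = cat_small.iloc[-n_keep:]             # the largest categories
other_total = cat_small.iloc[:-n_keep].sum()  # everything else, merged

# One 'Other' slice followed by the categories that were kept
grouped = pd.concat([pd.Series({'Other': other_total}), top])
print(grouped)
```

The resulting Series can be passed to `plt.pie` in the same way as `pop_cat`.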

The second thing we can do with our data is count the number of movies for each year.

In [21]:
yrs = mv['year'].value_counts()
yrs.head()
Out[21]:
2000    279
1996    274
2002    271
1998    270
2001    267
Name: year, dtype: int64
In [22]:
yrs = yrs.sort_index()
In [23]:
plt.figure(figsize=(12,5), dpi=80)
plt.plot(yrs, linewidth=3.0, linestyle="-")
plt.xticks(rotation=90, fontsize=11)
plt.grid(color='black')
plt.xlabel('Years',fontsize=15,color='darkblue')
plt.ylabel('Counts',fontsize=15,color='darkblue')
Out[23]:
Text(0,0.5,'Counts')
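
The `sort_index()` step matters because `value_counts()` orders its result by frequency, not by year. A toy sketch with made-up years shows the difference:

```python
import pandas as pd

# Hypothetical release years for six movies
years = pd.Series([1995, 1996, 1995, 2000, 1996, 1995], name='year')

by_count = years.value_counts()   # most frequent year first
by_year = by_count.sort_index()   # reordered chronologically for a line plot
print(by_year)
```

Plotting `by_count` directly would connect the points in frequency order, which makes the line jump back and forth along the x axis.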

Some Examples

When you examine the examples below, you will notice how advanced plotting with matplotlib can get. These examples are taken from: https://www.labri.fr/perso/nrougier/teaching/matplotlib/

In [24]:
n = 1024
X = np.random.normal(0,1,n)
Y = np.random.normal(0,1,n)
T = np.arctan2(Y,X)

plt.axes([0.025,0.025,0.95,0.95])
plt.scatter(X,Y, s=75, c=T, alpha=.5)

plt.xlim(-1.5,1.5), plt.xticks([])
plt.ylim(-1.5,1.5), plt.yticks([])
# savefig('../figures/scatter_ex.png',dpi=48)
plt.show()
In [25]:
n = 12
X = np.arange(n)
Y1 = (1-X/float(n)) * np.random.uniform(0.5,1.0,n)
Y2 = (1-X/float(n)) * np.random.uniform(0.5,1.0,n)

plt.axes([0.025,0.025,0.95,0.95])
plt.bar(X, +Y1, facecolor='#9999ff', edgecolor='white')
plt.bar(X, -Y2, facecolor='#ff9999', edgecolor='white')

for x,y in zip(X,Y1):
    plt.text(x+0.4, y+0.05, '%.2f' % y, ha='center', va= 'bottom')

for x,y in zip(X,Y2):
    plt.text(x+0.4, -y-0.05, '%.2f' % y, ha='center', va= 'top')

plt.xlim(-.5,n), plt.xticks([])
plt.ylim(-1.25,+1.25), plt.yticks([])

# savefig('../figures/bar_ex.png', dpi=48)
plt.show()
In [26]:
def f(x,y):
    return (1-x/2+x**5+y**3)*np.exp(-x**2-y**2)

n = 256
x = np.linspace(-3,3,n)
y = np.linspace(-3,3,n)
X,Y = np.meshgrid(x,y)

plt.axes([0.025,0.025,0.95,0.95])

plt.contourf(X, Y, f(X,Y), 8, alpha=.75, cmap=plt.cm.hot)
# contour expects linewidths (plural); the singular linewidth is ignored with a warning
C = plt.contour(X, Y, f(X,Y), 8, colors='black', linewidths=.5)
plt.clabel(C, inline=1, fontsize=10)

plt.xticks([]), plt.yticks([])
plt.show()
In [27]:
n = 20
Z = np.ones(n)
Z[-1] *= 2

plt.axes([0.025,0.025,0.95,0.95])

plt.pie(Z, explode=Z*.05, colors = ['%f' % (i/float(n)) for i in range(n)])
plt.gca().set_aspect('equal')
plt.xticks([]), plt.yticks([])
Out[27]:
(([], <a list of 0 Text xticklabel objects>),
 ([], <a list of 0 Text yticklabel objects>))
In [28]:
n = 8
X,Y = np.mgrid[0:n,0:n]
T = np.arctan2(Y-n/2.0, X-n/2.0)
R = 10+np.sqrt((Y-n/2.0)**2+(X-n/2.0)**2)
U,V = R*np.cos(T), R*np.sin(T)

plt.axes([0.025,0.025,0.95,0.95])
plt.quiver(X,Y,U,V,R, alpha=.5)
plt.quiver(X,Y,U,V, edgecolor='k', facecolor='None', linewidth=.5)

plt.xlim(-1,n), plt.xticks([])
plt.ylim(-1,n), plt.yticks([])
Out[28]:
((-1, 8), ([], <a list of 0 Text yticklabel objects>))
In [29]:
ax = plt.axes([0.025,0.025,0.95,0.95], polar=True)

N = 20
theta = np.arange(0.0, 2*np.pi, 2*np.pi/N)
radii = 10*np.random.rand(N)
width = np.pi/4*np.random.rand(N)
bars = plt.bar(theta, radii, width=width, bottom=0.0)

for r,bar in zip(radii, bars):
    bar.set_facecolor( plt.cm.jet(r/10.))
    bar.set_alpha(0.5)

ax.set_xticklabels([])
ax.set_yticklabels([])
# savefig('../figures/polar_ex.png',dpi=48)
plt.show()
In [30]:
eqs = []
eqs.append((r"$W^{3\beta}_{\delta_1 \rho_1 \sigma_2} = U^{3\beta}_{\delta_1 \rho_1} + \frac{1}{8 \pi 2} \int^{\alpha_2}_{\alpha_2} d \alpha^\prime_2 \left[\frac{ U^{2\beta}_{\delta_1 \rho_1} - \alpha^\prime_2U^{1\beta}_{\rho_1 \sigma_2} }{U^{0\beta}_{\rho_1 \sigma_2}}\right]$"))
eqs.append((r"$\frac{d\rho}{d t} + \rho \vec{v}\cdot\nabla\vec{v} = -\nabla p + \mu\nabla^2 \vec{v} + \rho \vec{g}$"))
eqs.append((r"$\int_{-\infty}^\infty e^{-x^2}dx=\sqrt{\pi}$"))
eqs.append((r"$E = mc^2 = \sqrt{{m_0}^2c^4 + p^2c^2}$"))
eqs.append((r"$F_G = G\frac{m_1m_2}{r^2}$"))


plt.axes([0.025,0.025,0.95,0.95])

for i in range(24):
    index = np.random.randint(0,len(eqs))
    eq = eqs[index]
    size = np.random.uniform(12,32)
    x,y = np.random.uniform(0,1,2)
    alpha = np.random.uniform(0.25,.75)
    plt.text(x, y, eq, ha='center', va='center', color="#11557c", alpha=alpha,
             transform=plt.gca().transAxes, fontsize=size, clip_on=True)

plt.xticks([]), plt.yticks([])
# savefig('../figures/text_ex.png',dpi=48)
plt.show()

Matplotlib is a very fundamental plotting library. Many other plotting libraries are built on top of Matplotlib. Let's check out one of them.

In [31]:
import seaborn as sns
In [32]:
tips = sns.load_dataset("tips")
sns.violinplot(x = "total_bill", data=tips)
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x1aae80af710>

seaborn is based on matplotlib so we can use matplotlib features with seaborn. For instance:

In [33]:
plt.figure(figsize=(8,4), dpi=80)
tips = sns.load_dataset("tips")
sns.violinplot(x = "total_bill", data=tips)
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x1aae80af080>
In [34]:
plt.figure(figsize=(7,7), dpi=80)
# Load iris data
iris = sns.load_dataset("iris")
# Construct iris plot
sns.swarmplot(x="species", y="petal_length", data=iris)
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x1aae80ba860>
In [35]:
titanic = sns.load_dataset("titanic")

# Set up a categorical plot (factorplot was renamed to catplot in seaborn 0.9)
g = sns.catplot(x="class", y="survived", hue="sex", data=titanic, kind="bar", palette="muted", legend=False)
 

seaborn's built-in styles: white, dark, whitegrid, darkgrid, ticks.

In [36]:
sns.set() #default seaborn style
titanic = sns.load_dataset("titanic")
g = sns.catplot(x="class", y="survived", hue="sex", data=titanic, kind="bar", palette="muted", legend=False)
In [37]:
sns.set_style('whitegrid') # white background with grid lines
titanic = sns.load_dataset("titanic")
g = sns.catplot(x="class", y="survived", hue="sex", data=titanic, kind="bar", palette="muted", legend=False)
In [38]:
sns.set_style('dark') # dark background without grid lines
titanic = sns.load_dataset("titanic")
g = sns.catplot(x="class", y="survived", hue="sex", data=titanic, kind="bar", palette="muted", legend=False)

The objective of this course is to introduce Python's basic plotting library. One important point: you do not need to grasp every detail of these features. This course should give you an intuition for how to plot data. You should get into the habit of looking up what you need in the documentation, or on Google :).

Check out other data visualization libraries: 10 Useful Python Data Visualization Libraries for Any Discipline