Pandas Study Note 1

"Hello World, Hello Blog"

Posted by Leonard Yuan on February 15, 2019

Pandas Study Note

1. About Anaconda

Anaconda is a highly efficient Python platform. Here I just want to note one important topic: how to manage libraries.

  • list all libraries
conda list
  • install library
conda install xxx
  • Update all libraries
conda update --all 

2. Pandas Data Structure

Pandas provides higher-level data structures and functions than NumPy. In fact, Pandas is built on top of NumPy. The primary data objects of Pandas are DataFrame and Series.

a. Series

A Series is like a one-dimensional array. We can create it like this:

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
obj = Series([2,3,5,6,7])
obj
obj.index
obj.values
array([2, 3, 5, 6, 7])

Also, you can set an index for your Series, like:

obj = Series([2,3,4,5,6,7], index=['a','b','c','m','r','t'])
obj
a    2
b    3
c    4
m    5
r    6
t    7
dtype: int64
obj['a']
2

You can filter and calculate on the Series directly, like below:

obj[obj >5] # Get the values greater than 5
r    6
t    7
dtype: int64

Also, sometimes we can use a dictionary to build a Series, like:

test = {'a':12, 'b':234,'c':23}
obj3 = Series(test)
obj3
a     12
b    234
c     23
dtype: int64

Arithmetic can also be applied to Series directly. Values are aligned by index label, and a label that is missing from either operand produces NaN in the result:

dict1 = {'a':21,'b':23,'c':23,'e':13}
dict2 = {'a':23,'b':22,'e':10}
Series(dict1) + Series(dict2)
a    44.0
b    45.0
c     NaN
e    23.0
dtype: float64
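
If the NaN for 'c' is not what we want, one option is the add() method with fill_value. This is a minimal sketch using the same two dictionaries; treating a missing label as 0 is an assumption that has to fit your data:

```python
import pandas as pd

s1 = pd.Series({'a': 21, 'b': 23, 'c': 23, 'e': 13})
s2 = pd.Series({'a': 23, 'b': 22, 'e': 10})

# Plain "+" leaves NaN where a label exists on only one side.
plain = s1 + s2

# add() with fill_value treats the missing side as 0 before adding
# (assumption: 0 is a sensible default for this data).
filled = s1.add(s2, fill_value=0)

print(plain)
print(filled)
```

With fill_value, the label 'c' keeps its value from the first Series instead of becoming NaN.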

b. DataFrame

DataFrame is a data structure in Pandas that is similar to a data table.

data = {'state':['ohio','Mass','CA','Nevada'],
        'year':[1990,1890,1920,1934],
        'pop':[1.2,3.4,4.5,6.7]
       }
frame = DataFrame(data)
frame
state year pop
0 ohio 1990 1.2
1 Mass 1890 3.4
2 CA 1920 4.5
3 Nevada 1934 6.7

Or, we can decide the order of the columns, like:

frame2 = DataFrame(data, columns=['state','pop','year'])
frame2
state pop year
0 ohio 1.2 1990
1 Mass 3.4 1890
2 CA 4.5 1920
3 Nevada 6.7 1934

If you want to set the index, you can also do it as below. A column that is not in the data (here 'status') is filled with NaN:

frame3 = DataFrame(data,  columns=['state','pop','year','status'], index=['one','two','three','four'])
frame3
state pop year status
one ohio 1.2 1990 NaN
two Mass 3.4 1890 NaN
three CA 4.5 1920 NaN
four Nevada 6.7 1934 NaN
frame3['pop']
one      1.2
two      3.4
three    4.5
four     6.7
Name: pop, dtype: float64
frame3['state']['four']
'Nevada'

You can set a value for an entire column, like:

frame3['status'] = 'Pending'
frame3
state pop year status
one ohio 1.2 1990 Pending
two Mass 3.4 1890 Pending
three CA 4.5 1920 Pending
four Nevada 6.7 1934 Pending

Sometimes you want to set values at only some positions of a column. You can use a Series to achieve that; rows whose labels are not in the Series become NaN:

statusC = Series(['good','bad'], index = ['one', 'three'])
frame3['status'] = statusC
frame3
state pop year status
one ohio 1.2 1990 good
two Mass 3.4 1890 NaN
three CA 4.5 1920 bad
four Nevada 6.7 1934 NaN
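
If the other rows should keep their old value instead of turning into NaN, one possible alternative is to assign through loc. This is only a sketch on a cut-down copy of the table above (the 'year' column is left out for brevity):

```python
import pandas as pd

frame = pd.DataFrame(
    {'state': ['ohio', 'Mass', 'CA', 'Nevada'], 'pop': [1.2, 3.4, 4.5, 6.7]},
    index=['one', 'two', 'three', 'four'],
)
frame['status'] = 'Pending'

# Overwrite only the rows labelled 'one' and 'three';
# the other rows keep 'Pending'.
frame.loc[['one', 'three'], 'status'] = ['good', 'bad']
print(frame)
```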

If you want to delete a certain column, you can use del to achieve that, like:

del frame3['status']
frame3
state pop year
one ohio 1.2 1990
two Mass 3.4 1890
three CA 4.5 1920
four Nevada 6.7 1934

We can get the transpose of a table, like (the original table is not changed):

frame3.T
one two three four
state ohio Mass CA Nevada
pop 1.2 3.4 4.5 6.7
year 1990 1890 1920 1934
frame3
state pop year
one ohio 1.2 1990
two Mass 3.4 1890
three CA 4.5 1920
four Nevada 6.7 1934

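In Pandas, you can pass a new index to reindex the rows. A label that does not exist yet (here 'five') becomes a row of NaN, and a label that is left out (here 'one') is dropped: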

DataFrame(frame3, index = ['two','three','four','five'])
state pop year
two Mass 3.4 1890.0
three CA 4.5 1920.0
four Nevada 6.7 1934.0
five NaN NaN NaN

We can set the names of the columns and the index, like:

frame3.index.name = 'Number'
frame3.columns.name = 'Column'
frame3
Column state pop year
Number
one ohio 1.2 1990
two Mass 3.4 1890
three CA 4.5 1920
four Nevada 6.7 1934

When we want to get the values of the DataFrame, we can operate like this:

frame3.values
array([['ohio', 1.2, 1990],
       ['Mass', 3.4, 1890],
       ['CA', 4.5, 1920],
       ['Nevada', 6.7, 1934]], dtype=object)

c. Index Object

This is another useful data structure of Pandas. It holds the axis labels of a Series or DataFrame, and we can get it like below:

newobj = Series(range(4), index = ['a','b','c','d'])
index = newobj.index
index
Index(['a', 'b', 'c', 'd'], dtype='object')

Note: An Index is immutable, so its elements cannot be reassigned.

By using Index, different data sets can share the same Index, like:
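
To see what this means in practice, here is a small sketch. Assigning to one element raises a TypeError, so to change labels we build a new Index instead (the upper-casing is just an illustrative choice):

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c', 'd'])

try:
    index[1] = 'z'      # item assignment is not supported on an Index
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__)

# To change labels, build a new Index from the old one.
renamed = index.map(lambda label: label.upper())
print(list(renamed))
```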

newobj2 = Series(range(2,6), index = index)
newobj2
a    2
b    3
c    4
d    5
dtype: int64

Sometimes we want to check whether a label exists in the columns or the index. We can do it like this:

'state' in frame3.columns
True
'five' in frame3.index
False

There are some methods and attributes on Index.

Number Method Description
1 append Concatenate another Index, returning a new Index
2 difference Get the set difference of two Indexes (named diff in very old pandas releases)
3 intersection Get the intersection of two Indexes
4 delete(i) Delete the element at position i, returning a new Index
5 is_unique Attribute that is True when the Index has no repeated labels
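
A short sketch of these methods on two small, made-up Index objects:

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

joined = left.append(right)          # concatenation, duplicates are kept
only_left = left.difference(right)   # labels in left but not in right
common = left.intersection(right)    # labels present in both
shorter = left.delete(0)             # drop the label at position 0

print(list(joined), joined.is_unique)
print(list(only_left))
print(list(common))
print(list(shorter))
```

Every call returns a new Index, so left and right themselves stay unchanged.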

3. Basic Functions

a. reindex

This function rearranges the data of a Series according to a new index and returns a new object, like:

obj = Series([1,2,3,4], index = ['a','b','c','d'])
obj
a    1
b    2
c    3
d    4
dtype: int64
obj.reindex(['b','c','e','a','d'])
b    2.0
c    3.0
e    NaN
a    1.0
d    4.0
dtype: float64

We can see a NaN value for the new index label 'e'. reindex accepts a fill_value argument, so pandas puts a default value in the new empty places while the existing values are kept, like:

obj.reindex(['b','c','e','a','d'], fill_value=0)
b    2
c    3
e    0
a    1
d    4
dtype: int64

Sometimes we need to insert values into a Series. We can also use reindex with the method argument to achieve that, like:

obj3 = Series(['blue','purple','yellow'], index = [0,2,4])
obj3
0      blue
2    purple
4    yellow
dtype: object
obj3.reindex(range(6), method='ffill')
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

There are two ways to fill empty places:

  • ffill/pad: fill forward, carrying the previous value

  • bfill/backfill: fill backward, using the next value
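
To compare the two directions, here is a minimal sketch on the same color Series as above:

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# ffill copies the previous known value into each new position.
forward = colors.reindex(range(6), method='ffill')

# bfill copies the next known value; position 5 has no next value,
# so it stays NaN.
backward = colors.reindex(range(6), method='bfill')

print(forward)
print(backward)
```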

For a DataFrame, we can also use reindex to change the order of the columns and the index, and to fill empty places.

First of all, we create a DataFrame object with some state data, like:

stateframe = DataFrame(np.arange(9).reshape(3,3), index = ['a','b','c'], columns =['Ohio','Texas', 'California'])
stateframe
Ohio Texas California
a 0 1 2
b 3 4 5
c 6 7 8
stateframe2 = stateframe.reindex(columns = ['Ohio','Texas', 'Mass','California'])
stateframe2
Ohio Texas Mass California
a 0 1 NaN 2
b 3 4 NaN 5
c 6 7 NaN 8
stateframe2 = stateframe2.reindex(index = ['a','b','c','d'], method = 'ffill', columns = ['Ohio','Texas', 'Mass','California'])
stateframe2
Ohio Texas Mass California
a 0 1 NaN 2
b 3 4 NaN 5
c 6 7 NaN 8
d 6 7 NaN 8

If we think using reindex is overcomplicated, we can use label-based indexing to achieve the same operation. The original note used ix here, which is deprecated and was removed in pandas 1.0; loc does the same job for labels:

stateframe2.loc[['a','c','b','d'],['Ohio','Texas', 'Mass','California']]
Ohio Texas Mass California
a 0 1 NaN 2
c 6 7 NaN 8
b 3 4 NaN 5
d 6 7 NaN 8

b. Drop Data

When we want to delete some data from a data set, we can use the drop() function. It returns a new object and leaves the original unchanged.

data = DataFrame(np.arange(16).reshape(4,4), index = ['ohio','CA','MA','NYC'], columns = ['one','two','three','four'])
data
one two three four
ohio 0 1 2 3
CA 4 5 6 7
MA 8 9 10 11
NYC 12 13 14 15
data.drop('ohio')
one two three four
CA 4 5 6 7
MA 8 9 10 11
NYC 12 13 14 15
data.drop(['CA','NYC'])
one two three four
ohio 0 1 2 3
MA 8 9 10 11

If you want to delete a column from a data table, you should pass “axis = 1” to indicate that you are dropping along the columns.

data.drop(['one'], axis = 1)
two three four
ohio 1 2 3
CA 5 6 7
MA 9 10 11
NYC 13 14 15
data.drop(['one','two'], axis = 1)
three four
ohio 2 3
CA 6 7
MA 10 11
NYC 14 15
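
As a small sketch of the same idea, drop() also accepts the index= and columns= keywords, which some people find easier to read than axis. The example also shows that the original table is left untouched:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['ohio', 'CA', 'MA', 'NYC'],
                    columns=['one', 'two', 'three', 'four'])

without_row = data.drop(index='ohio')              # same as data.drop('ohio')
without_cols = data.drop(columns=['one', 'two'])   # same as axis=1

print(without_row)
print(without_cols)
print(data.shape)   # the original still has 4 rows and 4 columns
```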

c. Indexes, Selection and Filter

The indexing mechanism of Series is similar to NumPy indexing, but the index of a Series is not limited to integers. I will show it below:

obj = Series(np.arange(4), index = ['a','b','c','d'])
obj
a    0
b    1
c    2
d    3
dtype: int64
obj[0:4]
a    0
b    1
c    2
d    3
dtype: int64
obj[obj<2]
a    0
b    1
dtype: int64
data
one two three four
ohio 0 1 2 3
CA 4 5 6 7
MA 8 9 10 11
NYC 12 13 14 15
data['one']
ohio     0
CA       4
MA       8
NYC     12
Name: one, dtype: int64
data.T['CA']
one      4
two      5
three    6
four     7
Name: CA, dtype: int64

We can also use slice operations on a DataFrame, like:

data[:-2]
one two three four
ohio 0 1 2 3
CA 4 5 6 7

We can use a boolean condition on a DataFrame to get the rows that match the requirement, like:

data[data['three'] > 6]
one two three four
MA 8 9 10 11
NYC 12 13 14 15

We can assign to all positions that match a condition, like:

data[data > 5] =0
data
one two three four
ohio 0 1 2 3
CA 4 5 0 0
MA 0 0 0 0
NYC 0 0 0 0
data = DataFrame(np.arange(16).reshape(4,4), index = ['ohio','CA','MA','NYC'], columns = ['one','two','three','four'])
data
one two three four
ohio 0 1 2 3
CA 4 5 6 7
MA 8 9 10 11
NYC 12 13 14 15

You can also use loc (by label) and iloc (by position) to select data from a data set; they replace the deprecated ix indexer, like:

data.loc['ohio', ['one','two']]
one    0
two    1
Name: ohio, dtype: int64
data.iloc[3]
one      12
two      13
three    14
four     15
Name: NYC, dtype: int64
data.loc[data.three > 6, data.columns[:3]]
one two three
MA 8 9 10
NYC 12 13 14
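
To keep label-based and position-based selection apart, here is a minimal sketch on the same 4x4 table. One detail to note: a label slice with loc includes its end point, while a position slice with iloc does not.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['ohio', 'CA', 'MA', 'NYC'],
                    columns=['one', 'two', 'three', 'four'])

by_label = data.loc['ohio', ['one', 'two']]      # row label, column labels
by_position = data.iloc[3]                       # fourth row, by position
filtered = data.loc[data['three'] > 6, data.columns[:3]]

label_slice = data.loc['ohio':'MA']   # 'MA' is included
position_slice = data.iloc[0:2]       # position 2 is excluded

print(by_label)
print(by_position)
print(filtered)
```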

4. Arithmetic Operation and Alignment of Data

In Pandas, we can apply arithmetic operations directly to objects, as shown below:

s1 = Series(range(5), index = ['a','b','c','d','e'])
s2 = Series(range(6), index = ['b','d','e','c','m','f'])
s1+ s2
a    NaN
b    1.0
c    5.0
d    4.0
e    6.0
f    NaN
m    NaN
dtype: float64

For index labels that do not overlap, Pandas fills those places with NaN.

For DataFrame, we can also apply arithmetic operations between different DataFrames. The alignment happens on both the rows and the columns, like:

df1 = DataFrame(np.arange(9).reshape(3,3), columns = list('abc'), index = ['CA','MA','TS'])
df2 = DataFrame(np.arange(12).reshape(4,3), columns = list('bce'), index = ['CA','MA','TS', 'OH'])
df1 + df2
a b c e
CA NaN 1.0 3.0 NaN
MA NaN 7.0 9.0 NaN
OH NaN NaN NaN NaN
TS NaN 13.0 15.0 NaN

We can see that Pandas fills the empty places with NaN. Sometimes we want to use a default number for a value that is missing on one side instead. A position that is missing from both tables (here OH/a) still stays NaN, like:

df1.add(df2, fill_value = 0)
a b c e
CA 0.0 1.0 3.0 2.0
MA 3.0 7.0 9.0 5.0
OH NaN 9.0 10.0 11.0
TS 6.0 13.0 15.0 8.0

Here are four different methods for arithmetic operations:

  • add( )

  • sub( )

  • div( )

  • mul( )

df1.mul(df2, fill_value = 1)
a b c e
CA 0.0 0.0 2.0 2.0
MA 3.0 12.0 20.0 5.0
OH NaN 9.0 10.0 11.0
TS 6.0 42.0 56.0 8.0
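
The other methods follow the same pattern. As a sketch, this is sub() on the same two tables, together with a check that the method without fill_value gives the same result as the "-" operator:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list('abc'),
                   index=['CA', 'MA', 'TS'])
df2 = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list('bce'),
                   index=['CA', 'MA', 'TS', 'OH'])

# A value missing on one side is replaced by 0 before subtracting.
diff = df1.sub(df2, fill_value=0)
print(diff)

# Without fill_value, sub() behaves like the plain operator.
same = (df1 - df2).equals(df1.sub(df2))
print(same)
```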

a. Calculation Between DataFrame and Series

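In Pandas, we can calculate between a DataFrame and a Series. By default the Series index is matched against the columns and the operation is repeated for every row; pass axis = 0 to match against the row index instead, like: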

df3 = DataFrame(np.arange(12).reshape(4,3), columns = list('bce'), index = ['CA','MA','TS', 'OH'])
df3
b c e
CA 0 1 2
MA 3 4 5
TS 6 7 8
OH 9 10 11
newframe = df3.iloc[0]
newframe
b    0
c    1
e    2
Name: CA, dtype: int64
df3 - newframe
b c e
CA 0 0 0
MA 3 3 3
TS 6 6 6
OH 9 9 9
series = df3['b']
series
CA    0
MA    3
TS    6
OH    9
Name: b, dtype: int64
df3.sub(series, axis = 0)
b c e
CA 0 1 2
MA 0 1 2
TS 0 1 2
OH 0 1 2
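
A common use of this is centering data. The following sketch (the variable names are only illustrative) subtracts the column means and then the row means from the same table:

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list('bce'),
                   index=['CA', 'MA', 'TS', 'OH'])

# df3.mean() is indexed by column name, so the default alignment works.
col_centered = df3 - df3.mean()

# df3.mean(axis=1) is indexed by row label, so we must align on axis=0.
row_centered = df3.sub(df3.mean(axis=1), axis=0)

print(col_centered)
print(row_centered)
```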

5. Function Application and Mapping

First of all, we build a data object. NumPy element-wise functions such as np.abs work on it directly, like:

frame = DataFrame(np.random.randn(4, 3), columns = list('bde'), index = ['Utah', 'Ohio','Texas', 'Oregon'])
frame
b d e
Utah 1.434118 -0.820673 0.610912
Ohio 0.572641 0.054709 -0.257323
Texas -2.293399 -0.407289 1.037199
Oregon -0.430910 -1.018978 -1.231238
np.abs(frame)
b d e
Utah 1.434118 0.820673 0.610912
Ohio 0.572641 0.054709 0.257323
Texas 2.293399 0.407289 1.037199
Oregon 0.430910 1.018978 1.231238

If we want to apply a function to an entire column or row, we can use the apply() function, like:

f = lambda x: x.max() - x.min()
frame.apply(f)
b    3.727518
d    1.073687
e    2.268437
dtype: float64
frame.apply(f, axis = 1)
Utah      2.254791
Ohio      0.829963
Texas     3.330598
Oregon    0.800329
dtype: float64

The function can also return a Series that contains multiple values:

def f(x):
    return Series([x.max(),x.min()], index = ['max', 'min'])
frame.apply(f)
b d e
max 1.434118 0.054709 1.037199
min -2.293399 -1.018978 -1.231238

If we want to operate on every element of a DataFrame, we can use the applymap function (in pandas 2.1 and later it is deprecated in favor of DataFrame.map, which does the same thing), like:

format = lambda x: '%.2f'% x
frame.applymap(format)
b d e
Utah 1.43 -0.82 0.61
Ohio 0.57 0.05 -0.26
Texas -2.29 -0.41 1.04
Oregon -0.43 -1.02 -1.23

There is a similar function on Series, map, which applies a function to each element, like:

frame['b'].map(format)
Utah       1.43
Ohio       0.57
Texas     -2.29
Oregon    -0.43
Name: b, dtype: object
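
Because the table above is random, the numbers change on every run. Here is the same idea as a sketch on a small fixed table (the values are made up), so the results can be checked by hand:

```python
import pandas as pd

frame = pd.DataFrame({'b': [1.0, -2.0, 3.5], 'd': [0.5, 4.0, -1.0]},
                     index=['Utah', 'Ohio', 'Texas'])

def spread(values):
    """Difference between the largest and the smallest value."""
    return values.max() - values.min()

per_column = frame.apply(spread)          # one result per column
per_row = frame.apply(spread, axis=1)     # one result per row

# Series.map works element by element.
labels = frame['b'].map(lambda x: '%.2f' % x)

print(per_column)
print(per_row)
print(labels)
```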

6. Sorting and Ranking

If we want to sort by the index labels, we should use the sort_index() function:

  • For Series
obj = Series(range(4), index = ['a','c','b','d'])
obj.sort_index()
# obj.sort_index(ascending = False) sorts in descending order instead
a    0
b    2
c    1
d    3
dtype: int64
  • For DataFrame
frame = DataFrame(np.arange(8).reshape(2, 4), index = ['three','one'], columns = ['a','c','b','d'])
frame
a c b d
three 0 1 2 3
one 4 5 6 7
frame.sort_index()
a c b d
one 4 5 6 7
three 0 1 2 3
frame.sort_index(axis = 1, ascending = False)
d c b a
three 3 1 2 0
one 7 5 6 4

If we want to sort by values, we should use the sort_values() function:

obj = Series([7, -1, 8, 5])
obj.sort_values()
# The tutorial uses a function named order(), but it has been removed from pandas; sort_values() replaces it.
1   -1
3    5
0    7
2    8
dtype: int64

If there are NaN values, they are sorted to the last positions by default, like:

obj2 = Series([4, np.nan, 7, np.nan, -1, 3, 12, np.nan])
obj2.sort_values()
4    -1.0
5     3.0
0     4.0
2     7.0
6    12.0
1     NaN
3     NaN
7     NaN
dtype: float64

In a DataFrame, we can sort by one or more columns, as shown below:

frame = DataFrame({'a':[5,2,6,3,1], 'b':[-1,0,1,1,2]})
frame.sort_values(by = 'b')
a b
0 5 -1
1 2 0
2 6 1
3 3 1
4 1 2
frame.sort_values(by = ['b','a']) # ['a','b'] gives a different result from ['b','a']
a b
0 5 -1
1 2 0
3 3 1
2 6 1
4 1 2
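
When sorting by several columns, each column can get its own direction. A minimal sketch on the same table, sorting 'b' in descending order and breaking ties by 'a' in ascending order:

```python
import pandas as pd

frame = pd.DataFrame({'a': [5, 2, 6, 3, 1], 'b': [-1, 0, 1, 1, 2]})

# One ascending flag per column listed in "by".
ordered = frame.sort_values(by=['b', 'a'], ascending=[False, True])
print(ordered)
```

The two rows with b equal to 1 come out with a = 3 before a = 6.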

Pandas also has a function named rank(), which can be used to generate rankings. By default, tied values get the average of their ranks:

obj = Series([7, -5, 7, 4,2, 0, 4])
obj.rank()
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

We can also break ties according to the order in which the values appear in the data set.

obj.rank(method = 'first')
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

We can also rank in descending order.

obj.rank(ascending = False, method='max')
0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
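
The method argument only changes how ties are handled. As a sketch, here are two more options, 'min' and 'dense', on the same Series:

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# 'min': tied values all get the lowest rank of their group.
lowest = obj.rank(method='min')

# 'dense': like 'min', but the next rank after a tie increases by 1 only.
dense = obj.rank(method='dense')

print(lowest.tolist())
print(dense.tolist())
```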

We can also use this function on DataFrame objects, like:

frame = DataFrame({'b':[4.3, 7, -3, 2], 'a':[0,1,0,1],'c':[-2,5,8,-2.5]})
frame
b a c
0 4.3 0 -2.0
1 7.0 1 5.0
2 -3.0 0 8.0
3 2.0 1 -2.5
frame.rank()
b a c
0 3.0 1.5 2.0
1 4.0 3.5 3.0
2 1.0 1.5 4.0
3 2.0 3.5 1.0
frame.rank(axis = 1)
b a c
0 3.0 2.0 1.0
1 3.0 1.0 2.0
2 1.0 2.0 3.0
3 3.0 2.0 1.0

7. Axis Index With Repeating Values

In Series:

obj = Series(range(5), index = ['a','a','b','b','c'])
obj
a    0
a    1
b    2
b    3
c    4
dtype: int64
obj.index.is_unique
False
obj['a']
a    0
a    1
dtype: int64

When we index a DataFrame with repeated labels, we get all the matching rows in the same way:

df = DataFrame(np.random.randn(4, 3), index = ['a','a','b','b'])
df.loc['a']
0 1 2
a -1.143057 -0.638386 1.094285
a 0.652737 -0.163082 -0.197243
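
One thing to watch with repeated labels is that the type of the result depends on the label. The sketch below shows this on the Series from the start of this section, and one possible way to collapse the duplicates (summing them is just an illustrative choice):

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

repeated = obj['a']   # label appears twice, so we get a Series
single = obj['c']     # label appears once, so we get a scalar

# Mark every occurrence of a label after the first one.
flags = obj.index.duplicated()

# Combine rows that share a label, here by summing them.
collapsed = obj.groupby(level=0).sum()

print(type(repeated).__name__, single)
print(list(flags))
print(collapsed)
```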