Pandas Study Note

1. About Anaconda

Anaconda is an efficient Python platform. Here I just want to note one important topic: how to manage libraries.

List all libraries

conda list

Install a library

conda install xxx

Update all libraries

conda update --all 

2. Pandas Data Structure

Pandas provides higher-level data structures and functions than NumPy. In fact, Pandas is built on top of NumPy. Its primary data objects are DataFrame and Series.

a. Series

A Series is like a one-dimensional array. We can create one like this:

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
obj = Series([2,3,5,6,7])

obj
obj.index
obj.values

array([2, 3, 5, 6, 7])

Also, you can set index for your Series, like:

obj = Series([2,3,4,5,6,7], index=['a','b','c','m','r','t'])

obj

a    2
b    3
c    4
m    5
r    6
t    7
dtype: int64

obj['a']

You can apply operations to the Series directly, like below:

obj[obj >5] # Select the values greater than 5

r    6
t    7
dtype: int64

Also, sometimes we can use dictionary to build Series, like:

test = {'a':12, 'b':234,'c':23}
obj3 = Series(test)

obj3

a     12
b    234
c     23
dtype: int64

Series can also be used directly in arithmetic. Values are aligned by index label, and a label that is missing from either operand gives NaN in the result.

dict1 = {'a':21,'b':23,'c':23,'e':13}
dict2 = {'a':23,'b':22,'e':10}
Series(dict1) + Series(dict2)

a    44.0
b    45.0
c     NaN
e    23.0
dtype: float64
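As a small sketch of how the NaN above could be avoided, the add() method accepts a fill_value that stands in for the missing side. The variable names s1, s2, plain and filled are only for illustration:

```python
import pandas as pd

s1 = pd.Series({'a': 21, 'b': 23, 'c': 23, 'e': 13})
s2 = pd.Series({'a': 23, 'b': 22, 'e': 10})

# The + operator leaves NaN where a label is missing from one side.
plain = s1 + s2

# add() with fill_value treats the missing side as 0 before adding.
filled = s1.add(s2, fill_value=0)

print(plain)
print(filled)
```

Here 'c' exists only in s1, so plain['c'] is NaN while filled['c'] keeps the value 23.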

b. DataFrame

DataFrame is a Pandas data structure that is similar to a data table.

data = {'state':['ohio','Mass','CA','Nevada'],
        'year':[1990,1890,1920,1934],
        'pop':[1.2,3.4,4.5,6.7]
       }
frame = DataFrame(data)
frame

	state	year	pop
0	ohio	1990	1.2
1	Mass	1890	3.4
2	CA	1920	4.5
3	Nevada	1934	6.7

Or we can specify the order of the columns, like:

frame2 = DataFrame(data, columns=['state','pop','year'])
frame2

	state	pop	year
0	ohio	1.2	1990
1	Mass	3.4	1890
2	CA	4.5	1920
3	Nevada	6.7	1934

If you want to set the index, you can do so as below. A column that is not in the data, such as status here, is filled with NaN:

frame3 = DataFrame(data,  columns=['state','pop','year','status'], index=['one','two','three','four'])
frame3

	state	pop	year	status
one	ohio	1.2	1990	NaN
two	Mass	3.4	1890	NaN
three	CA	4.5	1920	NaN
four	Nevada	6.7	1934	NaN

frame3['pop']

one      1.2
two      3.4
three    4.5
four     6.7
Name: pop, dtype: float64

frame3['state']['four']

'Nevada'

You can set a value for an entire column, like:

frame3['status'] = 'Pending'
frame3

	state	pop	year	status
one	ohio	1.2	1990	Pending
two	Mass	3.4	1890	Pending
three	CA	4.5	1920	Pending
four	Nevada	6.7	1934	Pending

Sometimes you want to set values at only some positions of a column. You can use a Series to achieve that; positions whose label is not in the Series become NaN:

statusC = Series(['good','bad'], index = ['one', 'three'])
frame3['status'] = statusC
frame3

	state	pop	year	status
one	ohio	1.2	1990	good
two	Mass	3.4	1890	NaN
three	CA	4.5	1920	bad
four	Nevada	6.7	1934	NaN

If you want to delete a column, you can use del, like:

del frame3['status']
frame3

	state	pop	year
one	ohio	1.2	1990
two	Mass	3.4	1890
three	CA	4.5	1920
four	Nevada	6.7	1934

We can get the transpose of a table, like:

frame3.T

	one	two	three	four
state	ohio	Mass	CA	Nevada
pop	1.2	3.4	4.5	6.7
year	1990	1890	1920	1934

frame3

	state	pop	year
one	ohio	1.2	1990
two	Mass	3.4	1890
three	CA	4.5	1920
four	Nevada	6.7	1934

In Pandas, you can pass a new index to select and add rows. A label that does not exist yet, such as five here, becomes a new row filled with NaN:

DataFrame(frame3, index = ['two','three','four','five'])

	state	pop	year
two	Mass	3.4	1890.0
three	CA	4.5	1920.0
four	Nevada	6.7	1934.0
five	NaN	NaN	NaN

We can set the names of the columns and the index, like:

frame3.index.name = 'Number'
frame3.columns.name = 'Column'
frame3

Column	state	pop	year
Number
one	ohio	1.2	1990
two	Mass	3.4	1890
three	CA	4.5	1920
four	Nevada	6.7	1934

When we want to get the values of the DataFrame, we can operate like this:

frame3.values

array([['ohio', 1.2, 1990],
       ['Mass', 3.4, 1890],
       ['CA', 4.5, 1920],
       ['Nevada', 6.7, 1934]], dtype=object)

c. Index Object

This is another useful data structure in Pandas. The labels of a Series or DataFrame are stored as an Index object, like below:

newobj = Series(range(4), index = ['a','b','c','d'])
index = newobj.index
index

Index(['a', 'b', 'c', 'd'], dtype='object')

Note: Index is immutable, so its elements cannot be reassigned.
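A minimal sketch of what this note means in practice. The names labels, s and assignment_failed are my own, not from the original notes:

```python
import pandas as pd

labels = pd.Index(['a', 'b', 'c', 'd'])

# Item assignment on an Index raises TypeError.
try:
    labels[1] = 'z'
    assignment_failed = False
except TypeError:
    assignment_failed = True
print(assignment_failed)

# Replacing the whole index of a Series is still allowed.
s = pd.Series(range(4), index=labels)
s.index = ['w', 'x', 'y', 'z']
print(list(s.index))
print(list(labels))  # the original Index object is unchanged
```

So single elements cannot be changed, but an object can be given a completely new index.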

By using Index, different data sets can share the same Index, like:

newobj2 = Series(range(2,6), index = index)
newobj2

a    2
b    3
c    4
d    5
dtype: int64

Sometimes we want to check whether a label is in the columns or the index. We can do it like this:

'state' in frame3.columns

True

'five' in frame3.index

False

There are some methods and attributes on Index.

Number	Method	Description
1	append	Concatenate another Index, producing a new Index
2	difference	Get the set difference of two Indexes
3	intersection	Get the intersection of two Indexes
4	delete(i)	Delete the key at position i, returning a new Index
5	is_unique	Return True when the index has no repeated values
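The table can be tried out with two small Index objects. This is only a sketch: left and right are made-up names, and union is an extra method I added for comparison with intersection.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

print(left.append(right))        # concatenation, repeated labels are kept
print(left.difference(right))    # labels in left that are not in right
print(left.intersection(right))  # labels present in both
print(left.union(right))         # labels present in either one
print(left.delete(0))            # new Index without the label at position 0
print(left.is_unique)            # no repeated labels in left
print(left.append(right).is_unique)
```

None of these calls changes left or right; each returns a new object.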

3. Basic Functions

a. reindex

This function rearranges a Series to match a new index and returns a new object, like:

obj = Series([1,2,3,4], index = ['a','b','c','d'])
obj

a    1
b    2
c    3
d    4
dtype: int64

obj.reindex(['b','c','e','a','d'])

b    2.0
c    3.0
e    NaN
a    1.0
d    4.0
dtype: float64

We can see a NaN value because e is a new index label. We can pass fill_value to reindex so that Pandas puts a default value in the empty places, like:

obj.reindex(['b','c','e','a','d'], fill_value=0)

b    2
c    3
e    0
a    1
d    4
dtype: int64

Sometimes we need to insert values into a Series. We can also use reindex to achieve that, like:

obj3 = Series(['blue','purple','yellow'], index = [0,2,4])
obj3

0      blue
2    purple
4    yellow
dtype: object

obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

There are two ways to fill empty places:

ffill/pad: fill forward from the previous valid value
bfill/backfill: fill backward from the next valid value
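To compare the two fill methods side by side, here is a small sketch on the same colour data. The names colors, forward and backward are only for illustration:

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# ffill copies the last seen value forward into the new labels.
forward = colors.reindex(range(6), method='ffill')

# bfill copies the next value backward; label 5 has no later value,
# so it stays NaN.
backward = colors.reindex(range(6), method='bfill')

print(forward)
print(backward)
```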

For DataFrame, we can also use reindex to change the order of the columns and the index labels, and to fill empty places.

First of all, we create a DataFrame object that holds some state data, like:

stateframe = DataFrame(np.arange(9).reshape(3,3), index = ['a','b','c'], columns =['Ohio','Texas', 'California'])
stateframe

	Ohio	Texas	California
a	0	1	2
b	3	4	5
c	6	7	8

stateframe2 = stateframe.reindex(columns = ['Ohio','Texas', 'Mass','California'])
stateframe2

	Ohio	Texas	Mass	California
a	0	1	NaN	2
b	3	4	NaN	5
c	6	7	NaN	8

stateframe2 = stateframe2.reindex(index = ['a','b','c','d'], method = 'ffill', columns = ['Ohio','Texas', 'Mass','California'])
stateframe2

	Ohio	Texas	Mass	California
a	0	1	NaN	2
b	3	4	NaN	5
c	6	7	NaN	8
d	6	7	NaN	8

If reindex feels overly complicated, we can use label-based indexing with loc to reorder existing labels (the older ix indexer is deprecated and has been removed from recent Pandas versions), like:

stateframe2.loc[['a','c','b','d'],['Ohio','Texas', 'Mass','California']]

	Ohio	Texas	Mass	California
a	0	1	NaN	2
c	6	7	NaN	8
b	3	4	NaN	5
d	6	7	NaN	8
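One difference is worth a sketch: reindex creates NaN rows for labels that do not exist, while loc (in Pandas 1.0 and later, as far as I know) raises a KeyError for them. The names states, reordered and missing_raises are my own:

```python
import numpy as np
import pandas as pd

states = pd.DataFrame(np.arange(9).reshape(3, 3),
                      index=['a', 'b', 'c'],
                      columns=['Ohio', 'Texas', 'California'])

# loc can reorder rows and columns that already exist.
reordered = states.loc[['a', 'c', 'b'], ['Texas', 'Ohio']]
print(reordered)

# reindex adds a NaN row for the unknown label 'd' ...
with_new_row = states.reindex(['a', 'd'])
print(with_new_row)

# ... while loc refuses a list that contains an unknown label.
try:
    states.loc[['a', 'd']]
    missing_raises = False
except KeyError:
    missing_raises = True
print(missing_raises)
```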

b. Drop Data

When we want to delete some data from a data set, we can use the drop() function. It returns a new object and leaves the original unchanged.

data = DataFrame(np.arange(16).reshape(4,4), index = ['ohio','CA','MA','NYC'], columns = ['one','two','three','four'])
data

	one	two	three	four
ohio	0	1	2	3
CA	4	5	6	7
MA	8	9	10	11
NYC	12	13	14	15

data.drop('ohio')

	one	two	three	four
CA	4	5	6	7
MA	8	9	10	11
NYC	12	13	14	15

data.drop(['CA','NYC'])

	one	two	three	four
ohio	0	1	2	3
MA	8	9	10	11

If you want to delete a column from a data table, pass axis = 1 to indicate that the labels refer to columns.

data.drop(['one'], axis = 1)

	two	three	four
ohio	1	2	3
CA	5	6	7
MA	9	10	11
NYC	13	14	15

data.drop(['one','two'], axis = 1)

	three	four
ohio	2	3
CA	6	7
MA	10	11
NYC	14	15
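A short sketch to confirm that drop() does not modify the original table, plus the columns keyword, which I believe is an equivalent spelling of axis = 1. The variable names are only for illustration:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['ohio', 'CA', 'MA', 'NYC'],
                    columns=['one', 'two', 'three', 'four'])

# drop returns a new DataFrame; data itself still has the 'ohio' row.
trimmed = data.drop('ohio')
print('ohio' in data.index, 'ohio' in trimmed.index)

# columns=[...] is another way to say "drop these labels along axis 1".
by_axis = data.drop(['one'], axis=1)
by_keyword = data.drop(columns=['one'])
print(by_axis.equals(by_keyword))
```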

c. Indexes, Selection and Filter

Series indexing works much like NumPy indexing, but Series index labels are not limited to integers. I will show this below:

obj = Series(np.arange(4), index = ['a','b','c','d'])
obj

a    0
b    1
c    2
d    3
dtype: int64

obj[0:4]

a    0
b    1
c    2
d    3
dtype: int64

obj[obj<2]

a    0
b    1
dtype: int64

data

	one	two	three	four
ohio	0	1	2	3
CA	4	5	6	7
MA	8	9	10	11
NYC	12	13	14	15

data['one']

ohio     0
CA       4
MA       8
NYC     12
Name: one, dtype: int64

data.T['CA']

one      4
two      5
three    6
four     7
Name: CA, dtype: int64

We can also use slice operation on DataFrame, like:

data[:-2]

	one	two	three	four
ohio	0	1	2	3
CA	4	5	6	7

We can use a condition on a DataFrame to get the rows that match it, like:

data[data['three'] > 6]

	one	two	three	four
MA	8	9	10	11
NYC	12	13	14	15

We can also assign a value to every position that matches a condition, like:

data[data > 5] =0
data

	one	two	three	four
ohio	0	1	2	3
CA	4	5	0	0
MA	0	0	0	0
NYC	0	0	0	0

data = DataFrame(np.arange(16).reshape(4,4), index = ['ohio','CA','MA','NYC'], columns = ['one','two','three','four'])
data

	one	two	three	four
ohio	0	1	2	3
CA	4	5	6	7
MA	8	9	10	11
NYC	12	13	14	15

You can also use loc[ ] to select data by label (ix[ ], which older tutorials use, is deprecated), like:

data.loc['ohio', ['one','two']]

one    0
two    1
Name: ohio, dtype: int64

data.iloc[3] # select the fourth row by position

one      12
two      13
three    14
four     15
Name: NYC, dtype: int64

data.loc[data.three > 6, data.columns[:3]]

	one	two	three
MA	8	9	10
NYC	12	13	14
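One detail that is easy to miss when moving between label-based and positional selection: a label slice includes its end label, while a positional slice excludes its end position. A small sketch on the same table (by_label and by_position are made-up names):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['ohio', 'CA', 'MA', 'NYC'],
                    columns=['one', 'two', 'three', 'four'])

# Label slices with loc include both endpoints.
by_label = data.loc['ohio':'MA', 'one':'two']

# Positional slices with iloc stop before the end position.
by_position = data.iloc[0:2, 0:2]

print(by_label)
print(by_position)
```

by_label has three rows (ohio, CA, MA) while by_position has only two (ohio, CA).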

4. Arithmetic Operation and Alignment of Data

In Pandas, we can apply arithmetic operations directly to objects, as shown below:

s1 = Series(range(5), index = ['a','b','c','d','e'])
s2 = Series(range(6), index = ['b','d','e','c','m','f'])
s1+ s2

a    NaN
b    1.0
c    5.0
d    4.0
e    6.0
f    NaN
m    NaN
dtype: float64

For index labels that do not overlap, Pandas fills those places with NaN.

For DataFrame, we can also apply arithmetic operations between different DataFrames, like:

df1 = DataFrame(np.arange(9).reshape(3,3), columns = list('abc'), index = ['CA','MA','TS'])
df2 = DataFrame(np.arange(12).reshape(4,3), columns = list('bce'), index = ['CA','MA','TS', 'OH'])
df1 + df2

	a	b	c	e
CA	NaN	1.0	3.0	NaN
MA	NaN	7.0	9.0	NaN
OH	NaN	NaN	NaN	NaN
TS	NaN	13.0	15.0	NaN

We can see that Pandas fills the empty places with NaN. Sometimes we want a default value there instead. Note that a position missing from both frames, such as (OH, a), still stays NaN:

df1.add(df2, fill_value = 0)

	a	b	c	e
CA	0.0	1.0	3.0	2.0
MA	3.0	7.0	9.0	5.0
OH	NaN	9.0	10.0	11.0
TS	6.0	13.0	15.0	8.0

Here are four arithmetic methods:

add( )
sub( )
div( )
mul( )

df1.mul(df2, fill_value = 1)

	a	b	c	e
CA	0.0	0.0	2.0	2.0
MA	3.0	12.0	20.0	5.0
OH	NaN	9.0	10.0	11.0
TS	6.0	42.0	56.0	8.0
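When no fill_value is needed, each of the four methods should give the same result as the matching operator. A quick sketch to check that, using two tiny frames a and b that are not from the original notes:

```python
import pandas as pd

a = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])
b = pd.DataFrame([[2, 2], [2, 2]], columns=['x', 'y'])

# Each method mirrors an operator: add <-> +, sub <-> -, mul <-> *, div <-> /
print(a.add(b).equals(a + b))
print(a.sub(b).equals(a - b))
print(a.mul(b).equals(a * b))
print(a.div(b).equals(a / b))
print(a.div(b))
```

The method form is mainly useful for its extra arguments such as fill_value and axis.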

a. Calculation Between DataFrame and Series

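In Pandas, we can do calculations between a DataFrame and a Series. By default the Series index is matched against the DataFrame columns and the operation is repeated for every row, like: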

df3 = DataFrame(np.arange(12).reshape(4,3), columns = list('bce'), index = ['CA','MA','TS', 'OH'])
df3

	b	c	e
CA	0	1	2
MA	3	4	5
TS	6	7	8
OH	9	10	11

newframe = df3.iloc[0]
newframe

b    0
c    1
e    2
Name: CA, dtype: int64

df3 - newframe

	b	c	e
CA	0	0	0
MA	3	3	3
TS	6	6	6
OH	9	9	9

series = df3['b']
series

CA    0
MA    3
TS    6
OH    9
Name: b, dtype: int64

df3.sub(series, axis = 0)

	b	c	e
CA	0	1	2
MA	0	1	2
TS	0	1	2
OH	0	1	2
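A typical use of axis = 0 is subtracting a per-row value, for example the row mean. This is only a sketch; df, row_means, centered and wrong are made-up names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  columns=list('bce'),
                  index=['CA', 'MA', 'TS', 'OH'])

# One mean per row, indexed by the row labels.
row_means = df.mean(axis=1)

# axis=0 matches the Series index against the row labels.
centered = df.sub(row_means, axis=0)
print(centered)

# Without axis=0 the row labels are matched against the column names,
# nothing lines up, and every value becomes NaN.
wrong = df - row_means
print(wrong.isna().all().all())
```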

5. Function Application and Mapping

First of all, we build a data object, like:

frame = DataFrame(np.random.randn(4, 3), columns = list('bde'), index = ['Utah', 'Ohio','Texas', 'Oregon'])

frame

	b	d	e
Utah	1.434118	-0.820673	0.610912
Ohio	0.572641	0.054709	-0.257323
Texas	-2.293399	-0.407289	1.037199
Oregon	-0.430910	-1.018978	-1.231238

np.abs(frame)

	b	d	e
Utah	1.434118	0.820673	0.610912
Ohio	0.572641	0.054709	0.257323
Texas	2.293399	0.407289	1.037199
Oregon	0.430910	1.018978	1.231238

If we want to apply a function to an entire column or row, we can use apply(), like:

f = lambda x: x.max() - x.min()

frame.apply(f)

b    3.727518
d    1.073687
e    2.268437
dtype: float64

frame.apply(f, axis = 1)

Utah      2.254791
Ohio      0.829963
Texas     3.330598
Oregon    0.800329
dtype: float64

The function can also return a Series that contains multiple values:

def f(x):
    return Series([x.max(),x.min()], index = ['max', 'min'])
frame.apply(f)

	b	d	e
max	1.434118	0.054709	1.037199
min	-2.293399	-1.018978	-1.231238

If we want to operate on every element of a DataFrame, we can use applymap (Pandas 2.1 and later call this DataFrame.map and deprecate applymap), like:

format = lambda x: '%.2f'% x

frame.applymap(format)

	b	d	e
Utah	1.43	-0.82	0.61
Ohio	0.57	0.05	-0.26
Texas	-2.29	-0.41	1.04
Oregon	-0.43	-1.02	-1.23

Series has a similar method, map, which applies a function to each element, like:

frame['b'].map(format)

Utah       1.43
Ohio       0.57
Texas     -2.29
Oregon    -0.43
Name: b, dtype: object
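Because the element-wise method has different names across Pandas versions, a hedged sketch that works on both could look like this. The helper two_decimals and the small frame are my own, and the hasattr check assumes that DataFrame.map only exists in Pandas 2.1 and later:

```python
import pandas as pd

frame = pd.DataFrame([[1.434, -0.82], [0.5726, 0.0547]],
                     columns=['b', 'd'],
                     index=['Utah', 'Ohio'])

def two_decimals(x):
    """Format one number as a string with two decimals."""
    return '%.2f' % x

# Pick DataFrame.map when it exists, otherwise fall back to applymap.
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
formatted = elementwise(two_decimals)
print(formatted)

# Series.map works element by element in every version.
print(frame['b'].map(two_decimals))
```

Using a named function instead of a lambda called format also avoids hiding the built-in format().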

6. Sorting and Ranking

If we want to sort by index labels, we should use the sort_index() function:

For Series

obj = Series(range(4), index = ['a','c','b','d'])
obj.sort_index()
# obj.sort_index(ascending = False) sorts in descending order instead

a    0
b    2
c    1
d    3
dtype: int64

For DataFrame

frame = DataFrame(np.arange(8).reshape(2, 4), index = ['three','one'], columns = ['a','c','b','d'])
frame

	a	c	b	d
three	0	1	2	3
one	4	5	6	7

frame.sort_index()

	a	c	b	d
one	4	5	6	7
three	0	1	2	3

frame.sort_index(axis = 1, ascending = False)

	d	c	b	a
three	3	1	2	0
one	7	5	6	4

If we want to sort by values, we should use the sort_values() function:

obj = Series([7, -1, 8, 5])
obj.sort_values()
# Some tutorials use a function named order(), but it has been removed from Pandas; use sort_values() instead.

1   -1
3    5
0    7
2    8
dtype: int64

If there are NaN values, they are sorted to the end by default, like:

obj2 = Series([4, np.nan, 7, np.nan, -1, 3, 12, np.nan])
obj2.sort_values()

4    -1.0
5     3.0
0     4.0
2     7.0
6    12.0
1     NaN
3     NaN
7     NaN
dtype: float64
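If the missing values should come first instead, sort_values takes a na_position argument. A minimal sketch (values, nan_first and descending are made-up names):

```python
import numpy as np
import pandas as pd

values = pd.Series([4, np.nan, 7, np.nan, -1, 3, 12, np.nan])

# na_position='first' moves the NaN entries to the top.
nan_first = values.sort_values(na_position='first')
print(nan_first)

# With ascending=False the NaN entries still go to the end by default.
descending = values.sort_values(ascending=False)
print(descending)
```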

In DataFrame, we can sort by one or more columns, as shown below:

frame = DataFrame({'a':[5,2,6,3,1], 'b':[-1,0,1,1,2]})
frame.sort_values(by = 'b')

	a	b
0	5	-1
1	2	0
2	6	1
3	3	1
4	1	2

frame.sort_values(by = ['b','a']) # ['a','b'] gives a different result from ['b','a']

	a	b
0	5	-1
1	2	0
3	3	1
2	6	1
4	1	2
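The order of the keys matters because the first column is the primary key and later columns only break ties. A sketch that also gives each key its own direction through a list for ascending (the result names are my own):

```python
import pandas as pd

frame = pd.DataFrame({'a': [5, 2, 6, 3, 1], 'b': [-1, 0, 1, 1, 2]})

# Primary key b, ties on b broken by a.
b_then_a = frame.sort_values(by=['b', 'a'])

# Primary key a; b is never needed because a has no ties.
a_then_b = frame.sort_values(by=['a', 'b'])

# b ascending, but ties on b broken by a in descending order.
mixed = frame.sort_values(by=['b', 'a'], ascending=[True, False])

print(b_then_a.index.tolist())
print(a_then_b.index.tolist())
print(mixed.index.tolist())
```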

Pandas also has a function named rank(), which can be used to generate rankings. By default, tied values receive the average of their ranks:

obj = Series([7, -5, 7, 4,2, 0, 4])
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

We can also break ties according to the order in which the values appear in the data set.

obj.rank(method = 'first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

We can also rank in descending order.

obj.rank(ascending = False, method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
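To see how the tie-breaking methods differ, they can be put next to each other in one table. A small sketch on the same numbers (scores and comparison are made-up names):

```python
import pandas as pd

scores = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking method.
methods = ['average', 'min', 'max', 'first', 'dense']
comparison = pd.DataFrame({m: scores.rank(method=m) for m in methods})
comparison.insert(0, 'value', scores)
print(comparison)
```

The two 7s share ranks 6 and 7: average gives 6.5, min gives 6, max gives 7, first gives 6 and 7 in order of appearance, and dense gives 5 because it leaves no gaps between groups.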

Still, we can use this function on DataFrame data objects, like:

frame = DataFrame({'b':[4.3, 7, -3, 2], 'a':[0,1,0,1],'c':[-2,5,8,-2.5]})
frame

	b	a	c
0	4.3	0	-2.0
1	7.0	1	5.0
2	-3.0	0	8.0
3	2.0	1	-2.5

frame.rank()

	b	a	c
0	3.0	1.5	2.0
1	4.0	3.5	3.0
2	1.0	1.5	4.0
3	2.0	3.5	1.0

frame.rank(axis = 1)

	b	a	c
0	3.0	2.0	1.0
1	3.0	1.0	2.0
2	1.0	2.0	3.0
3	3.0	2.0	1.0

7. Axis Index With Repeating Values

In Series:

obj = Series(range(5), index = ['a','a','b','b','c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

obj.index.is_unique

False

obj['a']

a    0
a    1
dtype: int64

When we index a DataFrame that has repeated labels, we get all of the matching rows:

df = DataFrame(np.random.randn(4, 3), index = ['a','a','b','b'])
df.loc['a']

	0	1	2
a	-1.143057	-0.638386	1.094285
a	0.652737	-0.163082	-0.197243
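A practical consequence of repeated labels is that the return type depends on the label: a repeated label gives a Series, a unique one gives a single value. A sketch that also shows one possible way to keep only the first entry per label (the variable names are my own):

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

repeated = obj['a']   # two matches, so a Series comes back
single = obj['c']     # one match, so a scalar comes back
print(type(repeated).__name__, single)

# duplicated() marks every repeat after the first occurrence.
print(obj.index.duplicated().tolist())

# Keep only the first entry for each label.
first_only = obj[~obj.index.duplicated(keep='first')]
print(first_only)
```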
