Data Science Study Note 1
Prerequisites: Data science has become a hot field in recent years, so it is worth learning the knowledge and skills behind data analysis and data mining.
- First, we should study NumPy, Pandas, SciPy, and scikit-learn.
- Then, we start learning TensorFlow, Keras, and Torch.
- We also need to be familiar with basic math concepts such as probability and linear algebra. So, let's get started.
1. Basics of Python
I skip some topics in this chapter because I have used Python before, so there is no need to repeat them. I just share a few tips about Python.
Python has built-in collection types such as set, list, dictionary, and tuple.
- The zip operation
a = range(10)
b = range(10, 20)
c = zip(a, b)
c
<zip at 0x10b91af08>
for m in c:
    print(m)
(0, 10)
(1, 11)
(2, 12)
(3, 13)
(4, 14)
(5, 15)
(6, 16)
(7, 17)
(8, 18)
(9, 19)
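One detail worth noting about the zip example above: a zip object is an iterator, so it can only be consumed once. The sketch below reuses the same two ranges (the names `pairs`, `first_pass`, `second_pass`, `left`, and `right` are illustrative, not from the original note) and also shows how `zip(*...)` can undo the pairing.

```python
# Same sample data as in the note above.
a = range(10)
b = range(10, 20)

# A zip object is a one-shot iterator: after one full pass it is exhausted.
pairs = zip(a, b)
first_pass = list(pairs)   # materialize the pairs into a list
second_pass = list(pairs)  # nothing is left to iterate over

# Unpacking the list of pairs into zip() reverses the pairing.
left, right = zip(*first_pass)

print(first_pass[:3])
print(second_pass)
print(left)
print(right)
```

If the pairs are needed more than once, storing them with `list(zip(a, b))` is usually the simplest option.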
- List comprehensions
List comprehensions are a useful tool for transforming a data set, like this:
x = [1, 23, 32, 8, 33, 97, 123]
result = [m if m < 30 else 0 for m in x]
result
[1, 23, 0, 8, 0, 0, 0]
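The comprehension above replaces values and keeps the list length unchanged. A closely related form puts the condition at the end and filters elements out instead. The following sketch contrasts the two; the names `replaced` and `kept` are illustrative.

```python
x = [1, 23, 32, 8, 33, 97, 123]

# Conditional expression before "for": every element is kept,
# but values of 30 or more are replaced by 0.
replaced = [m if m < 30 else 0 for m in x]

# Condition after "for": elements that fail the test are dropped.
kept = [m for m in x if m < 30]

print(replaced)  # [1, 23, 0, 8, 0, 0, 0]
print(kept)      # [1, 23, 8]
```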
- The all and any operations
These built-in functions are a useful way to check whether every element (all) or at least one element (any) of a data set meets a requirement.
conda = [item < 30 for item in x]
all(conda)
False
any(conda)
True
We can check the values of variables like this (%whos is an IPython/Jupyter magic command):
%whos
Variable   Type     Data/Info
-----------------------------
a          range    range(0, 10)
b          range    range(10, 20)
c          zip      <zip object at 0x10b91af08>
conda      list     n=7
m          tuple    n=2
result     list     n=7
x          list     n=7
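Returning to all and any: the intermediate list of booleans is not strictly required. As a minimal sketch (variable names here are illustrative), a generator expression can be passed directly, which lets both functions stop as soon as the answer is known. The behaviour on an empty input is also worth remembering.

```python
x = [1, 23, 32, 8, 33, 97, 123]

# Generator expressions avoid building a temporary list of booleans.
all_small = all(item < 30 for item in x)  # stops at the first failing element
any_small = any(item < 30 for item in x)  # stops at the first passing element

# Edge case: with no elements, all() is True and any() is False.
all_empty = all([])
any_empty = any([])

print(all_small, any_small)  # False True
print(all_empty, any_empty)  # True False
```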
- Generate Data Set
Note: In Python, a function can return multiple results:
def cal(x, y):
    return (x + y), (x * y), (x / y)
a,b,c = cal(10, 5)
a
15
b
50
c
2.0
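Under the hood, the multiple results are returned as a single tuple, and `a, b, c = ...` simply unpacks it. The sketch below repeats the `cal` function from the note and shows the packed tuple, plain unpacking, and starred unpacking; the names `packed`, `total`, `product`, `quotient`, `head`, and `rest` are illustrative.

```python
def cal(x, y):
    # The comma-separated values are packed into one tuple.
    return (x + y), (x * y), (x / y)

packed = cal(10, 5)                 # a single tuple holding all three results
total, product, quotient = packed   # tuple unpacking into three names

# A starred target collects the remaining values into a list.
head, *rest = cal(10, 5)

print(packed)      # (15, 50, 2.0)
print(head, rest)  # 15 [50, 2.0]
```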
- The Counter collection
Counter is a useful class from the collections module. It is a dictionary subclass that counts how many times each element appears in a data set, like this:
x = [1, 20, 10, 2, 20, 2, 2, 20]
from collections import Counter
result = Counter(x)
for key in result:
    print(str(key) + " : " + str(result[key]))
1 : 1
20 : 3
10 : 1
2 : 3
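Beyond looping over the keys, Counter offers a few conveniences. The sketch below, using the same list as above, shows `most_common`, the zero default for missing keys, and `update`; the names `counts`, `top_two`, and `missing` are illustrative.

```python
from collections import Counter

x = [1, 20, 10, 2, 20, 2, 2, 20]
counts = Counter(x)

# The two most frequent elements with their counts (20 and 2 each appear 3 times).
top_two = counts.most_common(2)

# Unlike a plain dict, a missing key returns 0 instead of raising KeyError.
missing = counts[99]

# update() adds more observations to the existing counts.
counts.update([1, 1])

print(top_two)
print(missing)
print(counts[1])
```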
- Generators
I do not use generators often, but I must admit that they are a useful tool for producing data lazily, one value at a time. An example follows:
# Yields the even numbers 2, 4, 6, ... that are smaller than n.
def generate_even(n):
    i = 2
    while i < n:
        yield i
        i += 2
generate_20 = generate_even(20)
for m in generate_20:
    print(m)
2
4
6
8
10
12
14
16
18
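To make the lazy behaviour more concrete, here is a small sketch. It assumes a generator named `generate_even` that yields the same values as the one above (2, 4, ..., up to but not including n); the other names are illustrative. Values are produced only when requested, and the generator remembers where it stopped.

```python
from itertools import islice

def generate_even(n):
    # Yields the even numbers 2, 4, 6, ... that are smaller than n.
    i = 2
    while i < n:
        yield i
        i += 2

gen = generate_even(20)
first = next(gen)                  # pulls exactly one value
next_three = list(islice(gen, 3))  # pulls the following three values
remaining = list(gen)              # consumes whatever is left

# A generator can feed an aggregate directly without building a list.
sum_of_squares = sum(m * m for m in generate_even(20))

print(first)           # 2
print(next_three)      # [4, 6, 8]
print(remaining)       # [10, 12, 14, 16, 18]
print(sum_of_squares)  # 1140
```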
- How to use *args and **kwargs
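def cal(*args, **kwargs):
    # args is a tuple of positional arguments, kwargs a dict of keyword arguments.
    total = 0  # named total to avoid shadowing the built-in sum()
    for m in args:
        total += m
    for n in kwargs:
        total += kwargs[n]
    return total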
cal(1,2,3,5,k=12,m=34)
57
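The same star syntax also works in the other direction, at the call site. As a hedged sketch (the helper `describe` and the variables `numbers` and `named` are hypothetical, and `cal` is rewritten here in a shorter but equivalent form), an existing list and dictionary can be unpacked into positional and keyword arguments:

```python
def cal(*args, **kwargs):
    # Equivalent to the loop version: add all positional and keyword values.
    return sum(args) + sum(kwargs.values())

def describe(*args, **kwargs):
    # Hypothetical helper: reports the container types Python uses.
    return type(args).__name__, type(kwargs).__name__

numbers = [1, 2, 3, 5]
named = {"k": 12, "m": 34}

# * unpacks a sequence into positional arguments,
# ** unpacks a dict into keyword arguments.
result = cal(*numbers, **named)
kinds = describe(1, 2, key="value")

print(result)  # 57
print(kinds)   # ('tuple', 'dict')
```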
2. NumPy
NumPy is an efficient library for working with numerical data. I have studied it before and already have a study note about it on my blog, so I skip the details here.
Here is a simple example:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(0,5,500)
y = np.cos(x)
mplot = plt.plot(x,y)
x = np.linspace(0, 10, 1000)
y = np.sin(x)
mplot = plt.plot(x, y)
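The plotting example relies on two NumPy behaviours: `np.linspace` returns evenly spaced points that include both endpoints, and functions such as `np.cos` apply to every element at once. The following sketch checks both without plotting; it assumes NumPy is installed, and the names `step` and `identity_holds` are illustrative.

```python
import numpy as np

# 500 evenly spaced points from 0 to 5, endpoints included.
x = np.linspace(0, 5, 500)
step = x[1] - x[0]  # spacing is (5 - 0) / (500 - 1)

# Element-wise evaluation: y has the same shape as x, no Python loop needed.
y = np.cos(x)

# Vectorized arithmetic on whole arrays: cos^2 + sin^2 should be 1 everywhere.
identity_holds = np.allclose(np.cos(x) ** 2 + np.sin(x) ** 2, 1.0)

print(x.shape, y.shape)
print(step)
print(identity_holds)
```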