Python Advanced: Data Analysis and Detection
Learning path: iFLYTEK AI University, "Python Data Grouping & Integration Series" and "Python Data Analysis: Detecting Financial Fraud"
1. Data Collection
Web scraping, etc.
import pandas as pd
df = pd.DataFrame({"col1": ['item1', 'item2', ...], "col2": [...], ...})
#build a DataFrame from a dict of column lists
df = pd.read_csv("filename.csv", index_col=0, parse_dates=["col1"])
#index_col=0: use the first column as the index; parse_dates: parse the listed columns as datetimes
2. Data Processing
pandas data grouping
group by
gp = df.groupby( )
#DataFrameGroupBy object
#rows with the same grouping-key value end up in the same group
"col1"
"col1", sort=False
["col1", "col2"]
lambda x: x.year
def key(x): return x.year
#grouping keys can also be functions applied to each index label
gp.get_group("col1 val1")
#DataFrame containing only that group's rows
gp.groups
#dict: {group key: Index of that group's row labels, ... }
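A minimal sketch of these grouping accessors, using a hypothetical two-column frame:
import pandas as pd
df = pd.DataFrame({"col1": ["a", "a", "b"], "col2": [1, 2, 3]})
gp = df.groupby("col1")   #DataFrameGroupBy object
gp.get_group("a")         #DataFrame of the two rows where col1 == "a"
gp.groups                 #{'a': [0, 1], 'b': [2]} (values are Index objects)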
aggregate
calculations
#DataFrame
gp.sum()
gp.aggregate(sum)
gp.agg(sum)
gp.agg(np.sum)
#the four calls above are equivalent; agg is an alias for aggregate
gp[["col3", "col4"]].sum()
#aggregate only the selected columns
gp.size()
#how many records in each group
gp.mean()
gp.median()
gp.std()
#standard deviation
gp.describe()
#summary statistics per group: count, mean, std, min, quartiles, max
gp.agg([np.mean, np.std, np.sum, ...])
gp.first()
gp.last()
gp.min()
gp.max()
len(gp)
#number of groups (not rows)
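A hedged sketch of the aggregation calls, continuing the df/gp from the grouping example above:
import numpy as np
gp.sum()                    #one row per group, numeric columns summed
gp.agg([np.mean, np.std])   #multi-level columns: one sub-column per function
gp.size()                   #Series: number of records per group
len(gp)                     #2 — the number of groups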
iterate
for key, group in gp: ...
#key is the group label; each group is a DataFrame
transform
transformed = gp.transform( )
#applies a function group-wise; the result keeps the original index
lambda x: (x - x.mean()) / x.std()
#z-score standardization within each group
join
pd.DataFrame({ "Original Adj Close": df["Adj Close"], "Transformed Adj Close": transformed["Adj Close"] })
index
df.index
#e.g. DatetimeIndex(['date1', 'date2', ... ])
[i]
#Timestamp('YYYY-MM-DD hh:mm:ss')
.year
#YYYY
.month
.weekday()
#0~6, Monday == 0
df.index = pd.PeriodIndex( index, freq=? )
index
[str(i[0]) + "/" + str(i[1]) for i in df.index.values]
#.index.values: list [(yyyy, m), ... ]
freq=
"M"
#month
df.first_valid_index()
#index of the first non-NaN row
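A short sketch of the index accessors and the PeriodIndex conversion, following the label-building pattern above (the data values are illustrative assumptions):
import pandas as pd
df = pd.DataFrame({"x": [1.0, 2.0, 3.0]},
                  index=pd.to_datetime(["2019-01-02", "2019-02-03", "2019-02-10"]))
df.index               #DatetimeIndex(['2019-01-02', ...])
df.index[0]            #Timestamp('2019-01-02 00:00:00')
df.index[0].weekday()  #2 — Wednesday (Monday == 0)
monthly = df.groupby([df.index.year, df.index.month]).mean()
labels = [str(i[0]) + "/" + str(i[1]) for i in monthly.index.values]   #['2019/1', '2019/2']
monthly.index = pd.PeriodIndex(labels, freq="M")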
pandas data integration
concatenate
df = pd.concat( [df1, df2, s1, s2, ...], keys=?, axis=?, join=?, join_axes=?)
#can concat DataFrames and Series together
#note: join_axes was removed in pandas 1.0; reindex the result instead
keys=
["c1", "c2", ...]
#adds an outermost index level labeling each input c1, c2, ...
axis=
0
#concat by rows
1
#concat by columns
join=
"inner"
#keep only the row/column labels shared by all inputs (drops the would-be NaN rows/columns)
join= accepts only "inner" / "outer", NOT "left" / "right"
join_axes=
[df1.index]
#force the non-concatenation axis to use df1.index
df.loc["c1"]
#with keys= set, selects the sub-DataFrame that came from df1
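A concat sketch showing keys= and join="inner" on two tiny frames:
import pandas as pd
df1 = pd.DataFrame({"A": [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({"B": [3, 4]}, index=[1, 2])
df = pd.concat([df1, df2], keys=["c1", "c2"])    #outermost index level labels each input
df.loc["c1"]                                     #the rows that came from df1
pd.concat([df1, df2], axis=1, join="inner")      #keeps only the shared index label 1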
append
df = df1.append( [df2, df3, s1, s2, ...])
#deprecated since pandas 1.4 and removed in 2.0; use pd.concat instead
merge
df = pd.merge( df1, df2, on="common col", how=?)
how=
"inner"
#default
"outer"
"left"
"right"
join
df = df1.join(df2)
#aligns on the index by default
how=
"inner"
"outer"
"left"
#default
"right"
matplotlib data visualization
import numpy as np
df.shape
#tuple: (no. of rows, no. of columns)
df = df.astype(float)
#convert non-numeric (e.g. string) values to float
plot
df.plot()
df.plot(grid=True, figsize=(10, 20))
df["Adj Close"].plot()
#from pandas
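A hedged plotting sketch on synthetic data (the column name is an assumption; pandas draws through matplotlib, so plt.show() displays the figure):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({"Adj Close": np.random.rand(100).cumsum()},
                  index=pd.date_range("2020-01-01", periods=100))
df.plot(grid=True, figsize=(10, 5))   #figsize is (width, height) in inches
plt.show()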
4. Data Modeling
Model types
Classification & regression
e.g. risk assessment, business growth, house price prediction
Classification: maps data to classes predefined by attribute values
e.g. logistic regression, decision trees, SVM, random forests
Regression: fits a function to historical attribute values via error analysis to predict future trends
Clustering
e.g. grouping symptoms into diseases, discovering premium users, precision marketing
Partitions unlabeled data by information similarity
Time-series analysis
e.g. next quarter's sales, tomorrow's electricity demand
Predicts future values from sequential (e.g. temporal) patterns in historical data
Logistic regression
Non-linear fit for classification problems
Basic model
1. Sample X(x0, x1, ..., xn)
2. Fitted parameters Θ(θ0, θ1, ..., θn)
3. Linear combination Z = Θ^T · X
4. Prediction function hΘ(X) = sigmoid g(Z) = 1 / ( 1 + e^(-Z) ) ∈ (0, 1)
5. Positive sample (y=1): probability P = hΘ(X); negative sample (y=0): probability P = 1 - hΘ(X)
6. Cost cost(hΘ(X), y) = -log(P), where P is the probability assigned to the true label y
7. Loss function J(Θ) = 1/m · Σ cost( hΘ(x^(i)), y^(i) )
Train on samples to find the optimal Θ that minimizes J(Θ)
Gradient descent
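A minimal NumPy sketch of batch gradient descent for the model above; the learning rate and iteration count are arbitrary choices, not values from the course:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))           #g(Z) = 1 / (1 + e^(-Z))

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    X = np.hstack([np.ones((len(X), 1)), X])  #prepend x0 = 1 so θ0 is the intercept
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(n_iter):
        h = sigmoid(X @ theta)                #hΘ(X) for every sample
        theta -= lr * X.T @ (h - y) / m       #step down the gradient of J(Θ)
    return theta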
ROC AUC: area under the Receiver Operating Characteristic curve
Evaluation metric for binary classification models
How to plot
1. TP, FP, TN, FN = True/False Positive(1)/Negative(0)
2. TPR = TP / (TP + FN); FPR = FP / (TN + FP)
3. ROC curve: TPR (y-axis) against FPR (x-axis)
4. Repeat for different thresholds ∈ (0, 1)
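A sketch of computing one ROC point at a single threshold; sweeping the threshold traces the full curve:
import numpy as np

def roc_point(y_true, y_score, threshold):
    y_pred = (y_score >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return fp / (tn + fp), tp / (tp + fn)   #(FPR, TPR)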
Interpreting AUC
> 0.5
better than random guessing
= 0.5
no better than random
< 0.5
worse than random, unless the prediction is deliberately inverted
sklearn
LR
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=?, random_state=?)
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
y_score = lr_model.predict_proba(X_test)
#class probabilities, shape (n_samples, 2); column 1 is the positive class
ROC
fpr, tpr, thresholds = roc_curve(y_test, y_score[:, 1])
roc_auc = auc(fpr, tpr)
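A self-contained run of the pipeline above on synthetic data; the test_size and random_state values are arbitrary:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = (X[:, 0] + 0.5 * rng.randn(200) > 0).astype(int)   #synthetic binary labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
lr_model = LogisticRegression().fit(X_train, y_train)
y_score = lr_model.predict_proba(X_test)               #shape (n_test, 2)
fpr, tpr, thresholds = roc_curve(y_test, y_score[:, 1])
roc_auc = auc(fpr, tpr)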
3. Data Analysis
inspect and cleanse
readable, complete, unique, authentic, legal
Inspect
compare columns against each other
compare value counts
look for key features
plot(kind='bar')
sns.heatmap()
#seaborn (import seaborn as sns), e.g. sns.heatmap(df.corr()) for feature correlations
Cleanse
drop redundant rows/columns
preprocessing.LabelEncoder() to transform key categorical features
#sklearn
np.random.choice: subsample the over-represented class to correct data imbalance
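A hedged sketch of the last two steps — label-encoding a categorical feature and undersampling the majority class; the column names and values are illustrative assumptions:
import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({"cat": ["x", "y", "x", "z", "x", "x"],
                   "label": [0, 0, 0, 0, 1, 1]})
le = preprocessing.LabelEncoder()
df["cat"] = le.fit_transform(df["cat"])          #strings -> integer codes
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]
keep = np.random.choice(majority.index, size=len(minority), replace=False)
balanced = pd.concat([majority.loc[keep], minority])   #one-to-one class ratio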