Mind Map Community: Data Analysis - Modeling and Machine Learning
A quick reference to common machine-learning models, covering regression models, classification models, clustering models, deep learning, ensemble methods, model tuning, model evaluation, and feature engineering
Edited 2020-07-14 14:26:14 - Common Modeling and Machine Learning
Regression Models (Regression)
Linear Regression
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X_train1, Y_train1)
Y_predict1 = model.predict(X_predict1)
Fitting with statsmodels
import statsmodels.api as sm
X_train = sm.add_constant(X_train)  # X_train is an array with multiple columns
res = sm.OLS(y_train, X_train).fit()
res.summary()  # inspect the fit statistics
Reading the summary:
R-squared: coefficient of determination, larger is better
Prob (F-statistic): smaller is better
BIC: Bayesian information criterion, smaller is better
P>|t|: should be below 0.05, otherwise drop that term
coef: const is the intercept, followed by each column's coefficient
y_predict = a*x1 + b*x2 + c*x3 + d
Nonlinear Regression
Fitting with scipy
from scipy.optimize import curve_fit
popt, pcov = curve_fit(func, X_train, y_train)
a, b, c, d = popt
y_predict = func(X_train, a, b, c, d)
Fitting with numpy
p = np.polyfit(X_train, y_train, deg=2)
y_predict = np.polyval(p, X_train)
Fitting with sklearn (more capable)
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=2)             # build degree-2 polynomial features
X_train = poly_reg.fit_transform(X.reshape(-1, 1))  # expand each value into polynomial terms
model = linear_model.LinearRegression()             # linear model over the expanded features
model.fit(X_train, Y_train)                         # fit on the training data
model.intercept_, model.coef_                       # intercept and coefficients
Y_predicted = model.predict(X_train)
Logistic Regression
from sklearn.linear_model import LogisticRegression
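Like the other sklearn models in this map, logistic regression follows the fit/predict pattern; a minimal sketch on synthetic data (the dataset is generated purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data, for illustration only
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

model = LogisticRegression().fit(X, y)
pred = model.predict(X)         # hard class labels (0/1)
proba = model.predict_proba(X)  # per-class probabilities, shape (100, 2)
```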
Random Forest Regression
from sklearn.ensemble import RandomForestRegressor
Decision Tree Regression
from sklearn.tree import DecisionTreeRegressor
KNN Nearest-Neighbor Regression
Example use: predicting house prices from the number of rooms
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor()
knn.fit(Xtrain, ytrain)
predictions = knn.predict(Xtest)
XGBoost Regression
from xgboost import XGBRegressor
Classification Models (Classification)
Random Forest Classification
from sklearn.ensemble import RandomForestClassifier
Decision Tree Classification
from sklearn.tree import DecisionTreeClassifier
KNN Nearest-Neighbor Classification
from sklearn.neighbors import KNeighborsClassifier
Naive Bayes
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(features, y_train)  # features built by a vectorizer `vec`
classifier.predict(vec.transform(test_words))
Support Vector Machine (SVM)
A binary classification model
Finds a line (defined by W and b) such that the distance to the points nearest it is as large as possible
Solves the decision-boundary problem and makes class predictions
Linear-kernel boundary
from sklearn.svm import SVC  # "Support vector classifier"
model = SVC(kernel='linear', C=0.1)
model.fit(X, y)
Gaussian (RBF) kernel for a nonlinear boundary
clf = SVC(kernel='rbf', C=1, gamma=0.1)
clf.fit(X, y)
LightGBM Classifier
from lightgbm import LGBMClassifier
XGBoost Classification
from xgboost import XGBClassifier
Clustering Models (Clustering)
K-means Clustering
Unsupervised: you set K, the number of clusters (labels)
Works on two-dimensional or higher data: choose initial centroid values and count, compute each point's distance to every centroid, reassign and update over many iterations, and finally assign each point to its nearest centroid's cluster
from sklearn.cluster import KMeans
km = KMeans(n_clusters=5, init='random', n_init=1, max_iter=1, random_state=1)
km.fit_transform(similarity_df)
DBSCAN Algorithm
Unsupervised: the number of clusters (labels) emerges from the radius (eps) and density settings
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.05, min_samples=5)
dbscan.fit(X)
BIRCH Algorithm
from sklearn.cluster import Birch
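A minimal Birch sketch on made-up 2-D points forming two obvious groups (the points and n_clusters are assumptions for the example):

```python
import numpy as np
from sklearn.cluster import Birch

# Six made-up points forming two well-separated groups
X = np.array([[0.0, 1.0], [0.3, 1.0], [-0.3, 1.0],
              [0.0, -1.0], [0.3, -1.0], [-0.3, -1.0]])

birch = Birch(n_clusters=2).fit(X)
labels = birch.labels_  # one cluster label per sample
```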
Deep Learning
A subset of machine learning, which is itself part of artificial intelligence. Its main role is to extract the needed features automatically through the algorithm and model on them; the key lies in feature engineering
Took off in 2012, when AlexNet won the ImageNet competition
Application areas
Autonomous driving, face recognition, image classification, lesion detection, star-map identification
The larger the dataset, the better the results tend to be
Standard machine-learning workflow
Collect data and assign labels
Train a classifier
Test
Evaluate
Neural Networks
Score function
Assigns a per-class score to each sample in the dataset (e.g. an image) based on its values (pixels)
W is the weight matrix: one set of weights per class, applied to every feature value; b is a per-class bias (fine-tuning) term
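A NumPy sketch of the score function s = W·x + b (three classes, four features; all numbers are made up for illustration):

```python
import numpy as np

W = np.array([[0.2, -0.5, 0.1, 2.0],    # weights for class 0
              [1.5,  1.3, 2.1, 0.0],    # weights for class 1
              [0.0,  0.3, 0.2, -0.3]])  # weights for class 2
b = np.array([1.1, 3.2, -1.2])          # per-class bias (fine-tuning) term
x = np.array([56.0, 231.0, 24.0, 2.0])  # one flattened input, e.g. pixel values

scores = W @ x + b  # one score per class, shape (3,)
```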
Loss function
Processes the scores above: for every class other than the true one, take max(0, that class's score minus the true class's score plus 1) and sum them up. Smaller values are better and indicate a cleaner classification
Softmax classifier
Exponentiates the scores and normalizes them into class probabilities; the loss is the negative log probability of the true class
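Both losses above can be sketched in NumPy on one sample's scores (the scores and true-class index are illustrative):

```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])  # class scores for one sample
correct = 0                          # index of the true class

# Hinge (SVM) loss: sum over wrong classes of max(0, s_j - s_true + 1)
margins = np.maximum(0, scores - scores[correct] + 1)
margins[correct] = 0
hinge_loss = margins.sum()           # smaller means better separation

# Softmax classifier: normalize exponentiated scores into probabilities,
# then take the negative log probability of the true class
exp = np.exp(scores - scores.max())  # shift by the max for numerical stability
probs = exp / exp.sum()
softmax_loss = -np.log(probs[correct])
```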
Network structure
Characteristics of neural networks
1. Gradient Descent
Cost function
Local minima
Global minimum
2. Variable
A storage location for a value that keeps changing during training
3. Neural Layers
Input Layer
l1 = input_layer(input, in_size, out_size, activation_function)
Hidden Layers
hidden_layer1(l1_output, l1_out_size, custom_out_size, activation_function or None)
......
Output Layer
4. Activation Functions
relu
sigmoid
tanh
softplus
softmax
elu
5. Hidden layers
Typically use activations such as relu, tanh, or softplus
6. Classifier
Use sigmoid or softmax for the output layer
7. Regression
Use a linear activation for the output layer
8. Reducing overfitting to lower prediction error
Dropout
On each pass, keep each neuron only with a specified probability p rather than using all of them; averaging over many such partial networks counteracts overfitting
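An illustrative NumPy sketch of inverted dropout (the keep probability and the fake activations are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.8                         # probability of keeping each neuron
activations = np.ones((4, 10))  # fake layer output for a batch of 4

# Training: zero out ~(1-p) of the neurons and rescale the survivors by 1/p,
# so the expected activation is unchanged (inverted dropout)
mask = (rng.random(activations.shape) < p) / p
dropped = activations * mask

# At inference time all neurons are used and no mask is applied
```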
L1/L2 regularization
L2: the sum of the squares of all weights across all layers
L1: the sum of the absolute values of all weights across all layers
Cross-validation
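The two penalties can be computed directly over all layers' weights (the weight matrices below are made up):

```python
import numpy as np

# Fake weight matrices for a two-layer network
weights = [np.array([[0.5, -1.0], [2.0, 0.0]]),
           np.array([[1.0], [-0.5]])]

l2_penalty = sum((w ** 2).sum() for w in weights)   # sum of squared weights
l1_penalty = sum(np.abs(w).sum() for w in weights)  # sum of absolute weights
```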
9. Optimizer
10. Convolutional Neural Network (CNN)
11. Recurrent Neural Network (RNN)
Vanishing and exploding gradients
LSTM networks (Long Short-Term Memory)
Input gate
Output gate
Forget gate
12. Autoencoder
13. Batch Normalization
14. Reinforcement Learning
DQN(Deep Q Network)
Combines neural networks (NN) with Q-learning
Experience replay
Fixed Q-targets
GAN (Generative Adversarial Nets)
Generates meaningful data from random noise
Generator
Discriminator
15. Saving and restoring a trained model
import pickle
import joblib  # older sklearn used `from sklearn.externals import joblib`, removed in recent versions
Save model
with open('save/clf.pickle','wb') as f: pickle.dump(clf,f)
joblib.dump(clf,'save/clf.pkl')
Restore model
with open('save/clf.pickle','rb') as f: clf2 = pickle.load(f)
joblib.load('save/clf.pkl')
Ensemble Methods (ensemble)
Single-model ensembling
For example, a decision tree model builds only one tree:
from sklearn.tree import DecisionTreeClassifier
A random forest builds many trees, effectively ensembling multiple decision trees:
from sklearn.ensemble import RandomForestClassifier
Multi-model ensembling
Simple averaging
Import the modules
Fit several models separately, predict with each, then average the predictions
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.kernel_approximation import Nystroem
from sklearn.kernel_approximation import RBFSampler
from sklearn.ensemble import RandomForestClassifier
Create the models
def get_models():
    """Generate a library of base learners."""
    nb = GaussianNB()
    svc = SVC(C=100, probability=True)
    knn = KNeighborsClassifier(n_neighbors=3)
    lr = LogisticRegression(C=10, penalty='l1', solver='liblinear', random_state=SEED)
    nn = MLPClassifier((80, 10), early_stopping=False, random_state=SEED)
    gbc = GradientBoostingClassifier(n_estimators=100, random_state=SEED)
    rf = RandomForestClassifier(n_estimators=10, max_features=3, random_state=SEED)
    models = {'svc': svc, 'knn': knn, 'naive bayes': nb, 'mlp-nn': nn,
              'random forest': rf, 'gbc': gbc, 'logistic': lr}
    return models
Fit, predict, and score
def train_predict_scores(xtrain, xtest, ytrain, ytest, models):
    """Fit each model on the training set; return test predictions and AUC scores."""
    P = pd.DataFrame(np.zeros((ytest.shape[0], len(models))))
    print("Fitting models.")
    cols, scores = [], []
    for i, (name, m) in enumerate(models.items()):
        print("%s..." % name, end=" ", flush=False)
        m.fit(xtrain, ytrain)
        P.iloc[:, i] = m.predict_proba(xtest)[:, 1]
        # Score each model on its column of the prediction DataFrame
        score = roc_auc_score(ytest, P.iloc[:, i])
        scores.append([name, score])
        cols.append(name)
        print("done")
    P.columns = cols
    print("Done.\n")
    return P, scores
Weighted averaging
Stacking models
Define the first-layer base models
base_learners = get_models()
Define the meta-learner (the weight-assignment model)
meta_learner = GradientBoostingClassifier(
    n_estimators=1000,
    loss="exponential",
    max_features=4,
    max_depth=3,
    subsample=0.5,
    learning_rate=0.005,
    random_state=SEED,
)
Train the base models
def train_base_learners(base_learners, inp, out, verbose=True):
    """Train all base learners in the library."""
    if verbose:
        print("Fitting models.")
    for name, m in base_learners.items():
        if verbose:
            print("%s..." % name, end=" ", flush=False)
        m.fit(inp, out)
        if verbose:
            print("done")
Generate base-learner predictions
def predict_base_learners(pred_base_learners, inp, verbose=True):
    """Generate a prediction matrix."""
    P = np.zeros((inp.shape[0], len(pred_base_learners)))
    if verbose:
        print("Generating base learner predictions.")
    for i, (name, m) in enumerate(pred_base_learners.items()):
        if verbose:
            print("%s..." % name, end=" ", flush=False)
        p = m.predict_proba(inp)
        # With two classes, the probability of one class suffices
        P[:, i] = p[:, 1]
        if verbose:
            print("done")
    return P
Ensemble prediction
def ensemble_predict(base_learners, meta_learner, inp, out, xtest, ytest, verbose=True):
    """Generate predictions from the ensemble."""
    # First-layer base predictions on the held-out split of the training set
    P_base = predict_base_learners(base_learners, inp)
    # Fit the meta-learner (weight assignment) on those base predictions
    meta_learner.fit(P_base, out)
    # First-layer base predictions on the test set
    P_pred = predict_base_learners(base_learners, xtest, verbose=verbose)
    # Meta-learner combines the base predictions into the final prediction
    P_ense = meta_learner.predict_proba(P_pred)[:, 1]
    print("\nEnsemble ROC-AUC score: %.3f" % roc_auc_score(ytest, P_ense))
    return P_pred, P_ense
Parallel multi-model ensembling
from mlens.ensemble import SuperLearner
sl = SuperLearner(folds=10, random_state=SEED, verbose=2,
                  backend="multiprocessing", n_jobs=-1)
sl.add(list(base_learners.values()), proba=True)
sl.add_meta(meta_learner, proba=True)
sl.fit(Xtr, ytr)
p_sl = sl.predict_proba(Xte)
print("\nSuper Learner ROC-AUC score: %.3f" % roc_auc_score(yte, p_sl[:, 1]))
Model Tuning
Cross-validation
from sklearn.model_selection import KFold
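A minimal KFold sketch (data and fold count are illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = []
for train_idx, test_idx in kf.split(X):
    # fit on X[train_idx], validate on X[test_idx]
    fold_sizes.append(len(test_idx))
```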
Randomized search
from sklearn.model_selection import RandomizedSearchCV
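RandomizedSearchCV samples a fixed number of parameter combinations from distributions instead of trying them all; a minimal sketch (the model, distributions, and data are illustrative):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=100, random_state=0)

param_dist = {'n_estimators': randint(10, 50),  # sampled, not enumerated
              'max_depth': randint(2, 6)}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=5, cv=3, random_state=0)
search.fit(X, y)
best = search.best_params_
```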
Exhaustive grid search
from sklearn.model_selection import GridSearchCV
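GridSearchCV exhaustively tries every combination in the grid; a minimal sketch (the grid and data are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=0)

param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
grid = GridSearchCV(SVC(), param_grid, cv=3)  # tries all 6 combinations
grid.fit(X, y)
best = grid.best_params_
```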
Model Evaluation (Metrics)
Regression metrics
Mean absolute error (MAE)
from sklearn.metrics import mean_absolute_error
Mean squared error (MSE)
from sklearn.metrics import mean_squared_error
Coefficient of determination (R²)
from sklearn.metrics import r2_score
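The three regression metrics in one sketch (y_true/y_pred are made-up values):

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)  # average absolute error
mse = mean_squared_error(y_true, y_pred)   # average squared error
r2 = r2_score(y_true, y_pred)              # 1.0 is a perfect fit
```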
Classification metrics
Confusion matrix
from sklearn.metrics import confusion_matrix
Accuracy
from sklearn.metrics import accuracy_score
Recall
from sklearn.metrics import recall_score
AUC
from sklearn.metrics import roc_auc_score
F1 score
from sklearn.metrics import f1_score
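The classification metrics in one sketch (labels and scores are made up; note that roc_auc_score takes probabilities or scores, not hard labels):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             recall_score, roc_auc_score)

y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]
y_score = [0.1, 0.9, 0.4, 0.2, 0.8, 0.7]  # predicted probability of class 1

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted
acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)     # fraction of true positives found
f1 = f1_score(y_true, y_pred)          # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)
```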
Clustering metrics
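One common choice here (not listed in the original notes) is the silhouette score, which approaches 1 for tight, well-separated clusters; a minimal sketch on made-up points:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two made-up, well-separated groups of points
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)  # ranges over [-1, 1]
```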
Data Preprocessing (Preprocessing)
Train/test split
from sklearn.model_selection import train_test_split
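A minimal split sketch (array shapes are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 70/30 split, reproducible via random_state
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```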
Dimensionality-preserving transforms
Missing-value imputation
from sklearn.impute import SimpleImputer
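A minimal imputation sketch (the array and strategy are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

imp = SimpleImputer(strategy='mean')  # replace NaNs with the column mean
X_filled = imp.fit_transform(X)
```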
Binarizing numeric features
Maps values above a threshold to 1 and the rest to 0
from sklearn.preprocessing import Binarizer
bn = Binarizer(threshold=0.9)
pd_watched = bn.transform([df[col]])[0]
# or
pd_watched = bn.transform(np.array(df[col]).reshape(-1, 1))
watched = np.array(df[col])
watched[watched >= 1] = 1
df[col] = watched
Standardizing numeric features
from sklearn.preprocessing import StandardScaler
std = StandardScaler().fit(X)
Formula: z = (x - mean) / std
Normalizing numeric features (min-max)
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler().fit(X)
Formula: x' = (x - min) / (max - min)
Binning continuous values (discretization, labeling)
df[col] = np.floor(np.array(df['Age']) / 10.)
# or
df[col] = np.ceil(np.array(df['Age']) / 10.)
Quantile transformation of numeric features
from sklearn.preprocessing import QuantileTransformer
qt = QuantileTransformer(n_quantiles=10, random_state=0)
inmm = qt.fit_transform(np.array(fcc_survey_df['Income']).reshape(-1, 1)).flatten()
Numeric labels, cut into n bins (string labels also work)
df['quantile_label'] = pd.qcut(df['Income'], q=np.linspace(0,1,n+1), labels=np.linspace(0,9,n))
Interval labels
df['quantile_label'] = pd.qcut(df['Income'], q=np.linspace(0,1,n))
Label-encoding string features
from sklearn.preprocessing import OrdinalEncoder
ole = OrdinalEncoder().fit(np.array(df[col]).reshape(-1, 1))
labels = ole.transform(np.array(df[col]).reshape(-1, 1))
labels.flatten()  # or labels.flat
from sklearn.preprocessing import LabelEncoder
gle = LabelEncoder()
labels = gle.fit_transform(df[col])
map
df[col] = df[col].map({v: k for k, v in enumerate(df[col].unique())})
df['col'] = pd.Categorical(df.col).codes
# or
X[col] = X[col].astype('category').cat.codes
Log transform for numeric features
df[col] = np.log(1+df[col])
Dimensionality Expansion
One-hot encoding of string features
pd.get_dummies(data)
from sklearn.preprocessing import OneHotEncoder
gen_ohe = OneHotEncoder()
gen_feature_arr = gen_ohe.fit_transform(df[[col]]).toarray()
String features: word-count (bag-of-words) model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(min_df=0., max_df=1.).fit(words)
cv_matrix = cv.transform(words).toarray()
String features: N-grams model (counts of word sequences)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(2, 2))
cv_matrix = cv.fit_transform(words).toarray()
String features: TF-IDF (inverse document frequency) model, weighting words by importance
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
tv_matrix = tv.fit_transform(words).toarray()
Polynomial features from numeric values
from sklearn.preprocessing import PolynomialFeatures
pf = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
res = pf.fit_transform(df[[col1, col2]])
Function transforms of numeric features
from sklearn.preprocessing import FunctionTransformer
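FunctionTransformer wraps an arbitrary function (here np.log1p, matching the log transform above) so it can sit in a pipeline; a minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

log_tf = FunctionTransformer(np.log1p)  # element-wise log(1 + x)

X = np.array([[0.0], [np.e - 1.0]])
X_log = log_tf.fit_transform(X)
```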
Expanding dates into features
df['date'] = df['TS_obj'].apply(lambda d: d.date())  # extract the date part
df['Year'] = df['TS_obj'].apply(lambda d: d.year)
df['Month'] = df['TS_obj'].apply(lambda d: d.month)
df['Day'] = df['TS_obj'].apply(lambda d: d.day)
df['DayOfWeek'] = df['TS_obj'].apply(lambda d: d.dayofweek)
df['DayName'] = df['TS_obj'].apply(lambda d: d.day_name())
df['MonthName'] = df['TS_obj'].apply(lambda d: d.month_name())
df['DayOfYear'] = df['TS_obj'].apply(lambda d: d.dayofyear)
df['WeekOfYear'] = df['TS_obj'].apply(lambda d: d.weekofyear)
df['Quarter'] = df['TS_obj'].apply(lambda d: d.quarter)
df['Day_3'] = pd.cut(df['Day'], bins=np.linspace(1, 31, 4), labels=['upper', 'middle', 'down'])
Expanding times into features
df['time'] = df['TS_obj'].apply(lambda d: d.time())  # extract the time part
df['Hour'] = df['TS_obj'].apply(lambda d: d.hour)
df['Minute'] = df['TS_obj'].apply(lambda d: d.minute)
df['Second'] = df['TS_obj'].apply(lambda d: d.second)
df['MUsecond'] = df['TS_obj'].apply(lambda d: d.microsecond)   # microseconds
df['UTC_offset'] = df['TS_obj'].apply(lambda d: d.utcoffset())  # UTC offset
hour_bins = [-1, 5, 11, 16, 21, 23]
bin_names = ['Late Night', 'Morning', 'Afternoon', 'Evening', 'Night']
df['TimeOfDayBin'] = pd.cut(df['Hour'], bins=hour_bins, labels=bin_names)
Dimensionality Reduction
PCA (principal component analysis)
from sklearn.decomposition import PCA
pca = PCA(n_components=2, svd_solver='full')
pca.fit(X)
LDA (Latent Dirichlet Allocation, a topic model; not linear discriminant analysis)
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=2, max_iter=100, random_state=42)
dt_matrix = lda.fit_transform(tv_matrix)
Filtering out highly correlated features via the correlation matrix
def choice_weakcorr_features(df, corrnum):
    cors = df.corr(method='spearman')
    weakcorr_col = df.columns.tolist()  # start with every feature name
    for col in df.columns:              # iterate over the features
        if col not in weakcorr_col:     # skip features already removed
            continue                    # this step matters
        col_cor = cors[col]             # correlations of this feature with the others
        # names of features correlated with it above the threshold
        cols = col_cor[col_cor > corrnum].index.tolist()
        if col in cols:
            cols.remove(col)            # drop the self-correlation
        for c in cols:                  # remove the highly correlated partners
            if c in weakcorr_col:
                weakcorr_col.remove(c)
    return weakcorr_col
Random-forest feature-importance selection
def importance_features_from_rfr(X, y, test_size, cumsum):
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=test_size, random_state=42)
    rf = RandomForestRegressor().fit(Xtrain, ytrain)
    imp = pd.DataFrame(
        [[col, fea] for col, fea in zip(X.columns, rf.feature_importances_)],
        columns=['cols', 'importance'])
    imp.sort_values('importance', ascending=False, inplace=True)
    imp['cumsum'] = imp['importance'].cumsum()
    # keep the features whose cumulative importance stays below the cutoff
    importance_features = imp[imp['cumsum'] < cumsum]['cols']
    return importance_features