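CFA Level II: Machine Learning (1)
The definition of machine learning, the types of machine learning, and some concepts for evaluating how well machine learning models perform.
Edited on 2020-01-05 03:14:41
Machine Learning
What is machine learning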
Machine learning seeks to extract knowledge from large amounts of data with no a priori restrictive assumptions.
find the pattern, apply the pattern
high dimensionality, non-linearity
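Types
Supervised learning
Common application scenarios for supervised learning are classification problems and regression problems. Common algorithms include logistic regression and back-propagation neural networks.
Given a dataset together with the correct answers (the target Y and the features X are already labeled), the algorithm learns to predict the "correct answer" for each example in the dataset.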
Supervised learning involves ML algorithms that infer patterns between a set of inputs (the X’s) and the desired output (Y). The inferred pattern is then used to map a given input set into a predicted output.
Multiple linear regression is an example of supervised learning.
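To make the "infer a pattern from labeled examples, then map a new input to a predicted output" idea concrete, here is a minimal sketch in Python. It uses one-feature linear regression, the simplest case of the multiple linear regression mentioned above. The data and the names `fit_line` and `predict` are illustrative assumptions, not part of the original notes.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, x):
    """Map a new input x to a predicted output using the fitted pattern."""
    intercept, slope = model
    return intercept + slope * x


# Hypothetical labeled training data: feature X and target Y.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]  # constructed as y = 1 + 2x

model = fit_line(train_x, train_y)      # infer the pattern
new_prediction = predict(model, 5.0)    # apply the pattern to an unseen input
```

Because the toy targets lie exactly on a line, the fitted model recovers that line; real data would leave residual error.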
Applications of supervised learning
Supervised learning can be divided into two categories of problems, regression problems and classification problems, with the distinction between them being determined by the nature of the target (Y) variable. If the target variable is continuous, then the task is one of regression (even if the ML technique used is not “regression,” note this nuance of ML terminology). If the target variable is categorical or ordinal (i.e., a ranked category), then it is a classification problem. Regression and classification use different ML techniques.
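Regression
Regression problems focus on predicting a continuous target variable. They suit large datasets with many numerical features, many of which are correlated. Examples: using historical stock market returns to predict future stock price performance, or using a company's historical financial metrics to predict the probability of a bond default.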
Regression focuses on making predictions of continuous target variables.
Penalized regression
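Classification
Classification problems focus on sorting observations into categories. When the dependent variable (target) is categorical, the model that relates the outcome to the independent variables (features) is called a "classifier". Examples: financial statement fraud vs. no fraud (binary), and ratings assignment (multi-class and ordinal).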
Classification focuses on sorting observations into distinct categories
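A small sketch of a binary classifier may help here. It uses logistic regression, which these notes list as a common supervised algorithm, with a single feature and plain batch gradient descent. The "risk score" data, the learning rate, the epoch count, and the function names are all hypothetical choices made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """One-feature logistic regression fitted by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # derivative of log-loss w.r.t. the score
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b


def classify(model, x, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    w, b = model
    return 1 if sigmoid(w * x + b) >= threshold else 0


# Hypothetical labeled data: x is a made-up risk score, y = 1 means "fraud".
scores = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
labels = [0, 0, 0, 1, 1, 1]

classifier = train_logistic(scores, labels)
predictions = [classify(classifier, x) for x in scores]
```

The target here is categorical (0 or 1), which is what makes this a classification task rather than a regression task, even though the technique has "regression" in its name.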
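Unsupervised learning
In unsupervised learning the data are not labeled, and the model is trained to infer some of the data's underlying structure. Common application scenarios include association rule learning and clustering. Common algorithms include the Apriori algorithm and the k-Means algorithm.
The dataset has no target outcome, so the algorithm discovers the structure in the data on its own. This suits domains where the data are too large or too complex for humans to inspect directly.
Applications of unsupervised learning
Dimension reduction
Reduce the number of features while preserving the variation across observations. In investment and risk management, for example, it is used to identify the main factors that drive asset prices.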
A set of techniques for reducing the number of features in a dataset while retaining variation across observations to preserve the information contained in that variation.
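The notes do not name a specific dimension reduction technique. As one illustration, the sketch below uses principal components analysis (an assumption on my part) restricted to two features, so that the largest eigenvalue of the covariance matrix has a closed form and no linear algebra library is needed. The data and function names are hypothetical.

```python
import math


def first_principal_component(points):
    """Return (unit direction, share of total variance) for 2-feature data."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix (closed form).
    largest = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    # Matching eigenvector; fall back to an axis if the features are uncorrelated.
    if b != 0:
        vx, vy = b, largest - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm), largest / (a + c)


def project(points, direction):
    """Replace each 2-feature observation with a single score."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    return [
        (p[0] - mean_x) * direction[0] + (p[1] - mean_y) * direction[1]
        for p in points
    ]


# Hypothetical observations with two strongly correlated features.
observations = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
direction, explained_share = first_principal_component(observations)
reduced = project(observations, direction)  # one feature instead of two
```

When `explained_share` is close to 1, the single projected score retains almost all of the variation across observations, which is the sense in which the feature count is reduced without losing much information.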
Clustering
Sort observations into groups, e.g., group companies by the characteristics of their financial metrics rather than by industry or region.
Clustering has been used by asset managers to sort companies into empirically determined groupings (e.g., based on their financial statement data) rather than conventional groupings (e.g., based on sectors or countries).
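The grouping idea can be sketched with the k-Means algorithm mentioned earlier in these notes. This is a bare-bones version under simplifying assumptions: two features per company, squared Euclidean distance, and the first k observations as initial centroids (real implementations usually initialize more carefully). The company data are invented.

```python
def k_means(points, k, iterations=20):
    """Plain k-means on 2-D points; initial centroids are the first k points."""
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        for i, (x, y) in enumerate(points):
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            assignments[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return assignments, centroids


# Hypothetical companies described by (profit margin, leverage ratio).
companies = [
    (0.05, 2.9), (0.25, 0.5), (0.06, 3.1),
    (0.24, 0.6), (0.04, 3.0), (0.26, 0.4),
]
groups, centers = k_means(companies, k=2)
```

No labels are supplied: the groups come only from the financial features, not from sector or country tags. In practice the features would usually be standardized first so that no single feature dominates the distance.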
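Deep Learning and Reinforcement Learning
Based on neural networks (NNs, or ANNs)
Neural networks, also called artificial neural networks, comprise highly flexible ML algorithms that have been applied successfully to a variety of tasks characterized by non-linearities and interactions among features. Neural networks are not only a common approach to classification and regression but also the foundation of deep learning and reinforcement learning, which can be either supervised or unsupervised.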
deep learning
image classification, face recognition, speech recognition, natural language processing
Reinforcement Learning
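The computer learns by interacting with itself (or with data generated by the same algorithm).
Choosing a machine learning algorithm
Evaluating model performance
Data sets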
training sample
validation sample
test sample
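A minimal sketch of carving one dataset into the three samples listed above. The 60/20/20 proportions, the fixed seed, and the function name `split_dataset` are assumptions for illustration; the notes do not prescribe them. The usual roles are: fit the model on the training sample, tune it on the validation sample, and judge out-of-sample performance on the test sample.

```python
import random


def split_dataset(observations, train_share=0.6, validation_share=0.2, seed=42):
    """Shuffle once, then cut into training, validation, and test samples."""
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train_share)
    n_validation = round(len(shuffled) * validation_share)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains
    return training, validation, test


data = list(range(100))  # stand-in for 100 labeled observations
training, validation, test = split_dataset(data)
```

Each observation lands in exactly one sample, so the test sample stays untouched during fitting and tuning. Random shuffling is itself an assumption; ordered data such as time series may call for a chronological split instead.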
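Fitting
Overfitting
The model fits well in-sample but does not predict new out-of-sample data well (it has built some noise or random fluctuation into the model).
The evaluation of any ML algorithm focuses on its prediction error on new data, not on its goodness of fit to the data it was fitted on (i.e., the training data).
Bias error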
Bias error, or the degree to which a model fits the training data. Algorithms with erroneous assumptions produce high bias with poor approximation, causing underfitting and high in-sample error.
Algorithms with erroneous assumptions
underfitting, high in-sample error
Variance error
Variance error, or how much the model’s results change in response to new data from validation and test samples. Unstable models pick up noise and produce high variance, causing overfitting and high out-of-sample error.
Unstable models pick up noise.
overfitting and high out-of-sample error.
Base error
Base error due to randomness in the data
underfitting
The model has not found the proper relationships in the data.
underfitting means the model does not capture the relationships in the data
robust fitting
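The contrast between underfitting and overfitting can be illustrated with a toy experiment. Everything in this sketch is an assumption chosen for illustration: the true relation y = 2x, the noise level, the sample size, and the three stand-in models (a constant predictor for high bias, a memorizing nearest-neighbor lookup for high variance, and a least-squares line in between).

```python
import random


def mse(predict, xs, ys):
    """Mean squared prediction error of a model on a sample."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_sample(rng, n):
    """Hypothetical data: y = 2x plus random noise (the source of base error)."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


rng = random.Random(0)
train_x, train_y = make_sample(rng, 30)   # data the models are fitted on
new_x, new_y = make_sample(rng, 30)       # new data, never seen in fitting

mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)


def underfit(x):
    """High bias: ignores x and always predicts the training mean."""
    return mean_y


def overfit(x):
    """High variance: memorizes the training sample (1-nearest-neighbor)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))


def linear(x):
    """Least-squares line through the training sample."""
    return mean_y + slope * (x - mean_x)


# (in-sample error, out-of-sample error) for each model
errors = {
    name: (mse(model, train_x, train_y), mse(model, new_x, new_y))
    for name, model in [("underfit", underfit), ("overfit", overfit), ("linear", linear)]
}
```

By construction the memorizing model has zero in-sample error, yet its error on new data is positive because it has also memorized the noise. The constant model cannot beat the fitted line in-sample, since the line is the least-squares optimum over a class that includes every constant. Comparing the two numbers in each pair is the point: evaluation rests on the second one.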
Learning curve and fitting curve
learning curve
fitting curve

Preventing overfitting
Add an overfitting penalty function
1) preventing the algorithm from getting too complex during selection and training, which requires estimating an overfitting penalty
Occam’s razor
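The notes do not say what form the overfitting penalty takes. As one possible illustration, the sketch below adds a squared-coefficient (ridge-style) penalty to a one-feature least-squares fit; this choice of penalty, the data, and the name `penalized_slope` are assumptions. Minimizing sum((y - b*x)^2) + penalty * b^2 on centered data gives b = Sxy / (Sxx + penalty), so a larger penalty pulls the slope toward zero, i.e., toward a simpler model.

```python
def penalized_slope(xs, ys, penalty):
    """Slope minimizing squared error plus penalty * slope**2.

    x and y are centered first, so the intercept drops out and the
    closed-form solution is Sxy / (Sxx + penalty).
    """
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / (sxx + penalty)


# Hypothetical data lying on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

# penalty = 0 reproduces ordinary least squares; larger values shrink the slope.
slopes = {penalty: penalized_slope(xs, ys, penalty) for penalty in (0.0, 5.0, 45.0)}
```

The size of the penalty is itself a choice that has to be estimated or tuned, typically on data the model was not fitted on.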
Cross-validation
2) proper data sampling achieved by using cross-validation, a technique for estimating out-of-sample error directly by determining the error in validation samples.
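One common way to estimate out-of-sample error from validation samples is k-fold cross-validation; the notes do not specify a variant, so the k-fold scheme here is an assumption. In this sketch the folds are contiguous and unshuffled, and the "model" is a deliberately trivial mean predictor; all names and data are hypothetical.

```python
def k_fold_indices(n, k):
    """Yield (training_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        training = list(range(0, start)) + list(range(start + size, n))
        yield training, validation
        start += size


def cross_validated_error(xs, ys, fit, predict, k=5):
    """Average validation mean squared error over k folds."""
    fold_errors = []
    for training, validation in k_fold_indices(len(xs), k):
        # Fit only on the training part of this fold.
        model = fit([xs[i] for i in training], [ys[i] for i in training])
        # Measure error only on the held-out validation part.
        squared = [(predict(model, xs[i]) - ys[i]) ** 2 for i in validation]
        fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / k


def fit_mean(xs, ys):
    """Illustrative model: remember the training mean of y."""
    return sum(ys) / len(ys)


def predict_mean(model, x):
    return model


xs = [float(i) for i in range(10)]
ys = [1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0, 3.0]
cv_error = cross_validated_error(xs, ys, fit_mean, predict_mean, k=5)
```

Every observation serves as validation data exactly once and is never used to fit the model that predicts it, which is why the averaged error is an estimate of out-of-sample rather than in-sample error. Any `fit`/`predict` pair with the same calling convention could be swapped in.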