CFA Level II Quantitative Methods knowledge framework
Full-coverage mind map of the Level II Quantitative Methods syllabus, with key exam points and details.
Edited 2022-03-10 12:05:13
Quantitative Methods
Quantitative Method (1)
Linear regression
Assumptions
Linearity: Y is linearly related to the parameter b1
Homoskedasticity --- violation: heteroskedasticity
Independence --- violation: serial correlation
Normality
Point estimate
Formulas for b0 and b1
Computing b0 and b1 on the calculator
Confidence interval estimate
Confidence interval calculation
The standard error of b1 can be backed out from the t-stat in the ANOVA table
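The point and interval estimates above can be sketched numerically. A minimal Python example with made-up (x, y) data; the critical value 2.776 is the two-tailed 5% t-value for df = n - 2 = 4:

```python
import math

# Hypothetical paired observations (X, Y) -- illustrative data only
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 9.9, 12.2]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n

# b1 = Cov(X, Y) / Var(X); b0 = mean(Y) - b1 * mean(X)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Standard error of b1: s_b1 = sqrt(MSE / Sxx), with MSE = SSE / (n - 2)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
mse = sse / (n - 2)
s_b1 = math.sqrt(mse / sxx)

# 95% confidence interval for b1: b1 +/- t_crit * s_b1
t_crit = 2.776  # two-tailed 5% critical t-value, df = n - 2 = 4
ci = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)
```

The same b0 and b1 come out of the calculator's linear regression worksheet; the interval here just applies the t·Sb formula from the outline.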
Test of regression coefficient
significant test about regression coefficient
t-statistics
p-value
confidence interval method
hypothesis test about regression coefficient ---t-statistics
significant test for correlation
F-test: Ho: b1=b2=…=bk=0
ANOVA Table
SS, df, MSS
SEE
Coefficient of determination R2, Multiple R
Estimate of Y
point estimate
confidence interval estimate
Different functional form: log
Limitations of regression
relation can change over time
public knowledge of regression
regression assumptions
Multiple regression
Differences between multiple and simple linear regression
Partial regression coefficient: holding other variables constant
No exact linear relation among the X's: violation causes multicollinearity
Hypothesis test
Individual coefficient bi: t-test (degrees of freedom differ)
F-test: k differs
F-test vs. t-test
Simple regression: t^2 = F
Multiple regression: F accounts for correlation among the X's, which affects both t and F results
Adjusted R-square
Formula: adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1)
R^2 >= adjusted R^2
Adjusted R^2 may be < 0
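The adjusted R^2 formula and its two properties can be checked with a quick sketch (R^2, n, and k values are hypothetical):

```python
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical: R^2 = 0.60 with n = 30 observations and k = 5 regressors
adj = adjusted_r2(0.60, 30, 5)

# With a low enough R^2, adjusted R^2 goes negative
low = adjusted_r2(0.05, 30, 5)
```

Here adj ≈ 0.517 < 0.60, showing R^2 >= adjusted R^2, and low < 0, showing the "may be < 0" point.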
Dummy variable: X
n categories require n - 1 dummy variables
Interpretation of the intercept and estimated coefficients
Interpretation of t-test conclusions
Assumption violations
Heteroskedasticity
unconditional vs conditional
Effect
Not affected
point estimates b
consistency of parameter estimates
Affected
interval estimates t·Sb
Tests
t-stat -- via Sb
MSE understated, P(Type I error) rises
MSE overstated, P(Type II error) rises
F-stat -- via MSE
MSE understated, P(Type I error) rises
MSE overstated, P(Type II error) rises
Detecting
scatter plot
BP test
Ho: no Heteroskedasticity
Chi-square test: BP=n·R(residual)^2
Chi-square distribution, df = k, one-tailed test
Correcting
robust standard errors (White-corrected standard errors)
generalized least squares
Serial correlation(autocorrelation)
Positive vs. negative serial correlation
Effect
Not affected
point estimates b
consistency of parameter estimates
Affected
interval estimates t·Sb
t/F test results
Positive SC: MSE understated, P(Type I error) rises
Negative SC: MSE overstated, P(Type II error) rises
Detecting
scatter plot
DW test
Ho: no SC/ no positive SC
DW=2·(1-r)
α, k, n, decision rule
Correcting
Hansen method (robust standard errors): note the difference from the White correction
generalized least squares
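The DW statistic itself is easy to compute from residuals. A sketch with illustrative residuals: the exact definition is used, and the outline's approximation DW ≈ 2·(1 − r) links it to the lag-1 residual correlation:

```python
# Durbin-Watson statistic: DW = sum((e_t - e_{t-1})^2) / sum(e_t^2).
# Approximately DW = 2 * (1 - r), where r is the lag-1 residual correlation,
# so DW near 2 -> no SC; DW < 2 -> positive SC; DW > 2 -> negative SC.
resid = [0.5, 0.4, 0.3, 0.1, -0.2, -0.4, -0.3, -0.4]  # illustrative, positively correlated

num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
den = sum(e * e for e in resid)
dw = num / den  # well below 2 here -> evidence of positive serial correlation
```

The decision rule then compares DW with the tabulated lower/upper bounds for the given α, k, and n.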
Multicollinearity
matter of degree rather than absence or presence
Effect
Not affected
consistency of parameter estimates
Affected
point estimates b
interval estimates t·Sb
t/F test results
Multicollinearity inflates the standard errors Sb, so P(Type II error) rises
Detecting
Classic method: insignificant t+ significant F+ high R-square
Occasionally suggested method: r>0.7
Correcting: exclude one or more regression variables
Model misspecifications---inconsistent
Includes
incorrect set of variables
incorrect regression equation's functional form
principle of model specifications
economic reasoning
nature of variable
parsimonious
examined for violations
useful out of sample
Categories
Misspecified functional form
omitted variable
inappropriate variable scaling
inappropriate data pooling
Time-series misspecifications
lagged dependent variable as independent variable with serially correlated errors
function of a dependent variable as an independent variable
independent variables are measured with errors
Other types of time-series misspecifications(nonstationary)
relations among time series with trend
relations among time series that may be random walk
Qualitative dependent variable
probit and logit model
discrimination model: Z-score
Time series analysis
Trend model
linear trend model
Yt is an arithmetic sequence; Y changes by a constant amount: b1
Scatter plot approximates a straight line
log-linear trend model
Yt is a geometric sequence; Y grows at a constant rate: e^b1 - 1
Scatter plot shows an exponential trend
limitation
time series data usually exhibit serial correlation, so trend models are often not appropriate
Autoregressive model(AR)
multiperiod forecast: chain rule
Assumptions of AR
covariance stationary
Strict stationarity vs. weak (covariance) stationarity
3 conditions for covariance stationarity
expected value constant and finite over time
variance constant and finite over time
covariance constant and finite over time
Properties
a stationary past doesn't guarantee stationarity in the future
a covariance-stationary time series has a finite mean-reverting level x = b0/(1 - b1)
Effect of violation: unit root / b1 = 1 / random walk
Random walk
Random walk with drift
features
no mean reverting level
infinite variance
Detecting
Unit root test of nonstationarity: a plain t-test of Ho: b1 = 1 is not valid; use the Dickey-Fuller test
Dickey-Fuller test
Xt - Xt-1 = b0 + (b1 - 1)·Xt-1 + ε
g = b1 - 1; Ho: g = 0, Ha: g < 0
Look up the revised (Dickey-Fuller) t-table
Correcting: first-differencing
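The DF regression can be sketched directly: regress the first difference on the lagged level and read off g = b1 − 1 (illustrative series; a real test would compare the t-stat of g against the DF table):

```python
# Dickey-Fuller sketch: regress (X_t - X_{t-1}) on X_{t-1}; the slope is
# g = b1 - 1. Ho: g = 0 (unit root) vs Ha: g < 0. Series is illustrative.
x = [10.0, 10.4, 10.1, 10.6, 10.3, 10.8, 10.5, 11.0, 10.7, 11.2]

dx = [x[t] - x[t - 1] for t in range(1, len(x))]  # first differences
lag = x[:-1]                                      # lagged levels X_{t-1}

n = len(dx)
m_lag = sum(lag) / n
m_dx = sum(dx) / n
g = sum((a - m_lag) * (b - m_dx) for a, b in zip(lag, dx)) \
    / sum((a - m_lag) ** 2 for a in lag)
# g near 0 suggests a unit root; here the oscillating series gives a clearly
# negative g, consistent with a mean-reverting (stationary) process.
```

First-differencing (the dx series above) is also the correction: if X has a unit root, dx is covariance stationary.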
Assumption: errors are uncorrelated; violation: autocorrelation
Effect
Not affected
point estimates b
consistency of parameter estimates
Affected
interval estimates t·Sb: MSE affects Sb
t/F test results
Positive SC: MSE understated, P(Type I error) rises
Negative SC: MSE overstated, P(Type II error) rises
Detecting
The DW test is not applicable for AR models; instead run a significance test on the residual autocorrelations (which should be 0 if the model is correctly specified)
t-test: Sr = 1/sqrt(T), where T = number of observations = sample size - p; df = T - k - 1; reject Ho → r ≠ 0 → autocorrelation exists
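That t-test can be sketched as follows (illustrative residuals; the alternating signs are chosen to produce a clearly negative lag-1 autocorrelation):

```python
import math

# t-test for residual autocorrelation in an AR model: the standard error of
# each sample autocorrelation is approximately 1/sqrt(T).
resid = [0.3, -0.2, 0.4, -0.1, 0.2, -0.3, 0.1, -0.2, 0.3, -0.1,
         0.2, -0.4, 0.1, -0.2, 0.3, -0.1, 0.2, -0.3, 0.4, -0.2]
T = len(resid)

# lag-1 sample autocorrelation of the residuals
m = sum(resid) / T
num = sum((resid[t] - m) * (resid[t - 1] - m) for t in range(1, T))
den = sum((e - m) ** 2 for e in resid)
r1 = num / den

se = 1 / math.sqrt(T)  # standard error of the autocorrelation
t_stat = r1 / se       # |t| beyond the critical value -> reject Ho, autocorrelation
```

Here r1 is strongly negative and |t| exceeds 2, so Ho (no autocorrelation) would be rejected and a lag (e.g. a seasonal lag) should be added.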
Correcting
adds seasonal lag
Homoskedasticity: violation is ARCH (autoregressive conditional heteroskedasticity)
Effect
Not affected
point estimates b
consistency of parameter estimates
Affected
interval estimates t·Sb
Tests
t-stat -- via Sb
MSE understated, P(Type I error) rises
MSE overstated, P(Type II error) rises
F-stat -- via MSE
MSE understated, P(Type I error) rises
MSE overstated, P(Type II error) rises
Detecting
ARCH(1)
significance test for a1
t-distribution
Correcting: GLS
More than one time series
Cointegration
DF-EG (Engle-Granger) test: reject Ho → cointegrated → multiple regression can be used
Comparing Model Performance
Quantitative
in-sample forecast errors
out-of-sample forecast errors:RMSE
Qualitative
instability of regression coefficient
data from earlier and later
shorter and longer period of data
Quantitative Method (2)
Machine learning
machine learning vs. statistical approaches
Traditional statistics requires distributional assumptions
Data size
Linear vs. nonlinear relationships
Data complexity (dimensionality)
Terminology for X and Y (features/target vs. independent/dependent variables)
hyperparameters
Types
Supervised learning
labeled training data
Types
regression model: continuous target variable
classification model
binary classification
multicategory classification
Unsupervised learning
Unlabeled data
Types
Dimension reduction
clustering
Deep learning & reinforcement learning
Applicable to both supervised & unsupervised learning
based on neural network
deep learning: used for complex tasks
reinforcement: learn from its own prediction errors
overfitting
issue with Supervised Machine Learning
three nonoverlapping data sets
training sample
validation sample--tuning
test sample--evaluate
three errors
bias error: in-sample error; the model does not fit the training data well; underfitting; high in-sample error
variance error: out-of-sample error; overfitting; high out-of-sample error
base error: residual errors, not preventable
fitting curve: optimal complexity of model
addressing method
complexity reduction: overfitting penalty
cross validation
k-fold cross validation
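k-fold cross validation can be sketched in a few lines. Here the "model" is just the training-fold mean, a stand-in for any learner; the data and k are illustrative:

```python
# k-fold cross validation sketch: split the data into k folds, fit on the
# other k-1 folds, validate on the held-out fold, average the errors.
data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
k = 5
fold_size = len(data) // k

errors = []
for i in range(k):
    val = data[i * fold_size:(i + 1) * fold_size]          # validation fold
    train = data[:i * fold_size] + data[(i + 1) * fold_size:]
    pred = sum(train) / len(train)                          # "fit": training mean
    errors.append(sum((v - pred) ** 2 for v in val) / len(val))

cv_error = sum(errors) / k  # averaged estimate of out-of-sample error
```

Tuning a hyperparameter means repeating this loop for each candidate value and keeping the one with the lowest cv_error.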
Supervised Learning Algorithms
Penalized regression-regression/continuous
Penalty term: LASSO vs. OLS -- linear
Regularization: can also be applied to non-linear models
Support vector machine (SVM) -- classification/distinct
Mechanism: linear, binary classification, hyperplane, maximum margin, support vectors, discriminant boundary
Types
hard margin: linear classifier
soft margin: not perfectly linear; trade-off between a wider margin and classification error
Best suited for: small-to-medium size & complex high-dimensional data
K-nearest neighbor(KNN)-classification/distinct
Mechanism: linear; classifies a new observation by finding its most similar neighbors; majority vote
two concerns
hyperparameter k
k too small: high error rate
k too large: dilutes the result by averaging
k even: there may be no clear winner
hard to clearly define "similar"
Applicable to binary and multiclass classification
Classification and regression tree(CART)-regression&classification
Mechanism
linear & non-linear
classification tree: categorical target variable; regression tree: continuous target variable
no black box
decision tree
features, branches, cut off value
initial root node: widest separation, minimize classification error
decision node: lower within-group error
terminal node: splitting stops when classification error no longer diminishes much from another split; for classification, predict the majority of data points; for regression, predict the mean of the labelled values
Pros & cons
Pro: provides a visual explanation
Con: overfitting; to avoid:
regularization
prune sections with low explanatory power
Ensemble learning and random forest -- combined algorithms
ensemble learning
aggregation of heterogeneous learners
aggregation of homogeneous learners trained on different training data --- bootstrap aggregating (bagging): repeated sampling with replacement
random forest
variant of classification tree + data bagged from same data set
subset of features used in creating each tree---mitigate overfitting
determine final classification: wisdom of crowd
Pros
protects against overfitting
reduces the noise-to-signal ratio -- errors cancel out across different trees
Con: black box
Unsupervised Learning Algorithms
Dimension reduction: principal components analysis
composite variable, Eigenvectors, Eigenvalue(RSS/TSS)--avoid multicollinearity
Pros & cons
Pro: fewer features, avoids overfitting
Con: eigenvectors are combinations of the original features, are not well-defined concepts, and can be perceived as a black box
Clustering
k-means clustering
Mechanism: hyperparameter k, k non-overlapping clusters, centroids
Best suited for
very large data sets
high dimensional data
Cons
the choice of hyperparameter k affects outcomes
Solution: run with a range of values for k to find the optimal number of clusters
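The k-means mechanism above (assign each point to the nearest centroid, move each centroid to its cluster mean, repeat until stable) can be sketched in one dimension with k = 2 and illustrative points:

```python
# k-means sketch (1-D, k = 2). Points and initial centroids are illustrative;
# k-means is sensitive to the initial centroid choice.
points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
centroids = [0.0, 10.0]

for _ in range(20):
    # assignment step: each point goes to its nearest centroid
    clusters = [[], []]
    for p in points:
        nearest = 0 if abs(p - centroids[0]) <= abs(p - centroids[1]) else 1
        clusters[nearest].append(p)
    # update step: move each centroid to the mean of its cluster
    new_centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    if new_centroids == centroids:  # assignments settled -> converged
        break
    centroids = new_centroids
```

The two centroids settle near 1.0 and 8.0, the two obvious groupings; rerunning with a range of k values and comparing fit is the tuning step mentioned above.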
hierarchical clustering
no predefined number of clusters
agglomerative (bottom-up) clustering
divisive (top-down) clustering
Neural Networks
Mechanism
artificial neural networks(ANN)
high dimensional data/ linear&nonlinear data
three types of layers
input layers: features
hidden layers: ways of transmitting data
output layer: the predicted result
hyperparameters, e.g. a 4-5-1 structure (4 input nodes, 5 hidden nodes, 1 output node)
each node
summation operator---total net input
activation operator
transform total net input into final output of the node
light dimmer switch---decrease or increase the strength of input
non-linear & linear
neurons modelling
input, synaptic weights, bias term, total net input, summation operator, activation function, output
forward propagation: computes forward through the network
Adjustment and tuning
backward propagation: computes backward, adjusting the synaptic weights
revision of hyperparameters based on out of sample performance
Applications
deep neural networks(DNNs)
more than 20 hidden layers
useful in general for image, pattern and speech recognition
reinforcement learning: learns based on immediate feedback from(millions of) trials and errors-AlphaGo
Choice of ML algorithms
if data is complex(too many features)
yes
dimension reduction
no
if classification
yes
if supervised
yes
linear: KNN, SVM
nonlinear: CART, random forest, neural networks
no
linear: k-means clustering or hierarchical clustering
nonlinear: neural networks
no
linear: penalized regression
nonlinear: CART, random forest, neural networks
Big data projects
introduction
characteristics: volume, variety, velocity, veracity (validity), value
data analysis steps: conceptualize the modeling task, data collection, data preparation and wrangling, data exploration, model training
structured data
1. conceptualize task/blueprint/modifiable plan
2. data collection
external data
access through API(application programming interface)
vendor: csv or other formats
internal data
3. data preparation and wrangling
data preparation(cleansing)
incompleteness error
invalidity error
inaccuracy error
inconsistency error
non-uniformity error
duplication error
outliers
trimming(truncation)
winsorization: replace outliers with the maximum or minimum non-outlier value
data wrangling(preprocessing)
transformation
extraction: e.g. birth date → age
aggregation: e.g. salary + revenue = total income
filtration: remove data rows that are not needed
selection: remove columns that are not needed, e.g. keep only one of name and ID
conversion: e.g. CAD → USD
scaling
normalization
Formula: (x - min)/(max - min), rescaling to [0, 1]
Pro: can be used when the data distribution is unknown
Con: sensitive to outliers
standardization
Formula: (x - μ)/σ
Pro: less sensitive to outliers, as it depends on mean and standard deviation
Con: data should be normally distributed
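The two scaling formulas side by side, on an illustrative column that includes an outlier:

```python
import math

# Min-max normalization vs. z-score standardization on the same toy column
values = [10.0, 20.0, 30.0, 40.0, 100.0]  # note the outlier at 100

# Normalization: (x - min) / (max - min), rescaled into [0, 1]
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Standardization: (x - mu) / sigma, giving mean 0 and unit variance
mu = sum(values) / len(values)
sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
standardized = [(v - mu) / sigma for v in values]
```

The outlier at 100 compresses the normalized values of the rest toward 0, illustrating the "sensitive to outliers" point above.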
4. data exploration
exploratory data analysis(EDA)
summary statistics
visualization
feature selection
feature engineering
5. model training
method selection
performance evaluation
error analysis
confusion matrix
precision, recall, accuracy, F1 score
receiver operating characteristics(ROC)
shape of ROC curve
more convex curve-better
area under the curve (AUC): 0.5 means random guessing
root mean square error(RMSE)-useful for regression model
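The confusion-matrix metrics can be sketched with hypothetical counts:

```python
# Metrics from a confusion matrix with hypothetical counts
tp, fp, fn, tn = 40, 10, 5, 45

precision = tp / (tp + fp)                  # of predicted positives, how many are real
recall = tp / (tp + fn)                     # of real positives, how many are found
accuracy = (tp + tn) / (tp + fp + fn + tn)  # overall correct rate
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
```

F1 is preferred over accuracy when the classes are imbalanced, since accuracy can look good while one class is mostly misclassified.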
model tuning
minimize total aggregate error
parameters&hyperparameters
altering the hyperparameters
each hyperparameter---confusion matrix
multiple hyperparameters
grid search: different combinations of hyperparameters
ceiling analysis: identifies which part of the pipeline can potentially improve performance the most
unstructured data
3. text preparation and wrangling
text preparation(cleansing)
remove HTML tags
remove punctuation: some needs to be replaced with annotations
remove numbers
remove white spaces
text wrangling(preprocessing)
normalization
lowercasing
removal of stop words
stemming
lemmatization
bag-of-words(BOW) procedure: N-grams
document term matrix(DTM)
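The BOW and DTM steps can be sketched on two toy "documents" (the texts are illustrative only):

```python
# Bag-of-words and document-term matrix sketch
docs = ["stock price rises", "stock price falls"]

tokens = [d.split() for d in docs]                    # tokenize each document
vocab = sorted({w for doc in tokens for w in doc})    # the bag of words

# DTM: one row per document, one column per vocab token, cells are counts
dtm = [[doc.count(w) for w in vocab] for doc in tokens]
```

Extending vocab to word pairs instead of single tokens gives the n-grams (here bigrams) variant mentioned above.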
4. text exploration
EDA
text statistics: term frequency, co-occurrence
visualization
feature selection
reduction in BOW size
methods
document frequency(DF)
Chi-square
mutual information: MI close to 1 means the token is concentrated in fewer classes and is more discriminative
feature engineering
number
n-gram
named entity recognition (NER)
parts of speech(POS)