CFA Level II: Big Data
CFA Level II: steps of big data analysis, data mining, data analysis, machine learning
Edited on 2020-04-18 14:51:21
Big Data in Investment Management
data
data:4Vs
volume
variety
velocity
veracity
Textual big data
topics
sentiment
Steps
Structured Data
Conceptualization of the modeling task
Data collection.
Data preparation and wrangling.
cleansing and preprocessing of the raw data
Data exploration
data analysis, feature selection, and feature engineering.
Model training
Unstructured Data
Text problem formulation
Data (text) curation
web spidering (scraping or crawling) programs
Text preparation and wrangling.
structured inputs
steps of text cleansing process
1. remove HTML tags
2. remove punctuation
3. remove numbers
4. remove white spaces
steps of text wrangling
normalization process
lowercasing
stop words
the, is, a
stemming
lemmatization
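The normalization steps above can be sketched in a few lines of Python. This is only a minimal illustration: the stop-word set, the lemma lookup table, and the suffix-stripping `naive_stem` helper are all hypothetical stand-ins, not a real stemmer or lemmatizer.

```python
# Tiny illustrative stop-word list (an assumption; real lists are far longer).
STOP_WORDS = {"the", "is", "a"}

# Hypothetical lemma lookup; real lemmatizers rely on a full dictionary.
LEMMAS = {"was": "be", "better": "good"}

def naive_stem(token):
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def normalize(text):
    tokens = text.lower().split()                         # lowercasing
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    # Use the lemma if one is known, otherwise fall back to the crude stem.
    return [LEMMAS.get(t, naive_stem(t)) for t in tokens]

tokens = normalize("The market is rising and analysts expected gains")
```

Stemming here produces truncated forms such as "ris" for "rising", which is typical of rule-based stemmers; lemmatization would instead map the word to its dictionary form.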
Text exploration
word clouds
Model training
Data collection
databases
application programming interface (API)
Data Preparation and Wrangling
Data Preparation (Cleansing)
Incompleteness error
Missing values and NAs
mean, median, or mode of the variable or simply assuming zero
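A minimal sketch of these imputation options, assuming missing values are represented as `None` in a plain list (the `impute` function and the `incomes` sample are illustrative, not from the source):

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Fill None entries with the mean, median, or mode of observed values, or with zero."""
    observed = [v for v in values if v is not None]
    if strategy == "zero":
        fill_value = 0
    else:
        fill_value = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill_value if v is None else v for v in values]

incomes = [40, None, 60, 60, None]   # hypothetical sample
filled = impute(incomes, "median")
```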
Invalidity error
Invalidity error is where the data are outside of a meaningful range, resulting in invalid data.
verifying other administrative data records
Inaccuracy error
Inaccuracy error is where the data are not a measure of true value.
“Don’t Know”
rectified with the help of business records and administrators
Inconsistency error
Title: Ms. Li; gender: male
Non-uniformity error
Non-uniformity error is where the data are not present in an identical format
Date of Birth column is present in various formats: 12/5/1970; 15 Jan, 1975
converting the data points into a preferable standard format.
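One way to convert such dates into a single standard format (ISO 8601 here) is to try a list of known input formats in turn. Reading "12/5/1970" as day/month/year is an assumption; the notes do not say which convention applies, and `to_iso` is an illustrative helper.

```python
from datetime import datetime

# Candidate input formats, tried in order.
# Assumption: slash-separated dates are day/month/year.
FORMATS = ("%d/%m/%Y", "%d %b, %Y")

def to_iso(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # leave unparseable values for manual review

standardized = [to_iso(d) for d in ["12/5/1970", "15 Jan, 1975"]]
```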
Duplication error
Data Wrangling (Preprocessing)
dealing with outliers, extracting useful variables from existing data points, and scaling the data.
transformations
Extraction
Derive new variables from existing ones
Extract Age from Date of Birth; or create a new field such as "ratio of total loan amount to income"
Aggregation
Salary + Other Income -> Total Income
Filtration
Filter the rows with "Beijing" in the "city" column
Selection
Select only the useful features
Conversion
Convert all amounts to USD
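The five transformations can be sketched on a pair of hypothetical records. Every field name, the exchange rates, and the as-of date below are illustrative assumptions; filtration is read here as keeping the Beijing rows.

```python
from datetime import date

# Hypothetical records; field names and values are for illustration only.
records = [
    {"dob": date(1970, 5, 12), "salary": 50000, "other_income": 5000,
     "loan": 110000, "currency": "USD", "city": "Beijing"},
    {"dob": date(1975, 1, 15), "salary": 300000, "other_income": 0,
     "loan": 150000, "currency": "CNY", "city": "Shanghai"},
]
FX_TO_USD = {"USD": 1.0, "CNY": 0.14}   # assumed exchange rates
AS_OF = date(2020, 4, 18)               # assumed valuation date

def transform(rec):
    rate = FX_TO_USD[rec["currency"]]                             # conversion to USD
    total_income = (rec["salary"] + rec["other_income"]) * rate   # aggregation
    had_birthday = (AS_OF.month, AS_OF.day) >= (rec["dob"].month, rec["dob"].day)
    age = AS_OF.year - rec["dob"].year - (0 if had_birthday else 1)  # extraction
    return {
        "city": rec["city"],
        "age": age,
        "total_income_usd": total_income,
        "loan_to_income": rec["loan"] * rate / total_income,      # extraction
    }

rows = [transform(r) for r in records]
beijing = [r for r in rows if r["city"] == "Beijing"]             # filtration
features = [{k: r[k] for k in ("age", "loan_to_income")} for r in beijing]  # selection
```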
Outliers
Standard deviation
3 standard deviations from the mean
interquartile range (IQR)
the difference between the 75th and the 25th percentile values of the data
outside of 1.5 IQR: outliers
outside of 3 IQR: extreme values
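Both detection rules can be sketched with the standard library. Two assumptions are made here that the notes do not spell out: the standard deviation is the sample standard deviation, and the IQR fences are measured from the quartiles (Q1 - k·IQR and Q3 + k·IQR). The quartile values also depend on the interpolation method; `statistics.quantiles` uses its default method below.

```python
from statistics import mean, quantiles, stdev

def sd_outliers(data, k=3):
    """Values more than k sample standard deviations from the mean."""
    m, s = mean(data), stdev(data)
    return [x for x in data if abs(x - m) > k * s]

def iqr_classify(data):
    """Split flagged values into outliers (beyond 1.5 IQR) and extreme values (beyond 3 IQR)."""
    q1, _, q3 = quantiles(data, n=4)   # 25th, 50th, 75th percentiles
    iqr = q3 - q1

    def beyond(x, mult):
        return x < q1 - mult * iqr or x > q3 + mult * iqr

    extreme = [x for x in data if beyond(x, 3)]
    outliers = [x for x in data if beyond(x, 1.5) and not beyond(x, 3)]
    return outliers, extreme

sample = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 35, 60]   # hypothetical data
outliers, extreme = iqr_classify(sample)
```

On this small sample the 3-standard-deviation rule flags nothing, because the value 60 itself inflates the standard deviation, while the IQR rule flags 35 as an outlier and 60 as an extreme value.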
trimming
a 5% trimmed dataset is one for which the 5% highest and the 5% lowest values have been removed.
winsorization
The process of replacing extreme values and outliers in a dataset with the maximum (for large value outliers) and minimum (for small value outliers) values of data points that are not outliers.
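A minimal sketch of both treatments. The `lower` and `upper` arguments to `winsorize` are assumed to come from whichever outlier rule is in use (for example the IQR fences); the function names are illustrative.

```python
def trim(data, pct=5):
    """Drop the pct% lowest and pct% highest values (pct is a whole percentage)."""
    ordered = sorted(data)
    k = len(ordered) * pct // 100          # points removed from each tail
    return ordered[k: len(ordered) - k]

def winsorize(data, lower, upper):
    """Replace values outside [lower, upper] with the nearest non-outlier value."""
    inliers = [x for x in data if lower <= x <= upper]
    lo, hi = min(inliers), max(inliers)    # extremes among the non-outliers
    return [min(max(x, lo), hi) for x in data]

trimmed = trim(list(range(1, 21)))                    # 20 points, 1 dropped per tail
capped = winsorize([1, 10, 11, 12, 13, 60], 2, 30)    # hypothetical bounds
```

Trimming shrinks the dataset, whereas winsorization keeps every observation and only caps its value.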
Scaling
Normalization
Sensitive to outliers; suitable when the distribution is unknown
Standardization
Suitable for samples that follow a normal distribution
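The two scaling methods can be written directly from their usual definitions: normalization rescales to the range [0, 1] using the minimum and maximum, and standardization centers on the mean and divides by the standard deviation. The use of the sample standard deviation below is an assumption.

```python
from statistics import mean, stdev

def min_max_normalize(xs):
    """(x - min) / (max - min): rescales values to [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """(x - mean) / standard deviation: zero mean, unit spread."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

normalized = min_max_normalize([2, 4, 6])
standardized = standardize([2, 4, 6])
```

Because normalization depends on the minimum and maximum, a single outlier compresses all other values toward one end of the range.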
Unstructured (Text) Data
text, images, videos, and audio files
For ML models, unstructured data must be converted into structured data
Text processing
cleansing
regular expression (regex)
a series that contains characters in a particular order
steps
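1. Remove HTML tags
2. Remove punctuation
Watch for special punctuation: periods, question marks, etc. mark sentence boundaries or carry meaning; hyphens and underscores keep words intact; dots can indicate abbreviations.
3. Remove numbers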
information extraction (IE)
4. Remove white spaces
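The four cleansing steps can be sketched with Python's `re` module. The patterns are one simple choice among many: punctuation is removed outright, except that hyphens and underscores are kept so that hyphenated words stay intact. The `cleanse` function name and the sample sentence are illustrative.

```python
import re

def cleanse(raw):
    text = re.sub(r"<[^>]+>", " ", raw)        # 1. remove HTML tags
    text = re.sub(r"[^\w\s-]", " ", text)      # 2. remove punctuation (keep - and _)
    text = re.sub(r"\d+", " ", text)           # 3. remove numbers
    return re.sub(r"\s+", " ", text).strip()   # 4. collapse extra white space

cleaned = cleanse("<p>Revenue rose 12%, a year-over-year gain!</p>")
```

Dropping numbers and symbols such as "%" discards information, so a fuller pipeline might replace them with annotation tokens instead of deleting them.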
preprocessing