Big Data Project
Big Data Project: Structured Data (Traditional ML Model) and Unstructured Data (Text) (Non-traditional ML Model).
Big Data Project
Structured Data (Traditional ML Model)
Task Conceptualization (define the modeling objective)
Data Collection
Data Preparation & Wrangling
Data Cleansing
Incompleteness error
Invalidity error
Inaccuracy error
Inconsistency error
Non-uniformity error
Duplicate error
Data Preprocessing
Data Transformation
Extraction
Aggregation (summing up)
Filtration (remove unneeded data rows)
Selection (remove unneeded data columns)
Conversion
Data Scaling
Normalization (rescales values to [0, 1])
X_i(normalized) = (X_i - X_min) / (X_max - X_min)
Standardization
X_i(standardized) = (X_i - μ) / σ
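A minimal sketch of the two scaling formulas, using only the Python standard library. The sample values are hypothetical, and the use of the population standard deviation for sigma is an assumption, since the outline does not say which one is meant.

```python
from statistics import mean, pstdev


def normalize(values):
    """Min-max normalization: rescale values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("normalization is undefined when all values are equal")
    return [(x - lo) / (hi - lo) for x in values]


def standardize(values):
    """Standardization: subtract the mean and divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("standardization is undefined when all values are equal")
    return [(x - mu) / sigma for x in values]


sample = [2.0, 4.0, 6.0, 8.0]  # hypothetical feature values
normalized = normalize(sample)
standardized = standardize(sample)
```

Normalization is sensitive to outliers because it depends on the minimum and maximum; standardization is less so, but it implicitly treats the feature as roughly normally distributed.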
Data Exploration
Exploratory Data Analysis (EDA)
Graphs
Charts
Other Visualizations like Heat maps and word clouds
Feature Selection
Choose pertinent features from the dataset for ML model training (e.g., judged by R²).
Feature Engineering
A process of creating new features by changing or transforming existing features.
Model Training
Model Selection (governed by 3 factors)
Supervised or Unsupervised learning
Type of Data
Size of Data
Performance Evaluation
For binary classification models
Error Analysis
Confusion Matrix
Precision Metric
The ratio of correctly predicted positive classes to all predicted positive classes.
P=TP/(TP+FP)
Recall Metric
The ratio of correctly predicted positive classes to all actual positive classes.
R=TP/(TP+FN)
Accuracy Metric
The percentage of correctly predicted classes out of total predictions.
Accuracy=(TP+TN)/(TP+FP+TN+FN)
F1 Score
The harmonic mean of Precision and Recall
F1 Score=2PR/(P+R)
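The four error-analysis metrics can be computed directly from the confusion matrix counts. The sketch below uses hypothetical counts and assumes every denominator is nonzero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute precision, recall, accuracy, and F1 from confusion matrix counts.

    Assumes tp + fp, tp + fn, and precision + recall are all nonzero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical confusion matrix: 40 TP, 10 FP, 45 TN, 5 FN
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

With these counts, precision is 40/50, recall is 40/45, accuracy is 85/100, and F1 is 16/19.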
Receiver Operating Characteristic (ROC)
False Positive Rate (FPR)
The ratio of falsely predicted positive classes to all actual negative classes.
FPR=FP/(FP+TN)
True Positive Rate (TPR)
The ratio of correctly predicted positive classes to all actual positive classes (the same as Recall).
TPR=TP/(TP+FN)
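An ROC curve plots TPR against FPR as the classification threshold varies. The following sketch computes a few such points; the labels, scores, and thresholds are hypothetical, and a score at or above the threshold is assumed to mean a positive prediction.

```python
def roc_point(labels, scores, threshold):
    """Return (FPR, TPR) when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 0:
            tn += 1
        else:
            fn += 1
    fpr = fp / (fp + tn)  # assumes at least one actual negative
    tpr = tp / (tp + fn)  # assumes at least one actual positive
    return fpr, tpr


labels = [1, 1, 0, 1, 0, 0]              # hypothetical true classes
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]  # hypothetical model scores
thresholds = (1.0, 0.75, 0.5, 0.35, 0.0)
roc_curve = [roc_point(labels, scores, t) for t in thresholds]
```

Lowering the threshold moves the point from (0, 0) toward (1, 1); a curve that bends closer to the top-left corner indicates a better classifier.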
For continuous data prediction (especially for regression)
RMSE (root mean squared error), closely related to the standard error of estimate (SEE)
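RMSE is the square root of the mean squared difference between actual and predicted values. A minimal sketch with hypothetical data:

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


actual = [3.0, 5.0, 7.0, 9.0]      # hypothetical observed values
predicted = [2.0, 5.0, 8.0, 11.0]  # hypothetical model predictions
error = rmse(actual, predicted)
```

Here the squared errors are 1, 0, 1, and 4, so the RMSE is the square root of 1.5. A smaller RMSE indicates a better fit.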
Tuning
Parameters
Learned from the training data as part of the training process and dependent on the training data.
Hyperparameters
Manually set and tuned and not dependent on the training data.
Unstructured Data (Text) (Non-traditional ML Model)
Text Problem Formulation
Data Curation
Remove all HTML tags and web markup (e.g., <p>, http links)
Remove punctuation (most, not all)
Remove numbers
Remove extra white space
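The four cleansing steps can be sketched with regular expressions. Which punctuation marks to keep is a project decision; keeping the percent and dollar signs here is only an assumption for illustration, and the sample sentence is hypothetical.

```python
import re


def cleanse(raw_text):
    """Apply the cleansing steps in order: tags, punctuation, numbers, white space."""
    text = re.sub(r"<[^>]+>", " ", raw_text)   # remove HTML tags
    text = re.sub(r"[^\w\s%$]", " ", text)     # remove most punctuation, keep % and $ (assumption)
    text = re.sub(r"\d+", " ", text)           # remove numbers
    text = re.sub(r"\s+", " ", text).strip()   # collapse extra white space
    return text


raw = "<p>Revenue rose 12%, to $3.4 billion!</p>"  # hypothetical input
cleaned = cleanse(raw)
# cleaned == "Revenue rose % to $ billion"
```

Each removed item is replaced with a space rather than nothing, so that neighbouring words are not accidentally joined; the final step then collapses the extra spaces.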
Text normalization
Lowercasing, stop-word removal, stemming, lemmatization
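A toy sketch of lowercasing, stop-word removal, and stemming. The stop-word list and suffix rules below are hypothetical and far cruder than those in a real NLP library; lemmatization is omitted because it needs a dictionary of word forms.

```python
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}  # toy list (assumption)
SUFFIXES = ("ing", "ed", "s")  # toy stemming rules (assumption)


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize_text(text):
    """Lowercase, drop stop words, then stem each remaining token."""
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [stem(t) for t in tokens]


tokens = normalize_text("The analysts are expecting rising profits")
# tokens == ["analyst", "expect", "ris", "profit"]
```

The stem "ris" shows that stemming can produce fragments that are not real words; lemmatization would instead map "rising" to the dictionary form "rise", at higher computational cost.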
Creation of bag-of-words (BOW)
A collection of the words (tokens) in the text, with no regard to their position or sequence.
Building of document term matrix (DTM)
A DTM is a matrix in which each row represents a document (or text file), each column represents a term (or token), and each cell holds the number of times that term occurs in that document.
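A minimal sketch of building a BOW and a DTM from already-cleansed texts. The three documents are hypothetical, and tokens are assumed to be separated by white space.

```python
from collections import Counter


def build_dtm(documents):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    tokenized = [doc.split() for doc in documents]
    # The bag-of-words: every distinct token, with word order discarded.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


docs = ["profit rose", "profit fell profit warning", "warning issued"]  # hypothetical
vocabulary, dtm = build_dtm(docs)
# vocabulary == ["fell", "issued", "profit", "rose", "warning"]
# dtm == [[0, 0, 1, 1, 0],
#         [1, 0, 2, 0, 1],
#         [0, 1, 0, 0, 1]]
```

The resulting matrix is structured data, so it can be fed to the same kinds of models used for structured data.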
Exploratory Data Analysis (EDA)
Text Classification
Classify texts into different classes with supervised ML approaches
Topic Modeling
Group the texts into topic clusters with unsupervised ML approaches
Sentiment Analysis
Term Frequency (TF) to create a word cloud
Frequency measure
Can be used to remove noisy features
Chi-square test
Can be used to test independence of two events
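As an illustration, the chi-square statistic for a contingency table (for example, whether a token appears versus which class the text belongs to) can be computed as below. The counts are hypothetical, and the sketch stops at the statistic rather than the p-value.

```python
def chi_square_statistic(table):
    """Chi-square statistic: sum of (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column events were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts. Rows: token present / absent; columns: class A / class B.
table = [[30, 10],
         [20, 40]]
chi2 = chi_square_statistic(table)
```

For this table the statistic is 50/3, about 16.67. A 2x2 table has one degree of freedom, for which the 5% critical value is 3.841, so independence would be rejected and the token would be considered informative for the class.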
Mutual information
Model Training