Big Data Project

Big Data Project: Structured Data (Traditional ML Model) and Unstructured Data (Text) (Untraditional ML Model).

Edited on 2022-06-26 10:29:15

CFA
EDJrvgPt

Big Data Project

Structured Data (Traditional ML Model)

Task Conceptualization (define the modeling objective)

Data Collection

Data Preparation & Wrangling

Data Cleansing

Incompleteness error

Invalidity error

Inaccuracy error

Inconsistency error

Non-uniformity error

Duplicate error

Data Preprocessing

Data Transformation

Extraction

Aggregation

Filtration

Selection (remove data columns that are not needed)

Conversion

Data Scaling

Normalization (rescales values to the range [0, 1])

Xi (normalized) = (Xi - Xmin) / (Xmax - Xmin)

Standardization

Xi (standardized) = (Xi - μ) / σ
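
The two scaling formulas above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names `normalize` and `standardize` and the sample values are hypothetical, and the use of the population standard deviation for σ is an assumption, since the outline does not say which one is intended.

```python
from statistics import mean, pstdev


def normalize(values):
    """Min-max normalization: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("normalization is undefined when all values are equal")
    return [(x - lo) / (hi - lo) for x in values]


def standardize(values):
    """Standardization: subtract the mean, then divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("standardization is undefined when all values are equal")
    return [(x - mu) / sigma for x in values]


sample = [2.0, 4.0, 6.0, 8.0]  # hypothetical feature values
normalized = normalize(sample)
standardized = standardize(sample)
```

Normalization is sensitive to outliers because it depends on the minimum and maximum, whereas standardization depends on the mean and standard deviation.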

Data Exploration

Exploratory Data Analysis (EDA)

Graphs

Charts

Other visualizations, such as heat maps and word clouds

Feature Selection

Choose pertinent features from the dataset for ML model training (e.g., using R²).

Feature Engineering

A process of creating new features by changing or transforming existing features.

Model Training

Model Selection (governed by three factors)

Supervised or Unsupervised learning

Type of Data

Size of Data

Performance Evaluation

For binary classification models

Error Analysis

Confusion Matrix

Precision Metric

The ratio of correctly predicted positive classes to all predicted positive classes.

P=TP/(TP+FP)

Recall Metric

The ratio of correctly predicted positive classes to all actual positive classes.

R=TP/(TP+FN)

Accuracy Metric

The percentage of correctly predicted classes out of total predictions.

Accuracy=(TP+TN)/(TP+FP+TN+FN)

F1 Score

The harmonic mean of Precision and Recall

F1 Score=2PR/(P+R)
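
As a worked example of the four metrics above, the sketch below counts the confusion-matrix cells for a small set of labels and applies each formula. The labels are made up for illustration, and `confusion_counts` is a hypothetical helper, with class 1 assumed to be the positive class.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, TN, FN), treating label 1 as the positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1
        elif p == 1 and a == 0:
            fp += 1
        elif p == 0 and a == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Hypothetical labels: 4 actual positives, 6 actual negatives.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(actual, predicted)  # 3, 1, 5, 1

precision = tp / (tp + fp)                           # P = TP / (TP + FP)
recall = tp / (tp + fn)                              # R = TP / (TP + FN)
accuracy = (tp + tn) / (tp + fp + tn + fn)           # share of correct predictions
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of P and R
```

With these labels, precision, recall, and F1 all equal 0.75, while accuracy is 0.8.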

Receiver Operating Characteristic (ROC)

False Positive Rate (FPR)

The ratio of falsely predicted positive classes to all actual negative classes.

FPR=FP/(FP+TN)

True Positive Rate (TPR)

The ratio of correctly predicted positive classes to all actual positive classes (the same as Recall).

TPR=TP/(TP+FN)
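
An ROC curve plots TPR against FPR as the classification threshold varies. The sketch below computes a few such points from predicted scores; the scores, labels, and thresholds are invented for illustration, `roc_points` is a hypothetical helper, and it assumes that both classes appear in the labels (otherwise a denominator would be zero).

```python
def roc_points(actual, scores, thresholds):
    """Return one (FPR, TPR) pair per threshold; label 1 is the positive class."""
    points = []
    for threshold in thresholds:
        # Predict positive when the score reaches the threshold.
        predicted = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
        fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
        fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
        tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
        fpr = fp / (fp + tn)  # FPR = FP / (FP + TN)
        tpr = tp / (tp + fn)  # TPR = TP / (TP + FN)
        points.append((fpr, tpr))
    return points


# Hypothetical labels and model scores.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
points = roc_points(actual, scores, thresholds=[1.0, 0.75, 0.5, 0.0])
```

Lowering the threshold moves the point from (0, 0), where nothing is predicted positive, toward (1, 1), where everything is.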

For continuous data prediction (especially for regression)

RMSE (root mean squared error), closely related to the standard error of estimate (SEE)
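
A minimal sketch of the RMSE calculation, assuming the common definition that divides the sum of squared errors by the number of observations n. The function name `rmse` and the sample values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: sqrt(sum((actual - predicted)^2) / n)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical actual and predicted values.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]
error = rmse(actual, predicted)  # squared errors: 1, 0, 1, 4
```

A lower RMSE indicates predictions that are closer to the actual values.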

Tuning

Parameters

Learned from the training data as part of the training process and dependent on the training data.

Hyperparameters

Manually set and tuned and not dependent on the training data.
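
The distinction can be illustrated with a one-feature regression fitted by gradient descent. In this sketch, which assumes a model of the form y ≈ w * x, the slope `w` is a parameter learned from the data, while `learning_rate` and `epochs` are hyperparameters chosen by hand. The function name and the data are hypothetical.

```python
def fit_slope(xs, ys, learning_rate, epochs):
    """Fit y ≈ w * x by gradient descent on the mean squared error."""
    w = 0.0  # parameter: updated from the training data
    n = len(xs)
    for _ in range(epochs):
        # Derivative of mean((w*x - y)^2) with respect to w.
        gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * gradient
    return w


# Hypothetical training data generated from y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Hyperparameters: set manually, not learned from xs and ys.
w = fit_slope(xs, ys, learning_rate=0.01, epochs=1000)
```

Changing the training data changes the learned `w`; changing `learning_rate` or `epochs` changes how the training process itself behaves.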

Unstructured Data (Text) (Untraditional ML Model)

Text Problem Formulation

Data Curation

Data Preparation & Wrangling

Data Cleansing

Remove HTML tags and other web markup (e.g., http links)

Remove punctuation (most, not all)

Remove numbers

Remove extra white space
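
The four cleansing steps above can be sketched with regular expressions. This is an illustration under stated assumptions: the function name `cleanse` and the sample text are hypothetical, and the choice to keep `%` and `$` while removing other punctuation is only one possible reading of "most, not all".

```python
import re


def cleanse(text):
    """Apply the four cleansing steps in order."""
    text = re.sub(r"<[^>]+>", " ", text)     # remove HTML tags
    text = re.sub(r"[^\w\s%$]", " ", text)   # remove punctuation, keeping % and $ (assumption)
    text = re.sub(r"\d+", " ", text)         # remove numbers
    text = re.sub(r"\s+", " ", text).strip() # collapse extra white space
    return text


raw = "<p>Revenue rose 12%,   beating   forecasts!</p>"  # hypothetical input
cleaned = cleanse(raw)
```

Here `cleaned` is `"Revenue rose % beating forecasts"`: the tags, comma, exclamation mark, and digits are gone, while the percent sign is retained.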

Data Preprocessing

Text normalization

Lowercasing, stop-word removal, stemming, lemmatization

Creation of bag-of-words (BOW)

A collection of the words in the text, with no record of their position or sequence

Building of document term matrix (DTM)

DTM is a matrix in which each row represents a document (or text file) and each column represents a term (or token).
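
The normalization steps can be sketched as below. Everything here is a toy stand-in: `STOP_WORDS`, `LEMMAS`, `SUFFIXES`, and `crude_stem` are hypothetical and far simpler than what a real project would use (for example, a Porter stemmer and a dictionary-based lemmatizer).

```python
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}  # tiny illustrative list
LEMMAS = {"rose": "rise", "profits": "profit"}  # toy lookup table, not a real lemmatizer
SUFFIXES = ("ing", "ed", "es", "s")             # checked in this order


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize_text(text, use_lemmas=False):
    """Lowercase, drop stop words, then stem or lemmatize each token."""
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    if use_lemmas:
        return [LEMMAS.get(t, t) for t in tokens]
    return [crude_stem(t) for t in tokens]


stemmed = normalize_text("The analysts are revising the forecasts")
lemmatized = normalize_text("Profits rose", use_lemmas=True)
```

Stemming yields `["analyst", "revis", "forecast"]`, which shows that a stem need not be a real word, whereas lemmatization maps each token to a dictionary form, giving `["profit", "rise"]`.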
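
A minimal sketch of building a BOW and a DTM from already normalized text. The three documents are invented, and splitting on white space stands in for proper tokenization.

```python
from collections import Counter

# Hypothetical documents, assumed to be cleansed and normalized already.
documents = [
    "profit rise profit",
    "loss rise",
    "profit loss loss",
]

tokenized = [doc.split() for doc in documents]

# Bag-of-words: the distinct tokens, with order and position discarded.
bag_of_words = sorted({token for tokens in tokenized for token in tokens})

# Document term matrix: one row per document, one column per term,
# each cell holding the number of times the term occurs in the document.
dtm = []
for tokens in tokenized:
    counts = Counter(tokens)
    dtm.append([counts[term] for term in bag_of_words])
```

With columns ordered as `["loss", "profit", "rise"]`, the rows are `[0, 2, 1]`, `[1, 0, 1]`, and `[2, 1, 0]`.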

Data Exploration

Exploratory Data Analysis (EDA)

Text Classification

Classify texts into different classes with supervised ML approaches

Topic Modeling

Group the texts into topic clusters with unsupervised ML approaches

Sentiment Analysis

Term Frequency (TF), used to create a word cloud
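
Term frequency can be sketched as each token's count divided by the total number of tokens, which is the definition assumed here. The token list is invented; in a word cloud, terms with higher TF would be drawn larger.

```python
from collections import Counter

# Hypothetical tokens pooled from all texts after normalization.
tokens = "profit rise profit loss profit rise".split()

counts = Counter(tokens)
total = sum(counts.values())

# TF = occurrences of the term / total number of tokens.
term_frequency = {term: count / total for term, count in counts.items()}

# The most frequent terms would dominate a word cloud.
top_terms = counts.most_common(2)
```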

Feature Selection

Frequency measure

Can be used to remove noisy features

Chi-square test

Can be used to test independence of two events

Mutual information
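
The chi-square statistic and mutual information can both be computed from a contingency table of term occurrence against class. The sketch below assumes a table of document counts in which rows are term present/absent and columns are the two classes; the function name and the counts are hypothetical, and every row and column total is assumed to be non-zero.

```python
import math


def chi_square_and_mi(table):
    """Return (chi-square statistic, mutual information in bits) for a count table."""
    total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    mi = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if term occurrence and class were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
            if observed > 0:
                # p(x, y) * log2(p(x, y) / (p(x) * p(y))), and the ratio equals observed / expected.
                mi += (observed / total) * math.log2(observed / expected)
    return chi2, mi


# Hypothetical counts: rows = term present / absent, columns = class A / class B.
associated = [[30, 10], [10, 30]]
independent = [[20, 20], [20, 20]]

chi2_assoc, mi_assoc = chi_square_and_mi(associated)
chi2_indep, mi_indep = chi_square_and_mi(independent)
```

For the independent table both measures are zero; for the associated table the chi-square statistic is 20 and the mutual information is about 0.19 bits, so the term would be a stronger candidate feature.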

Feature Engineering

Model Training