Mind Map Community: Machine Learning Course Summary
Mainly theory, covering optimization, generalization, supervised learning, unsupervised learning, and related topics.
Edited 2022-01-05 19:58:19
Machine Learning
Supervised Learning
Framework 1, 2
Training set, test set, validation set, loss function
Empirical loss, population loss
Optimization vs generalization
Memorization function
Cross validation
Overfit vs underfit, classical view vs modern view
Unsupervised learning
Semi-supervised learning
notice manifold assumption
Linear Method 6
Perceptron algorithm
Limitations and theoretical guarantees
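A minimal sketch of the perceptron update (assuming labels in {-1, +1}; on linearly separable data the mistake bound is roughly (R/γ)^2):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Classic perceptron: update whenever a point is misclassified.
    X: (n, d) features, y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi             # rotate the hyperplane toward the point
                b += yi
                mistakes += 1
        if mistakes == 0:                # found a separating hyperplane
            break
    return w, b
```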
Logistic regression
Cross entropy for probabilities
Motivation and intuitions
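For reference, the cross-entropy (negative log-likelihood) loss that logistic regression minimizes, with σ the sigmoid (standard formula):

```latex
\hat{p}_i = \sigma(w^{\top} x_i + b) = \frac{1}{1 + e^{-(w^{\top} x_i + b)}},
\qquad
L(w, b) = -\frac{1}{n}\sum_{i=1}^{n}\Big[y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\Big],
\quad y_i \in \{0, 1\}.
```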
Ridge Lasso SVM 9 10
Ridge regression: definition, intuition
Lasso regression: definition, intuition
The ℓ_1 'ball' is a diamond, so intersections with the loss contours tend to occur at its corners, i.e., sparse solutions
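A small sketch contrasting the two penalties (scikit-learn is used for convenience; the synthetic data are purely illustrative). The ℓ_1 penalty typically zeroes out coefficients, matching the diamond intuition above:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_w = np.zeros(20)
true_w[:3] = [3.0, -2.0, 1.5]                  # only 3 informative features
y = X @ true_w + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)             # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)             # L1 penalty: many coefficients exactly 0

print("ridge non-zeros:", np.sum(np.abs(ridge.coef_) > 1e-6))
print("lasso non-zeros:", np.sum(np.abs(lasso.coef_) > 1e-6))
```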
Compressed sensing 7/8/9
Compare with Lasso
RIP condition
Theory analysis for compressed sensing
Compressed sensing in non-linear setting
understand the result
SVM 9/10
Hard margin
Soft margin
Understand how to use kernels, and why we use them
What is kernel trick?
What does Mercer’s theorem mean?
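A sketch of the kernel trick: never build the feature map φ explicitly, only the Gram matrix K[i, j] = ⟨φ(x_i), φ(x_j)⟩. Here an RBF kernel is fed to a soft-margin SVM via a precomputed kernel (scikit-learn is a convenience assumption):

```python
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(A, B, gamma=0.5):
    """K[i, j] = exp(-gamma * ||A_i - B_j||^2): an inner product in an
    implicit (infinite-dimensional) feature space -- the kernel trick."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = np.where(np.linalg.norm(X, axis=1) > 1.0, 1, -1)   # not linearly separable

clf = SVC(kernel="precomputed", C=1.0).fit(rbf_kernel(X, X), y)
print("train accuracy:", (clf.predict(rbf_kernel(X, X)) == y).mean())
```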
Decision Tree, Random Forest, Boosting 17/18/19/20
Decision tree and Boolean function analysis
Understand why a decision tree can be converted into a sparse, low-degree polynomial
KM, LMN, Compressed sensing
Gini index, and how to use it to fit a decision tree
Random forest algorithm
Adaboost algorithm
boosting framework
Training error proof
generalization proof
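A minimal AdaBoost sketch with decision stumps as weak learners, following the standard reweighting rule (the scikit-learn stumps are a convenience assumption, not the course's code):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    """AdaBoost; y must be in {-1, +1}. Returns a prediction function."""
    n = len(y)
    D = np.full(n, 1.0 / n)                     # example weights
    stumps, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = np.clip(D[pred != y].sum(), 1e-10, 1 - 1e-10)  # weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)   # weak learner's vote
        D *= np.exp(-alpha * y * pred)          # up-weight the mistakes
        D /= D.sum()
        stumps.append(h)
        alphas.append(alpha)
    return lambda Xq: np.sign(sum(a * h.predict(Xq) for a, h in zip(alphas, stumps)))
```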
Gradient boosting interpretation
XGBoost
Unsupervised Learning
Framework 2
PCA 15
Definitions
computing the important directions of the data set
Different interpretations of PCA
notice we need to center the data points
Power method
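A sketch of the power method for the top principal direction (note the centering step, echoing the point above); it assumes a spectral gap so the iteration converges:

```python
import numpy as np

def top_principal_direction(X, n_iter=200, seed=0):
    """Power iteration on the sample covariance of X (rows are data points)."""
    Xc = X - X.mean(axis=0)                 # center the data first
    C = Xc.T @ Xc / len(Xc)                 # d x d sample covariance
    rng = np.random.default_rng(seed)
    v = rng.normal(size=C.shape[1])
    for _ in range(n_iter):
        v = C @ v                           # apply C ...
        v /= np.linalg.norm(v)              # ... and renormalize
    return v                                # ~ top eigenvector of C
```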
Nearest Neighbor 16/17
Problem formulation
NN -- RNN -- Randomized
Locality sensitive hashing
Understand the definition, proof and intuition
How to construct an LSH family for ℓ_2
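A sketch of one standard ℓ_2 LSH construction (a random Gaussian projection quantized into width-w buckets); the parameter values are illustrative assumptions, and in practice k such functions are concatenated and several independent tables are used:

```python
import numpy as np

class L2LSH:
    """h(x) = floor((a . x + b) / w) with a ~ N(0, I), b ~ Uniform[0, w).
    Nearby points (in l2) collide with higher probability than far points."""
    def __init__(self, dim, w=4.0, k=8, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(size=(k, dim))      # k random directions
        self.b = rng.uniform(0.0, w, size=k)    # random offsets
        self.w = w

    def hash(self, x):
        return tuple(np.floor((self.A @ x + self.b) / self.w).astype(int))
```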
Metric learning
NCA
LMNN
K-means/Spectral graph cluster 24/25
K-means
How to construct a graph, and how it generalizes standard clustering
Graph Laplacian
Theorem: the number of connected components equals the multiplicity of eigenvalue 0 of L
Spectral graph clustering algorithm
notice power method
The relationship between the algorithm and ratio cut problem
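A compact sketch of unnormalized spectral clustering: build a similarity graph, form L = D - W, embed the points with the bottom eigenvectors (eigenvalue 0 has multiplicity equal to the number of connected components), then run k-means. The k-NN graph and scikit-learn k-means are convenience assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import kneighbors_graph

def spectral_clustering(X, k, n_neighbors=10):
    W = kneighbors_graph(X, n_neighbors, mode="connectivity").toarray()
    W = np.maximum(W, W.T)                      # symmetric similarity graph
    L = np.diag(W.sum(axis=1)) - W              # unnormalized Laplacian L = D - W
    _, eigvecs = np.linalg.eigh(L)
    embedding = eigvecs[:, :k]                  # bottom-k eigenvectors as coordinates
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)
```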
t-SNE 26
SNE algorithm
Crowding problem
keep the relative ordering instead of exact distances, and allow points to shift to longer distances in the embedding
Why pick the Student's t-distribution?
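For reference, the similarity definitions behind SNE/t-SNE (standard formulas): Gaussian kernels in the original space, a heavy-tailed Student t with one degree of freedom in the embedding, which is exactly what relieves the crowding problem:

```latex
p_{j\mid i} = \frac{\exp\!\big(-\|x_i - x_j\|^2 / 2\sigma_i^2\big)}
                   {\sum_{k \ne i}\exp\!\big(-\|x_i - x_k\|^2 / 2\sigma_i^2\big)},
\qquad
p_{ij} = \frac{p_{j\mid i} + p_{i\mid j}}{2n},
\qquad
q_{ij} = \frac{\big(1 + \|y_i - y_j\|^2\big)^{-1}}
              {\sum_{k \ne l}\big(1 + \|y_k - y_l\|^2\big)^{-1}},
```

and the embedding coordinates y_i are chosen to minimize KL(P || Q).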
Optimization Theory
GD, SGD 3,4
Zeroth order, first order, second order methods
Smoothness, convexity, strong convexity
Hessian matrix, eigenvalue
Convergence analysis of GD on convex and smooth functions
Telescoping step
Limitation of GD, why use SGD
Batch size, convergence analysis of SGD
Optimization for training vs. generalization: is reaching the optimum necessary?
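A sketch of the telescoping argument for GD with step size η = 1/L on an L-smooth convex f (the standard O(1/T) rate, reproduced from memory):

```latex
f(x_{t+1}) \le f(x_t) - \tfrac{1}{2L}\,\|\nabla f(x_t)\|^2
\quad\text{(descent lemma)},
\qquad
f(x_{t+1}) - f(x^*) \le \tfrac{L}{2}\Big(\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2\Big)
\quad\text{(convexity + update rule)}.
```

Summing the second inequality over t = 0, ..., T-1 telescopes, and since the function values are non-increasing,

```latex
f(x_T) - f(x^*) \le \frac{1}{T}\sum_{t=0}^{T-1}\big(f(x_{t+1}) - f(x^*)\big)
               \le \frac{L\,\|x_0 - x^*\|^2}{2T}.
```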
SVRG 5
SVRG analysis
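For reference, the SVRG gradient estimator (a sketch of the standard update): keep a snapshot per epoch together with its full gradient, and use it to correct each stochastic gradient:

```latex
\tilde{\mu} = \nabla F(\tilde{w}) = \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\tilde{w}),
\qquad
v_t = \nabla f_{i_t}(w_t) - \nabla f_{i_t}(\tilde{w}) + \tilde{\mu},
\qquad
w_{t+1} = w_t - \eta\, v_t .
```

The estimator v_t is unbiased, and its variance shrinks as w_t and the snapshot approach the optimum, which is what removes the need for a decaying step size.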
Two layers NN 14/15
the relationship between the optimization proof for neural networks and the classical kernel method
Definition of Z matrix and H matrix
What’s the intuition
notice whether a_i is considered
Updating rule of f(x)-y
The implication and limitation of this type of analysis
Why H(t)≈H(0)
Convergence theorem
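One common form of the H matrix, for a two-layer ReLU network f(x) = (1/√m) Σ_r a_r σ(w_r^⊤ x) with the a_r fixed to ±1 (a sketch from memory; the course's exact notation may differ):

```latex
H_{ij}(t) = \frac{1}{m}\, x_i^{\top} x_j \sum_{r=1}^{m}
  \mathbb{1}\{w_r(t)^{\top} x_i \ge 0\}\,\mathbb{1}\{w_r(t)^{\top} x_j \ge 0\},
\qquad
\frac{d}{dt}\big(f(x_i) - y_i\big) = -\sum_{j} H_{ij}(t)\,\big(f(x_j) - y_j\big).
```

When the width m is large, H(t) stays close to H(0), so the residual f(x) - y decays roughly like a linear (kernel) system.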
Matrix completion, non-convex optimization 23 24
Non-convex analysis for GD
Matrix completion
Problem formulation
collaborative filtering
Low rank
Known entries are uniformly distributed
Otherwise, we may find adversarial examples, e.g., some columns/rows are empty
Incoherence
How to solve?
minimizing rank(A) is hard -- it amounts to minimizing the number of non-zero singular values
Alternating minimization
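A sketch of alternating minimization: factor M ≈ U Vᵀ and alternately solve ridge-regularized least squares for U and for V over the observed entries (dense and purely illustrative):

```python
import numpy as np

def altmin_complete(M, mask, rank, n_rounds=20, reg=1e-3):
    """mask[i, j] = True where M[i, j] is observed; returns a low-rank estimate."""
    n, m = M.shape
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n, rank))
    V = rng.normal(scale=0.1, size=(m, rank))
    for _ in range(n_rounds):
        for i in range(n):                          # fix V, solve for each row of U
            A = V[mask[i]]
            U[i] = np.linalg.solve(A.T @ A + reg * np.eye(rank), A.T @ M[i, mask[i]])
        for j in range(m):                          # fix U, solve for each row of V
            A = U[mask[:, j]]
            V[j] = np.linalg.solve(A.T @ A + reg * np.eye(rank), A.T @ M[mask[:, j], j])
    return U @ V.T
```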
Escape saddle points
Main theorem, basic assumptions
assume all local min are equally good
Not required: how to prove the theorem
Escape local min: not required
Generalization Theory
No Free Lunch 10
Understand the related proofs
PAC Learning 11
ERM algorithm
Bayes optimal
Agnostic PAC learning
VC Dimension 11,12
Only need to understand the definition, and how to use it
Rademacher Complexity 12 13 14
definitions
Understand the proofs, and simple example
Use basic lemmas to upper bound Rademacher complexity
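The empirical Rademacher complexity of a class F on a sample S = (x_1, ..., x_n), for reference (standard definition; a typical bound then says that, with high probability, population loss ≤ empirical loss + 2 × Rademacher complexity + a O(√(log(1/δ)/n)) term):

```latex
\widehat{\mathfrak{R}}_S(\mathcal{F})
  = \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{F}}
      \frac{1}{n}\sum_{i=1}^{n}\sigma_i f(x_i)\right],
\qquad \sigma_1,\dots,\sigma_n \ \text{i.i.d. uniform on } \{-1,+1\}.
```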
Other Topics
History of AI 1
Turing test
AI winters and AI booms
Reasons and lessons
Dataset Construction 2
ReCAPTCHA
Robust ML 20 21
Adv attack
Optimization formulation
FGSM
PGD algorithm
Adversarial training
Danskin’s theorem, what’s the implication?
Pick the maximizer of the inner loop for every iteration
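The formulations referenced above, written out as a sketch under an ℓ_∞ threat model:

```latex
\min_{\theta}\ \mathbb{E}_{(x,y)}\Big[\max_{\|\delta\|_\infty \le \epsilon}
    L(\theta, x + \delta, y)\Big]
\quad\text{(robust optimization / adversarial training)},
\qquad
x' = x + \epsilon \cdot \operatorname{sign}\big(\nabla_x L(\theta, x, y)\big)
\quad\text{(FGSM)},
\qquad
x^{(t+1)} = \Pi_{\|x' - x\|_\infty \le \epsilon}\Big(x^{(t)}
    + \alpha \cdot \operatorname{sign}\big(\nabla_x L(\theta, x^{(t)}, y)\big)\Big)
\quad\text{(PGD)}.
```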
Robust feature vs non-robust feature
Robust feature dataset
Cracking “robust” models
Commonly used techniques for creating falsely robust models
standard techniques for cracking them
Randomized smoothing technique and its proof (for ℓ_2, Gaussian)
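For reference, a sketch of the randomized smoothing certificate for ℓ_2 with Gaussian noise (stated from memory): the smoothed classifier takes a noisy majority vote, and the vote margin converts into a certified radius via the Gaussian CDF Φ:

```latex
g(x) = \arg\max_{c}\ \Pr_{\delta \sim \mathcal{N}(0,\sigma^2 I)}\big[f(x+\delta) = c\big];
\qquad
\text{if } \Pr[f(x+\delta)=A] \ge p_A \ \ge\ p_B \ge \max_{c \ne A}\Pr[f(x+\delta)=c],
```

then g returns A on the whole ℓ_2 ball of radius R = (σ/2)(Φ^{-1}(p_A) - Φ^{-1}(p_B)) around x.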
Hyperparameters 21 22 23
Bayesian optimization (understand the procedure)
Gradient descent
How to store a long chain of backpropagation
Using SGD with momentum term
Random search
Best arm identification analysis
Guarantee for SH
apply SH (successive halving) in hyperparameter tuning
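A sketch of successive halving for hyperparameter tuning: run all configurations on a small budget, keep the better half, double the per-configuration budget, and repeat; `train_and_eval` is a placeholder for whatever training/validation routine is available:

```python
def successive_halving(configs, train_and_eval, min_budget=1):
    """configs: list of hyperparameter settings (arms).
    train_and_eval(config, budget) -> validation loss (lower is better)."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: train_and_eval(c, budget))
        survivors = ranked[: max(1, len(survivors) // 2)]   # keep the better half
        budget *= 2                                          # more budget per survivor
    return survivors[0]
```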
Main idea of ProxylessNAS
How to automatically tune the depth?
What was the major challenge that this algorithm solved?
DP 26/27
Definition
Why is randomness essential?
If not: assume we have a non-trivial deterministic algorithm, and suppose there are two databases A and B on which the algorithm gives different outputs for some query. We can change one row at a time to get from A to B, so at some point there is a pair of databases differing in only a single row whose outputs differ. An adversary who knows the database is one of these two almost-identical databases then learns the value of that row.
Immunity to post-processing
Promise of DP mechanisms
Utility does not get much worse
Laplace mechanism, theoretical guarantees.
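A sketch of the Laplace mechanism: to release f(D) with ε-differential privacy, add Laplace noise with scale Δf/ε, where Δf is the ℓ_1 sensitivity of f; the counting query below is just an illustration:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value + Lap(sensitivity / epsilon); this is epsilon-DP for a
    query whose l1 sensitivity (max change from one differing row) is `sensitivity`."""
    rng = rng or np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# A counting query changes by at most 1 when a single row changes.
rows = [1, 0, 1, 1, 0, 1]
noisy_count = laplace_mechanism(sum(rows), sensitivity=1.0, epsilon=0.5)
```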
Interpretability 27
LIME algorithm
Limitations
Why do we need a baseline?
Fundamental axioms for attribution
IntegratedGrads algorithm
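The Integrated Gradients attribution for coordinate i relative to a baseline x' (standard formula; the completeness identity below is one of the axioms it satisfies):

```latex
\mathrm{IG}_i(x) = (x_i - x'_i)\int_{0}^{1}
    \frac{\partial F\big(x' + \alpha (x - x')\big)}{\partial x_i}\, d\alpha,
\qquad
\sum_i \mathrm{IG}_i(x) = F(x) - F(x').
```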
ML-assisted algorithm design 27
Learning augmented algorithm
When shall we use learning augmented algorithms?
Three types of learning augmented algorithms
Exponential search, why is it better than binary search?
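A sketch of exponential (doubling) search around a predicted index in a sorted array: with a good prediction it costs about O(log |error|) comparisons instead of O(log n); all names here are illustrative:

```python
import bisect

def search_with_prediction(arr, target, predicted_idx):
    """Grow a window geometrically around predicted_idx until it brackets the
    target, then binary search inside the window."""
    n = len(arr)
    lo = hi = max(0, min(predicted_idx, n - 1))
    step = 1
    while lo > 0 and arr[lo] > target:       # expand left
        lo = max(0, lo - step)
        step *= 2
    step = 1
    while hi < n - 1 and arr[hi] < target:   # expand right
        hi = min(n - 1, hi + step)
        step *= 2
    i = bisect.bisect_left(arr, target, lo, hi + 1)
    return i if i < n and arr[i] == target else -1
```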
Neural architectures
Stochastic depth
Resnet vs densenet