Deep Neural Networks
A summary of deep neural networks, covering architectures, activation functions, loss functions, solutions for specialized subdomains, and more
Evaluation Metrics
Classification
Definitions
True Positives : The cases in which we predicted YES and the actual output was also YES
False Positives : The cases in which we predicted YES and the actual output was NO
False Negatives : The cases in which we predicted NO and the actual output was YES.
True Negatives : The cases in which we predicted NO and the actual output was NO.
Concepts
Precision
Number of items correctly identified as positive out of total items identified as positive
TP/(TP+FP)
How accurate am I when I say Positive
Recall / Sensitivity / True Positive Rate
Number of items correctly identified as positive out of all actual positives
TP/(TP+FN)
How well can I tell Positive when it should be Positive
Type 1 Error / False Positive Rate
Number of items wrongly identified as positive out of all actual negatives
FP / (FP +TN)
Type 2 Error / False Negative Rate
Number of items wrongly identified as negative out of all actual positives
FN / (FN + TP)
Specificity / True Negative Rate
Number of items correctly identified as negative out of total negatives
TN / (TN +FP)
Reading precision and recall together
High precision + high recall: the model detects the class well
High precision + low recall: the model does not detect the class well, but when it does detect it, the result is highly reliable
Low precision + high recall: the model detects the class well, but its detections also include points from other classes
Low precision + low recall: the model does not detect the class well
Methods
Classification Accuracy
F1 Score
Harmonic mean of Precision and Recall, combining the two into a single metric (see the code sketch after this Methods list)
2×precision×recall / (precision + recall)
Confusion Matrix
Receiver Operating Characteristic (ROC) Curve
Sensitivity vs Specificity
Sensitivity: True Positive Rate TP / (FN + TP)
Specificity: True Negative Rate TN / (TN + FP)
Area Under Curve (AUC)
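As a worked example of the definitions and formulas above, a minimal Python sketch that computes these classification metrics directly from confusion-matrix counts (the TP/FP/FN/TN values are made up for illustration):

```python
# Classification metrics computed directly from confusion-matrix counts.
# The counts below are made-up example values, not from any real model.
TP, FP, FN, TN = 40, 10, 5, 45

precision   = TP / (TP + FP)          # how accurate am I when I say Positive
recall      = TP / (TP + FN)          # sensitivity / true positive rate
specificity = TN / (TN + FP)          # true negative rate
fpr         = FP / (FP + TN)          # Type 1 error rate
fnr         = FN / (FN + TP)          # Type 2 error rate
accuracy    = (TP + TN) / (TP + FP + FN + TN)
f1          = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} "
      f"specificity={specificity:.3f} accuracy={accuracy:.3f}")
```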
Regression
Metrics
RMSE
Root Mean Squared Error
MAE
Mean Absolute Error
R Squared
1 − MSE / Var(Y): the numerator of the subtracted fraction is the MSE (average of the squared residuals) and the denominator is the variance of the Y values
Adjusted R^2
Tips
RMSE >= MAE
Important distinction between MAE & RMSE
minimizing the squared error over a set of numbers results in finding its mean, and minimizing the absolute error results in finding its median
This is the reason why MAE is robust to outliers whereas RMSE is not
RMSE is still the default metric
a loss function defined in terms of RMSE is smoothly differentiable
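A small NumPy sketch of the regression metrics above (the arrays are illustrative; note that RMSE >= MAE always holds):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # illustrative ground-truth values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # illustrative predictions

mae  = np.mean(np.abs(y_true - y_pred))                      # Mean Absolute Error
mse  = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)                                          # Root Mean Squared Error
r2   = 1.0 - mse / np.var(y_true)                            # R^2 = 1 - MSE / Var(Y)
msle = np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)   # Mean Squared Log Error

print(rmse, mae, r2, msle)
```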
Loss
Definitions
Loss function
Usually a function defined on a data point, prediction and label, and measures the penalty
Cost function
Usually more general. It might be a sum of loss functions over your training set plus some model complexity penalty (regularization)
Objective function
most general term for any function that you optimize during training
Classification
Loss function for classification
Losses
Log Loss
a special case of cross entropy loss where the number of classes n=2
Focal Loss
Class imbalance issue in object detection
Added a weighted term in front of cross entropy
Cross Entropy Loss: CE(p_t) = −log(p_t); Focal Loss: FL(p_t) = −(1 − p_t)^γ log(p_t) (see the sketch after this Losses list)
Binary Cross Entropy: BCE = −[y log(p) + (1 − y) log(1 − p)]
Binary Case
KL Divergence / Cross Entropy
Binary Cross-Entropy
Multi-class Cross-Entropy
KL Divergence
Exponential Loss
Hinge Loss
Hinge
Squared Hinge
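A minimal NumPy sketch of binary cross-entropy (log loss) and the focal-loss weighting described above; gamma = 2 and alpha = 0.25 are commonly quoted defaults, used here only as illustrative values:

```python
import numpy as np

def binary_cross_entropy(y_true, p):
    """Log loss: the n=2 special case of cross entropy."""
    p = np.clip(p, 1e-7, 1 - 1e-7)            # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def focal_loss(y_true, p, gamma=2.0, alpha=0.25):
    """Cross entropy with a (1 - p_t)^gamma weighting term that
    down-weights easy examples (used against class imbalance)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    p_t = np.where(y_true == 1, p, 1 - p)      # probability of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    return -np.mean(alpha_t * (1 - p_t) ** gamma * np.log(p_t))

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.6, 0.3])
print(binary_cross_entropy(y, p), focal_loss(y, p))
```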
Regression
Losses
Quantile Loss
Mean Squared Error / Quadratic Loss / L2
MSE
L2
Mean Absolute Error / L1
MAE
L1
Huber Loss / Smooth MAE Loss
Approaches MAE as sigma → 0
Approaches MSE as sigma → ∞
Log Cosh Loss
Smoother than L2
Mean Squared Logarithmic Error
usually used when you do not want to penalize huge differences in the predicted and the actual values when both predicted and true values are huge numbers
both predicted and actual values are small
MSE = MSLE
either predicted or the actual value is big
MSE > MSLE
both predicted and actual values are big
MSE > MSLE (MSLE is very small)
Key Takeaways
MSE vs MAE (L2 vs L1)
MSE easier to solve, stable and closed form solution, but sensitive to outliers
MAE more robust to outliers
MAE useful when data has more outliers
Problem: the gradient stays large even for small losses, and the derivative is not continuous at 0
Huber loss
Less sensitive to outliers than MSE
Differentiable at 0, unlike MAE
Problem: Choice of sigma is critical, need to find good hyperparameter
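A NumPy sketch of the regression losses compared above; `delta` plays the role of the sigma threshold mentioned under Huber loss, and the data includes one outlier to show the L1/L2 difference:

```python
import numpy as np

def mse(y, yhat):      return np.mean((y - yhat) ** 2)           # L2
def mae(y, yhat):      return np.mean(np.abs(y - yhat))           # L1
def log_cosh(y, yhat): return np.mean(np.log(np.cosh(yhat - y)))  # smoother than L2

def huber(y, yhat, delta=1.0):
    """Quadratic near zero (like MSE), linear for large errors (like MAE)."""
    err = np.abs(y - yhat)
    quad = 0.5 * err ** 2
    lin  = delta * err - 0.5 * delta ** 2
    return np.mean(np.where(err <= delta, quad, lin))

y    = np.array([1.0, 2.0, 3.0, 100.0])    # last point is an outlier
yhat = np.array([1.1, 1.9, 3.2, 10.0])
print(mse(y, yhat), mae(y, yhat), huber(y, yhat), log_cosh(y, yhat))
```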
GAN
Losses
Original GAN
Jensen Shannon Divergence (JSD)
Problems
Vanishing gradient
Hard to achieve Nash equilibrium
Mode collapse
minibatch discrimination
Feature Matching
Improved GAN Training
WGAN
Earth-Mover (EM) distance (AKA Wasserstein distance)
Why Wasserstein is better than JS or KL divergence
Even when two distributions are located in lower dimensional manifolds without overlaps, Wasserstein distance can still provide a meaningful and smooth representation of the distance in-between.
Implementation
WGAN-GP
gradient penalty
LSGAN
matching their generated distribution to a real data distribution
Other
Cosine Proximity ?
Poisson
Temporal
CTC (Connectionist Temporal Classification)
Solves the problem of not having the alignment of each character in the sequence
Only requires an audio file (video) as input and a corresponding transcription
introducing a pseudo-character (blank)
when encoding a text, we can insert arbitrary many blanks at any position, which will be removed when decoding it
Loss calculation
The score for one alignment (or path, as it is often called in the literature) is calculated by multiplying the corresponding character scores together
To get the score for a given GT text, we sum over the scores of all paths corresponding to this text
E.g. for "a", sum over "aa" + "a-" + "-a" (see the sketch after the Decoding list)
Decoding
best path decoding (greedy)
it calculates the best path by taking the most likely character per time-step
first removing duplicate characters and then removing blanks
Duplicated letters must have a blank in between
beam-search decoding
prefix-search decoding
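A toy sketch of the two CTC ideas above: summing the scores of all alignments ("aa", "a-", "-a") that collapse to the ground-truth text "a" over two time steps, and best-path (greedy) decoding. The probability matrix is made up:

```python
import numpy as np

# Toy per-time-step distribution over the alphabet {a, b, -} ('-' = blank).
# Rows are time steps; values are made up for illustration.
chars = ['a', 'b', '-']
probs = np.array([[0.6, 0.1, 0.3],
                  [0.5, 0.2, 0.3]])

# Score of GT text "a" = sum over all alignments that collapse to "a":
# multiply the per-step scores within a path, then sum over the paths.
score_a = (probs[0, 0] * probs[1, 0]      # "aa"
           + probs[0, 0] * probs[1, 2]    # "a-"
           + probs[0, 2] * probs[1, 0])   # "-a"
print("P('a') =", score_a)

def best_path_decode(probs, chars, blank='-'):
    """Greedy decoding: most likely char per step, collapse duplicates, drop blanks."""
    best = [chars[i] for i in probs.argmax(axis=1)]
    collapsed = [c for i, c in enumerate(best) if i == 0 or c != best[i - 1]]
    return ''.join(c for c in collapsed if c != blank)

print(best_path_decode(probs, chars))   # -> "a"
```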
Architecture
Attention
Transformer
Image Transformer
Attention Type
Bahdanau Attention
For each decoder output step, repeat (sketched in code after step 5)
1. Producing the Encoder Hidden States
Encoder produces hidden states of each element in the input sequence
2. Calculating Alignment Scores
Scores between the previous decoder hidden state and each of the encoder's hidden states are calculated
3. Softmaxing the Alignment Scores
the alignment scores for each encoder hidden state are combined and represented in a single vector and subsequently softmaxed
4. Calculating the Context Vector
the encoder hidden states and their respective alignment scores are multiplied to form the context vector
5. Decoding the Output
the context vector is concatenated with the previous decoder output and fed into the Decoder RNN
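A minimal NumPy sketch of one attention step following the five stages above, using additive (Bahdanau-style) scoring; the dimensions and weight matrices are random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                       # 5 encoder time steps, hidden size 8 (illustrative)
enc_h = rng.normal(size=(T, d))   # 1. encoder hidden states
dec_h = rng.normal(size=(d,))     # previous decoder hidden state

# 2. additive (Bahdanau-style) alignment scores
W_enc, W_dec, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(d,))
scores = np.tanh(enc_h @ W_enc + dec_h @ W_dec) @ v       # shape (T,)

# 3. softmax the scores into attention weights
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# 4. context vector = attention-weighted sum of encoder hidden states
context = weights @ enc_h                                  # shape (d,)

# 5. the context vector is then concatenated with the previous decoder
#    output and fed into the decoder RNN (not shown here)
print(weights.round(3), context.shape)
```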
Luong Attention
Differences from Bahdanau
calculating alignment score
position at which the Attention mechanism is being introduced in the decoder
Self-attention (intra-attention)
Soft vs Hard Attention
Global vs Local Attention
Global
use ALL encoder hidden states
Local
attends to only a few hidden states that fall within a smaller window
Problems
Memory consuming
Model is large
Input hidden state, context vector
Attention for CNN
Attention based CNN
RNN
Seq2seq
Encoder-Decoder
With Attention
Seq2seq problem
Incapable of remembering long sentences
Beam search decoding?
CNN
Backbone
BlazeNet
Regression
Detector
Separable Bottleneck
MobileNet
V1
Bottleneck Residual Block
MobileNet V2
Key components
Inverted Residual Block
More memory efficient than residual block
Inverted Linear Residual Block
The last conv of the block uses a linear output before being added to the initial activations
ReLU6
guarantees precision to the right of the decimal point
MobileNet V3
Notes
V1 vs V2
Classic residual block: wide -> narrow -> wide
V2 inverted residual block: narrow -> wide -> narrow
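A minimal PyTorch sketch (assuming torch is available) of a MobileNetV2-style inverted residual block — narrow -> wide -> narrow, with a linear 1x1 projection before the shortcut addition; the channel count and expansion factor are illustrative:

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """narrow -> wide -> narrow, with a linear (no ReLU) projection at the end."""
    def __init__(self, channels=32, expansion=6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),           # expand (pointwise)
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1,
                      groups=hidden, bias=False),                  # depthwise 3x3
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),            # project (linear)
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)   # identity shortcut (stride 1, same channels)

x = torch.randn(1, 32, 56, 56)
print(InvertedResidual()(x).shape)   # torch.Size([1, 32, 56, 56])
```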
ShuffleNet
ResNet
Ideas
Identity Shortcut Connections
Problem: as network depth increases, accuracy saturates and then degrades rapidly
Intuition: a deeper model shouldn't perform worse than its shallow counterpart, because it could simply use identity mappings to skip the added layers
Residual Block
ResNet101 vs ResNet50
Each ResNet block is either 2 layers deep (used in smaller networks like ResNet-18/34) or 3 layers deep (ResNet-50/101/152)
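A minimal PyTorch sketch of the basic 2-layer residual block with an identity shortcut (the ResNet-18/34 variant; ResNet-50/101/152 use the 3-layer bottleneck instead); channel sizes are illustrative:

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """2-layer residual block: out = ReLU(F(x) + x), identity shortcut."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1   = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # skip connection: the block can fall back to identity

x = torch.randn(1, 64, 56, 56)
print(BasicResidualBlock()(x).shape)   # torch.Size([1, 64, 56, 56])
```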
ResNeXt
Cardinality
the number of independent paths, to provide a new way of adjusting the model capacity
AlexNet
Layers
MaxPooling
MaxUnpooling
Fully connected
Bilinear Upsampling
Softmax
Convolution
Spatially Separable Convolution
Although spatially separable convolutions save cost, they are rarely used in deep learning. One of the main reasons is that not every kernel can be factored into two smaller kernels. If we replace all traditional convolutions with spatially separable convolutions, we restrict the set of kernels that can be learned during training, so the results may be sub-optimal.
Depthwise Conv2D
Each filter operates on a single channel
Point-wise Conv2D
1x1 kernels
Depthwise Separable Conv
Depthwise + Pointwise
1) Spatial Convolution on each channel (depthwise)
2) Then point-wise convolution
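A PyTorch sketch contrasting a standard 3x3 convolution with its depthwise-separable factorization (depthwise 3x3 per channel followed by a pointwise 1x1), including a parameter-count comparison; the channel sizes are illustrative:

```python
import torch
import torch.nn as nn

cin, cout = 64, 128                       # illustrative channel sizes

standard = nn.Conv2d(cin, cout, kernel_size=3, padding=1)

separable = nn.Sequential(
    nn.Conv2d(cin, cin, kernel_size=3, padding=1, groups=cin),  # depthwise: one filter per channel
    nn.Conv2d(cin, cout, kernel_size=1),                         # pointwise: 1x1 across channels
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))   # the separable version uses far fewer parameters
```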
Transposed Convolution (Deconvolution)
checkerboard artifacts
Conv2D
Strides
usually 2 for downsampling
Padding
how border is handled
Kernel Size
Input/Output Channels
Atrous conv / Dilated conv
Large receptive field without additional cost
3D conv
Group Conv
AlexNet
allows training the network across two GPUs with limited memory (3 GB per GPU)
The filters are separated into groups; each group performs a conventional 2D convolution over its subset of the input channels
advantages
Efficient training: since the convolutions are divided into several paths, each path can be handled by a different GPU
Model parameters decrease as the number of filter groups increases
Grouped convolution may provide a better model than a nominal 2D convolution
The extreme case is depthwise convolution (one group per channel)
Shuffled Grouped Convolutions
Shuffle Net
Channel Shuffling
Spatial Transformer
Components
Sampler
Grid Generator
Localization Net
Thin Plate Spline (TPS) Transformation
NAS
Autoencoder
Variational Autoencoder
their latent spaces are, by design, continuous, allowing easy random sampling and interpolation
The encoder does not output a single encoding vector of size n; instead it outputs two vectors of size n: a vector of means, μ, and a vector of standard deviations, σ
even for the same input, while the mean and standard deviations remain the same, the actual encoding will somewhat vary on every single pass simply due to sampling
Loss
KL divergence
Intuitively, this loss encourages the encoder to distribute all encodings (for all types of inputs, eg. all MNIST numbers), evenly around the center of the latent space. If it tries to “cheat” by clustering them apart into specific regions, away from the origin, it will be penalized.
reconstruction loss
An equilibrium is reached between the cluster-forming nature of the reconstruction loss and the dense-packing nature of the KL loss, forming distinct clusters that the decoder can decode
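A NumPy sketch of the two VAE loss terms described above — a reconstruction term plus the KL term that pulls each per-input Gaussian toward N(0, I) — together with the reparameterized sampling that makes the encoding vary between passes. All shapes and values are illustrative stand-ins (the decoder is faked with random numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16                                   # latent dimension (illustrative)

x      = rng.random(784)                 # a flattened input, e.g. an MNIST image
mu     = rng.normal(size=n)              # encoder output: vector of means
logvar = rng.normal(size=n)              # encoder output: log of variances

# Reparameterization: same (mu, sigma), but a different sample on every pass
eps = rng.normal(size=n)
z   = mu + np.exp(0.5 * logvar) * eps

x_hat = rng.random(784)                  # stand-in for decoder(z)

recon = np.mean((x - x_hat) ** 2)        # reconstruction loss (MSE here)
kl    = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))   # KL(N(mu, sigma) || N(0, I))
loss  = recon + kl
print(recon, kl, loss)
```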
Conditional VAE
Loss
Reconstruction Loss
Problem with Standard Autoencoder
The latent space they convert their inputs to, where the encoded vectors lie, may not be continuous or allow easy interpolation
But when you build a generative model, you want to randomly sample from the latent space, or generate variations of an input image, which requires a continuous latent space
Domains
CV
Detection
Two Stage Models
R-CNN
Selective Search
Resize each proposed region to the same size -> CNN -> Classifier
SPPNet
Spatial Pyramid Pooling
Fast R-CNN
ROI Pooling
End to end training
Multi-task loss
Faster R-CNN
Region Proposal Network (RPN)
Replaced Selective Search
Mask R-CNN
RoIAlign
Fully Convolutional Network(FCN) as second branch
One Stage Models
YOLOv1
SSD
YOLOv2
RetinaNet
Focal Loss
One-stage detectors suffer from class imbalance
Anchor Free Models
Which is better?
One stage usually worse than Two stage
due to Class Imbalance
Selective Search or RPN can filter out many background samples
Focal Loss
Challenges
Segmentation
Instance Segmentation
Semantic Segmentation
Fully Convolutional Network (FCN)
PAMI 2016
end to end convolutional networks
Pretrain ImageNet Classification
Finetune on segmentation (Per pixel loss)
Upsample using deconvolutional layers
Introduce skip connections to improve over the coarseness of upsampling
Also pretrain on ImageNet
SegNet (2015)
Encoder-Decoder
Maxpooling indices transferred to decoder to improve the segmentation resolution
Using max-pooling indices instead of full encoder feature maps makes it more efficient than FCN
UNet
MICCAI 2015
Multi-Scale Context + Dilated Convolution
ICLR 2016
Dilated Conv
Pooling layers designed for classification are not ideal for segmentation
Last two pooling layers (of VGG) removed
Subsequent convs replaced by dilated convs
exponential increase in field of view without decrease of spatial dimensions
Context Module
cascade of dilated convolutions of different dilations so that multi scale context is aggregated
Also pretrain on ImageNet
DeepLab V1&V2
PAMI 2017
Atrous Conv (Dilated Conv)
Atrous Spatial Pyramid Pooling
Fully connected CRF
Also pretrain on ImageNet
DeepLabV3
DeeplabV3+
FastFCN
Joint Pyramid Upsampling (JPU)
Replace Dilated Conv
Dilated convs consume a lot of time and memory
RefineNet
PSPNet
Video Object Segmentation
GAN
Image Generation
GAN
DCGAN
Deep convolutional generative adversarial networks
AC-GAN
multiscale structural similarity (MS-SSIM)
multi-scale variant of a well-characterized perceptual similarity metric that attempts to discount aspects of an image that are not important for human perception
ProgressiveGAN
WGAN-GP
AC-GAN
StyleGAN
BigGAN
MSG-GAN
Multi-Scale Gradient GAN
cGAN
CycleGAN
Pix2Pix
PatchGAN
each pixel in the output maps to a 70x70 patch in the input image
StarGAN
ACGAN (Auxiliary Classifier GAN)
DAGAN (Data Augmentation GAN)
Style Transfer
Pose Estimation
Method
Heatmap
Regression
Metrics
PCK (Percentage of Correct Keypoints)
A detected joint is considered correct if the distance between the predicted and the true joint is within a certain threshold.
PCP Percentage of Correct Parts
A limb is considered detected (a correct part) if the distance between the two predicted joint locations and the true limb joint locations is less than half of the limb length (Commonly denoted as PCP@0.5)
Percentage of Detected Joints - PDJ
A detected joint is considered correct if the distance between the predicted and the true joint is within a certain fraction of the torso diameter
Object Keypoint Similarity (OKS) based mAP
Papers
DeepPose (CVPR 14)
L2 Regression
AlexNet Backbone
cascaded regressors
Images are cropped around the predicted joint and fed to the next stage
subsequent pose regressors see higher resolution images and thus learn features for finer scales which ultimately leads to higher precision
Efficient Object Localization Using Convolutional Networks (CVPR 15)
output is a discrete heatmap instead of continuous regression
A heatmap predicts the probability of the joint occurring at each pixel
additional convolutional model for fine-tuning
a graphical model learns typical spatial relationships between joints
Convolutional Pose Machines (CVPR 16)
Tracking
DeepPursuit 2D Approach
3D Perception
Point Cloud
PointNet
Permutation (Order) Invariance
ordering of points doesn't impact the underlying geometry
A shared Multi-layer perceptron maps each point to 64 dimension feature
the mapping is identical for and independent across the n points
Transformation Invariance
rotation
scaling
translation
T-Net
apply an appropriate rigid or affine transformation to achieve pose normalization
applying a geometric transformation simply amounts to matrix multiplying each point with a transformation matrix
no local feature
PointNet++
hierarchical feature learning layer
local features
3D Voxels
Mesh
GraphCNN
Multi-Task Learning
Which Tasks Should Be Learned Together in Multi-task Learning?
Transfer Learning
Metric Learning
Reinforcement Learning
NLP
BERT
Regularization
L1
L2
Dropout
Weight regularization?
Bias regularization?
Data Augmentation
Early Stopping
Optimization
Backpropagation
Optimizer
BGD
batch gradient descent
has to calculate the cost over all training examples in the dataset for every update
SGD
Momentum
converge faster
adds a temporal element (the history of past gradients) to the parameter-update equation
infers the overall direction from past directions, e.g. when updates oscillate up and down the y-axis while the real progress is horizontal (see the update-rule sketch at the end of the Optimizer list)
SGD + Momentum
To speed up SGD
Adaptive Learning Rates
AdaGrad
RMSProp
Adam (Adaptive Moment Estimation)
uses Momentum and Adaptive Learning Rates
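A NumPy sketch of the update rules above: SGD with momentum (accumulating a direction from past gradients) and Adam (momentum plus a per-parameter adaptive learning rate). The hyperparameters are the commonly used defaults; the weights and gradient are made up:

```python
import numpy as np

def sgd_momentum_step(w, grad, v, lr=0.01, beta=0.9):
    """v remembers past directions; damps oscillation, speeds up the consistent axis."""
    v = beta * v + grad
    return w - lr * v, v

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """First moment (momentum) + second moment (adaptive per-parameter learning rate)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)              # bias correction
    v_hat = v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

w = np.array([1.0, -2.0]); grad = np.array([0.3, -0.1])
w, v_mom = sgd_momentum_step(w, grad, np.zeros_like(w))
w, m, v = adam_step(w, grad, np.zeros_like(w), np.zeros_like(w), t=1)
print(w)
```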
Normalization
Batch Norm
At inference time
no batch at inference
use population mean/variance
the law of large numbers
At training time
moving average of the mean/variance (see the sketch after this Batch Norm block)
Before/After activation
Original paper: before activation
seems comparable or better after activation according to many experiments
even Christian Szegedy now likes to perform BatchNorm after the ReLU
why not BN?
small batch size
Recurrent connections in an RNN
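A NumPy sketch of the train/inference behavior described above: batch statistics plus a moving-average (running) estimate during training, and the stored population statistics at inference:

```python
import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               training, momentum=0.1, eps=1e-5):
    """x: (batch, features). Returns normalized x and updated running stats."""
    if training:
        mean, var = x.mean(axis=0), x.var(axis=0)        # batch statistics
        running_mean = (1 - momentum) * running_mean + momentum * mean
        running_var  = (1 - momentum) * running_var  + momentum * var
    else:
        mean, var = running_mean, running_var             # population estimates
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta, running_mean, running_var

x = np.random.default_rng(0).normal(size=(32, 4))
gamma, beta = np.ones(4), np.zeros(4)
rm, rv = np.zeros(4), np.ones(4)
y, rm, rv = batch_norm(x, gamma, beta, rm, rv, training=True)
y_inf, _, _ = batch_norm(x, gamma, beta, rm, rv, training=False)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))
```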
Instance Norm
network should be agnostic to the contrast of the original image
performs well on style transfer when replacing batch normalization
Spectral Normalization
GAN
VITON: An Image-based Virtual Try-on Network
Performance
Distillation
Quantization
Hardware
GPU
Activation Functions
Sigmoid
Real number -> [0, 1]
No longer widely used
Saturates and kills gradients: when the absolute value of the input is large, the gradient in the tails is almost 0, so during backpropagation the chain rule effectively kills all of the nodes before it; for example, if the initial weights are large, all of the neurons may saturate
Not zero centered
Similar to Neuron activation
Tanh
Real number -> [-1, 1]
Zero centered, but still have saturate problem as sigmoid
ReLU (Rectified Linear Unit)
Advantages
Accelerates the convergence of stochastic gradient descent
Computationally simpler and cheaper than the two activations above
Disadvantages
Fragile and can die: if a large gradient flows through, the unit's output may get stuck at 0 and never update again
For example, with a learning rate that is too large, as many as 40% of the units can die
Choosing an appropriate learning rate mitigates this to some extent
ReLU6
the upper bound encouraged their model to learn sparse features earlier
Leaky ReLU
Addresses the dying-ReLU problem by using a small negative slope for x < 0
The consistency of its benefit has not yet been verified
PreLU (Parameterized ReLU)
Maxout
A nonlinear function that generalizes ReLU and Leaky ReLU
The drawback is that it doubles the number of parameters
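A NumPy sketch of the activations above and their gradients, illustrating why sigmoid/tanh saturate (gradients near zero for large |x|), why ReLU units can die (gradient exactly 0 for x < 0), and how Leaky ReLU avoids that:

```python
import numpy as np

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

sigmoid = 1 / (1 + np.exp(-x))
tanh    = np.tanh(x)
relu    = np.maximum(0, x)
relu6   = np.clip(x, 0, 6)
leaky   = np.where(x > 0, x, 0.01 * x)         # small negative slope for x < 0

d_sigmoid = sigmoid * (1 - sigmoid)            # nearly 0 at x = +/-10: saturation
d_tanh    = 1 - tanh**2                        # also saturates, but output is zero-centered
d_relu    = (x > 0).astype(float)              # exactly 0 for x < 0: units can "die"
d_leaky   = np.where(x > 0, 1.0, 0.01)         # never exactly 0

print(d_sigmoid.round(5), d_relu, d_leaky)
```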