Deep Neural Networks
A summary of deep neural networks, covering architectures, activation functions, loss functions, solutions for specialized subdomains, and more
Evaluation Metrics
Classification
Definitions
True Positives : The cases in which we predicted YES and the actual output was also YES
False Positives : The cases in which we predicted YES and the actual output was NO
False Negatives : The cases in which we predicted NO and the actual output was YES.
True Negatives : The cases in which we predicted NO and the actual output was NO.
Concepts
Precision
Number of items correctly identified as positive out of total items identified as positive
TP/(TP+FP)
How accurate am I when I say Positive
Recall / Sensitivity / True Positive Rate
Number of items correctly identified as positive out of all actual positives
TP/(TP+FN)
How well can I tell Positive when it should be Positive
Type 1 Error / False Positive Rate
Number of items wrongly identified as positive out of all actual negatives
FP / (FP +TN)
Type 2 Error / False Negative Rate
Number of items wrongly identified as negative out of all actual positives
FN / (FN + TP)
Specificity / True Negative Rate
Number of items correctly identified as negative out of total negatives
TN / (TN +FP)
Reading precision and recall together
High precision + high recall: the model detects the class well
High precision + low recall: the model does not detect the class well, but when it does detect it, the result is highly reliable
Low precision + high recall: the model detects the class well, but its detections also include points from other classes
Low precision + low recall: the model does not detect the class well
Methods
Classification Accuracy
F1 Score
Harmonic mean of Precision and Recall, combining the two into a single metric (see the code sketch after this Methods list)
2×precision×recall / (precision + recall)
Confusion Matrix
Receiver Operating Characteristic (ROC) Curve
Sensitivity vs Specificity
Sensitivity: True Positive Rate TP / (FN + TP)
Specificity: True Negative Rate TN / (TN + FP)
Area Under Curve (AUC)
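As a worked example of the definitions and formulas above, a minimal Python sketch that computes these classification metrics directly from confusion-matrix counts (the TP/FP/FN/TN values are made up for illustration):

```python
# Classification metrics computed directly from confusion-matrix counts.
# The counts below are made-up example values, not from any real model.
TP, FP, FN, TN = 40, 10, 5, 45

precision   = TP / (TP + FP)          # how accurate am I when I say Positive
recall      = TP / (TP + FN)          # sensitivity / true positive rate
specificity = TN / (TN + FP)          # true negative rate
fpr         = FP / (FP + TN)          # Type 1 error rate
fnr         = FN / (FN + TP)          # Type 2 error rate
accuracy    = (TP + TN) / (TP + FP + FN + TN)
f1          = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} "
      f"specificity={specificity:.3f} accuracy={accuracy:.3f}")
```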
Regression
Metrics
RMSE
Root Mean Squared Error
MAE
Mean Absolute Error
R Squared
1 − MSE / Var(Y): the numerator of the subtracted fraction is the MSE (average of the squared residuals) and the denominator is the variance of the Y values
Adjusted R^2
Tips
RMSE >= MAE
Important distinction between MAE & RMSE
minimizing the squared error over a set of numbers results in finding its mean, and minimizing the absolute error results in finding its median
This is the reason why MAE is robust to outliers whereas RMSE is not
RMSE is still the default metric
a loss function defined in terms of RMSE is smoothly differentiable
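A small NumPy sketch of the regression metrics above (the arrays are illustrative; note that RMSE >= MAE always holds):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # illustrative ground-truth values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # illustrative predictions

mae  = np.mean(np.abs(y_true - y_pred))                      # Mean Absolute Error
mse  = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)                                          # Root Mean Squared Error
r2   = 1.0 - mse / np.var(y_true)                            # R^2 = 1 - MSE / Var(Y)
msle = np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)   # Mean Squared Log Error

print(rmse, mae, r2, msle)
```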
Loss
Definitions
Loss function
Usually a function defined on a data point, prediction and label, and measures the penalty
Cost function
Usually more general. It might be a sum of loss functions over your training set plus some model complexity penalty (regularization)
Objective function
most general term for any function that you optimize during training
Classification
Loss function for classification
Losses
Log Loss
a special case of cross entropy loss where the number of classes n=2
Focal Loss
Class imbalance issue in object detection
Added a weighted term in front of cross entropy
Cross Entropy Loss: CE(p_t) = −log(p_t); Focal Loss: FL(p_t) = −(1 − p_t)^γ log(p_t) (see the sketch after this Losses list)
Binary Cross Entropy: BCE = −[y log(p) + (1 − y) log(1 − p)]
Binary Case
KL Divergence / Cross Entropy
Binary Cross-Entropy
Multi-class Cross-Entropy
KL Divergence
Exponential Loss
Hinge Loss
Hinge
Squared Hinge
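A minimal NumPy sketch of binary cross-entropy (log loss) and the focal-loss weighting described above; gamma = 2 and alpha = 0.25 are commonly quoted defaults, used here only as illustrative values:

```python
import numpy as np

def binary_cross_entropy(y_true, p):
    """Log loss: the n=2 special case of cross entropy."""
    p = np.clip(p, 1e-7, 1 - 1e-7)            # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def focal_loss(y_true, p, gamma=2.0, alpha=0.25):
    """Cross entropy with a (1 - p_t)^gamma weighting term that
    down-weights easy examples (used against class imbalance)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    p_t = np.where(y_true == 1, p, 1 - p)      # probability of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    return -np.mean(alpha_t * (1 - p_t) ** gamma * np.log(p_t))

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.6, 0.3])
print(binary_cross_entropy(y, p), focal_loss(y, p))
```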
Regression
Losses
Quantile Loss
Mean Squared Error / Quadratic Loss / L2
MSE
L2
Mean Absolute Error / L1
MAE
L1
Huber Loss / Smooth MAE Loss
Approaches MAE as sigma → 0
Approaches MSE as sigma → ∞
Log Cosh Loss
Smoother than L2
Mean Squared Logarithmic Error
usually used when you do not want to penalize huge differences in the predicted and the actual values when both predicted and true values are huge numbers
both predicted and actual values are small
MSE = MSLE
either predicted or the actual value is big
MSE > MSLE
both predicted and actual values are big
MSE > MSLE (MSLE is very small)
Key Takeaways
MSE vs MAE (L2 vs L1)
MSE easier to solve, stable and closed form solution, but sensitive to outliers
MAE more robust to outliers
MAE useful when data has more outliers
Problem: the gradient stays large even for small losses, and the derivative is not continuous at 0
Huber loss
Less sensitive to outliers than MSE
Differentiable at 0, unlike MAE
Problem: Choice of sigma is critical, need to find good hyperparameter
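A NumPy sketch of the regression losses compared above; `delta` plays the role of the sigma threshold mentioned under Huber loss, and the data includes one outlier to show the L1/L2 difference:

```python
import numpy as np

def mse(y, yhat):      return np.mean((y - yhat) ** 2)           # L2
def mae(y, yhat):      return np.mean(np.abs(y - yhat))           # L1
def log_cosh(y, yhat): return np.mean(np.log(np.cosh(yhat - y)))  # smoother than L2

def huber(y, yhat, delta=1.0):
    """Quadratic near zero (like MSE), linear for large errors (like MAE)."""
    err = np.abs(y - yhat)
    quad = 0.5 * err ** 2
    lin  = delta * err - 0.5 * delta ** 2
    return np.mean(np.where(err <= delta, quad, lin))

y    = np.array([1.0, 2.0, 3.0, 100.0])    # last point is an outlier
yhat = np.array([1.1, 1.9, 3.2, 10.0])
print(mse(y, yhat), mae(y, yhat), huber(y, yhat), log_cosh(y, yhat))
```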
GAN
Losses
Original GAN
Jensen Shannon Divergence (JSD)
Problems
Vanishing gradient
Hard to achieve Nash equilibrium
Mode collapse
minibatch discrimination
Feature Matching
Improved GAN Training
WGAN
Earth-Mover (EM) distance (AKA Wasserstein distance)
Why Wasserstein is better than JS or KL divergence
Even when two distributions are located in lower dimensional manifolds without overlaps, Wasserstein distance can still provide a meaningful and smooth representation of the distance in-between.
Implementation
WGAN-GP
gradient penalty
LSGAN
matching their generated distribution to a real data distribution
Other
Cosine Proximity ?
Poisson
Temporal
CTC (Connectionist Temporal Classification)
Solves the problem of not having the alignment of each character in the sequence
Only requires an audio file (video) as input and a corresponding transcription
introducing a pseudo-character (blank)
when encoding a text, we can insert arbitrary many blanks at any position, which will be removed when decoding it
Loss calculation
The score for one alignment (or path, as it is often called in the literature) is calculated by multiplying the corresponding character scores together
To get the score for a given GT text, we sum over the scores of all paths corresponding to this text
E.g. for "a", sum over "aa" + "a-" + "-a" (see the sketch after the Decoding list)
Decoding
best path decoding (greedy)
it calculates the best path by taking the most likely character per time-step
first removing duplicate characters and then removing blanks
Duplicated letters must have a blank in between
beam-search decoding
prefix-search decoding
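A toy sketch of the two CTC ideas above: summing the scores of all alignments ("aa", "a-", "-a") that collapse to the ground-truth text "a" over two time steps, and best-path (greedy) decoding. The probability matrix is made up:

```python
import numpy as np

# Toy per-time-step distribution over the alphabet {a, b, -} ('-' = blank).
# Rows are time steps; values are made up for illustration.
chars = ['a', 'b', '-']
probs = np.array([[0.6, 0.1, 0.3],
                  [0.5, 0.2, 0.3]])

# Score of GT text "a" = sum over all alignments that collapse to "a":
# multiply the per-step scores within a path, then sum over the paths.
score_a = (probs[0, 0] * probs[1, 0]      # "aa"
           + probs[0, 0] * probs[1, 2]    # "a-"
           + probs[0, 2] * probs[1, 0])   # "-a"
print("P('a') =", score_a)

def best_path_decode(probs, chars, blank='-'):
    """Greedy decoding: most likely char per step, collapse duplicates, drop blanks."""
    best = [chars[i] for i in probs.argmax(axis=1)]
    collapsed = [c for i, c in enumerate(best) if i == 0 or c != best[i - 1]]
    return ''.join(c for c in collapsed if c != blank)

print(best_path_decode(probs, chars))   # -> "a"
```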
Architecture
Attention
Transformer
Image Transformer
Attention Type
Bahdanau Attention
For each decoder output step, repeat (sketched in code after step 5)
1. Producing the Encoder Hidden States
Encoder produces hidden states of each element in the input sequence
2. Calculating Alignment Scores
Scores between the previous decoder hidden state and each of the encoder's hidden states are calculated
3. Softmaxing the Alignment Scores
the alignment scores for each encoder hidden state are combined and represented in a single vector and subsequently softmaxed
4. Calculating the Context Vector
the encoder hidden states and their respective alignment scores are multiplied to form the context vector
5. Decoding the Output
the context vector is concatenated with the previous decoder output and fed into the Decoder RNN
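A minimal NumPy sketch of one attention step following the five stages above, using additive (Bahdanau-style) scoring; the dimensions and weight matrices are random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                       # 5 encoder time steps, hidden size 8 (illustrative)
enc_h = rng.normal(size=(T, d))   # 1. encoder hidden states
dec_h = rng.normal(size=(d,))     # previous decoder hidden state

# 2. additive (Bahdanau-style) alignment scores
W_enc, W_dec, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(d,))
scores = np.tanh(enc_h @ W_enc + dec_h @ W_dec) @ v       # shape (T,)

# 3. softmax the scores into attention weights
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# 4. context vector = attention-weighted sum of encoder hidden states
context = weights @ enc_h                                  # shape (d,)

# 5. the context vector is then concatenated with the previous decoder
#    output and fed into the decoder RNN (not shown here)
print(weights.round(3), context.shape)
```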
Luong Attention
Differences from Bahdanau
calculating alignment score
position at which the Attention mechanism is being introduced in the decoder
Self-attention (intra-attention)
Soft vs Hard Attention
Global vs Local Attention
Global
use ALL encoder hidden states
Local
attends to only a few hidden states that fall within a smaller window
Problems
Memory consuming
Model is large
Input hidden state, context vector
Attention for CNN
Attention based CNN
RNN
Seq2seq
Encoder-Decoder
With Attention
Seq2seq problem
Incapable of remembering long sentences
Beam search decoding?
CNN
Backbone
BlazeNet
Regression
Detector
Separable Bottleneck
MobileNet
V1
Bottleneck Residual Block
MobileNet V2
Key components
Inverted Residual Block
More memory efficient than residual block
Inverted Linear Residual Block
The last conv of the block uses a linear output before being added to the initial activations
ReLU6
guarantees precision to the right of the decimal point
MobileNet V3
Notes
V1 vs V2
Classic residual block: wide -> narrow -> wide
V2 inverted residual block: narrow -> wide -> narrow
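A minimal PyTorch sketch (assuming torch is available) of a MobileNetV2-style inverted residual block — narrow -> wide -> narrow, with a linear 1x1 projection before the shortcut addition; the channel count and expansion factor are illustrative:

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """narrow -> wide -> narrow, with a linear (no ReLU) projection at the end."""
    def __init__(self, channels=32, expansion=6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),           # expand (pointwise)
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1,
                      groups=hidden, bias=False),                  # depthwise 3x3
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),            # project (linear)
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)   # identity shortcut (stride 1, same channels)

x = torch.randn(1, 32, 56, 56)
print(InvertedResidual()(x).shape)   # torch.Size([1, 32, 56, 56])
```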
ShuffleNet
ResNet
Ideas
Identity Shortcut Connections
Problem: as network depth increases, accuracy saturates and then degrades rapidly
Intuition: a deeper model shouldn't perform worse than its shallow counterpart, because it could simply use identity mappings to skip the added layers
Residual Block
ResNet101 vs ResNet50
Each ResNet block is either 2 layers deep (used in smaller networks like ResNet-18/34) or 3 layers deep (ResNet-50/101/152)
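A minimal PyTorch sketch of the basic 2-layer residual block with an identity shortcut (the ResNet-18/34 variant; ResNet-50/101/152 use the 3-layer bottleneck instead); channel sizes are illustrative:

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """2-layer residual block: out = ReLU(F(x) + x), identity shortcut."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1   = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # skip connection: the block can fall back to identity

x = torch.randn(1, 64, 56, 56)
print(BasicResidualBlock()(x).shape)   # torch.Size([1, 64, 56, 56])
```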
ResNeXt
Cardinality
the number of independent paths, to provide a new way of adjusting the model capacity
AlexNet
Layers
MaxPooling
MaxUnpooling
Fully connected
Bilinear Upsampling
Softmax
Convolution
Spatially Separable Convolution
Although spatially separable convolutions save cost, they are rarely used in deep learning. One of the main reasons is that not every kernel can be factored into two smaller kernels. If we replace all traditional convolutions with spatially separable convolutions, we restrict the set of kernels that can be learned during training, so the results may be sub-optimal.
Depthwise Conv2D
Each filter operates on a single channel
Point-wise Conv2D
1x1 kernels
Depthwise Separable Conv
Depthwise + Pointwise
1) Spatial Convolution on each channel (depthwise)
2) Then point-wise convolution
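A PyTorch sketch contrasting a standard 3x3 convolution with its depthwise-separable factorization (depthwise 3x3 per channel followed by a pointwise 1x1), including a parameter-count comparison; the channel sizes are illustrative:

```python
import torch
import torch.nn as nn

cin, cout = 64, 128                       # illustrative channel sizes

standard = nn.Conv2d(cin, cout, kernel_size=3, padding=1)

separable = nn.Sequential(
    nn.Conv2d(cin, cin, kernel_size=3, padding=1, groups=cin),  # depthwise: one filter per channel
    nn.Conv2d(cin, cout, kernel_size=1),                         # pointwise: 1x1 across channels
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))   # the separable version uses far fewer parameters
```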
Transposed Convolution (Deconvolution)
checkerboard artifacts
Conv2D
Strides
usually 2 for downsampling
Padding
how border is handled
Kernel Size
Input/Output Channels
Atrous conv / Dilated conv
Large receptive field without additional cost
3D conv
Group Conv
AlexNet
allows training the network across two GPUs with limited memory (3 GB per GPU)
The filters are separated into groups; each group performs a conventional 2D convolution over its subset of the input channels
advantages
Efficient training: since the convolutions are divided into several paths, each path can be handled by a different GPU
Model parameters decrease as the number of filter groups increases
Grouped convolution may provide a better model than a nominal 2D convolution
The extreme case is depthwise convolution (one group per channel)
Shuffled Grouped Convolutions
Shuffle Net
Channel Shuffling
Spatial Transformer
Components
Sampler
Grid Generator
Localization Net
Thin Plate Spline (TPS) Transformation
NAS
Autoencoder
Variational Autoencoder
their latent spaces are, by design, continuous, allowing easy random sampling and interpolation
The encoder does not output a single encoding vector of size n; instead it outputs two vectors of size n: a vector of means, μ, and a vector of standard deviations, σ
even for the same input, while the mean and standard deviations remain the same, the actual encoding will somewhat vary on every single pass simply due to sampling
Loss
KL divergence
Intuitively, this loss encourages the encoder to distribute all encodings (for all types of inputs, eg. all MNIST numbers), evenly around the center of the latent space. If it tries to “cheat” by clustering them apart into specific regions, away from the origin, it will be penalized.
reconstruction loss
An equilibrium is reached between the cluster-forming nature of the reconstruction loss and the dense-packing nature of the KL loss, forming distinct clusters that the decoder can decode
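A NumPy sketch of the two VAE loss terms described above — a reconstruction term plus the KL term that pulls each per-input Gaussian toward N(0, I) — together with the reparameterized sampling that makes the encoding vary between passes. All shapes and values are illustrative stand-ins (the decoder is faked with random numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16                                   # latent dimension (illustrative)

x      = rng.random(784)                 # a flattened input, e.g. an MNIST image
mu     = rng.normal(size=n)              # encoder output: vector of means
logvar = rng.normal(size=n)              # encoder output: log of variances

# Reparameterization: same (mu, sigma), but a different sample on every pass
eps = rng.normal(size=n)
z   = mu + np.exp(0.5 * logvar) * eps

x_hat = rng.random(784)                  # stand-in for decoder(z)

recon = np.mean((x - x_hat) ** 2)        # reconstruction loss (MSE here)
kl    = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))   # KL(N(mu, sigma) || N(0, I))
loss  = recon + kl
print(recon, kl, loss)
```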
Conditional VAE
Loss
Reconstruction Loss
Problem with Standard Autoencoder
The latent space they convert their inputs to, where the encoded vectors lie, may not be continuous or allow easy interpolation
But when you build a generative model, you want to randomly sample from the latent space, or generate variations of an input image, which requires a continuous latent space
Domains
CV
Detection
Two Stage Models
R-CNN
Selective Search
Resize each proposed region to the same size -> CNN -> Classifier
SPPNet
Spatial Pyramid Pooling
Fast R-CNN
ROI Pooling
End to end training
Multi-task loss
Faster R-CNN
Region Proposal Network (RPN)
Replaced Selective Search
Mask R-CNN
RoIAlign
Fully Convolutional Network(FCN) as second branch
One Stage Models
YOLOv1
SSD
YOLOv2
RetinaNet
Focal Loss
One-stage detectors suffer from class imbalance
Anchor Free Models
Which is better?
One stage usually worse than Two stage
due to Class Imbalance
Selective Search or RPN can filter out many background samples
Focal Loss
Challenges
Segmentation
Instance Segmentation
Semantic Segmentation
Fully Convolutional Network (FCN)
PAMI 2016
end to end convolutional networks
Pretrain ImageNet Classification
Finetune on segmentation (Per pixel loss)
Upsample using deconvolutional layers
Introduce skip connections to improve over the coarseness of upsampling
Also pretrain on ImageNet
SegNet (2015)
Encoder-Decoder
Maxpooling indices transferred to decoder to improve the segmentation resolution
Using max-pooling indices instead of full encoder feature maps makes it more efficient than FCN
UNet
MICCAI 2015
Multi-Scale Context + Dilated Convolution
ICLR 2016
Dilated Conv
Pooling layers designed for classification are not ideal for segmentation
Last two pooling layers (of VGG) removed
Subsequent convs replaced by dilated convs
exponential increase in field of view without decrease of spatial dimensions
Context Module
cascade of dilated convolutions of different dilations so that multi scale context is aggregated
Also pretrain on ImageNet
DeepLab V1&V2
PAMI 2017
Atrous Conv (Dilated Conv)
Atrous Spatial Pyramid Pooling
Fully connected CRF
Also pretrain on ImageNet
DeepLabV3
DeeplabV3+
FastFCN
Joint Pyramid Upsampling (JPU)
Replace Dilated Conv
Dilated convs consume a lot of time and memory
RefineNet
PSPNet
Video Object Segmentation
GAN
Image Generation
GAN
DCGAN
Deep convolutional generative adversarial networks
AC-GAN
multiscale structural similarity (MS-SSIM)
multi-scale variant of a well-characterized perceptual similarity metric that attempts to discount aspects of an image that are not important for human perception
ProgressiveGAN
WGAN-GP
AC-GAN
StyleGAN
BigGAN
MSG-GAN
Multi-Scale Gradient GAN
cGAN
CycleGAN
Pix2Pix
PatchGAN
each pixel in the output maps to a 70x70 patch in the input image
StarGAN
ACGAN (Auxiliary Classifier GAN)
DAGAN (Data Augmentation GAN)
Style Transfer
Pose Estimation
Method
Heatmap
Regression
Metrics
PCK (Percentage of Correct Keypoints)
A detected joint is considered correct if the distance between the predicted and the true joint is within a certain threshold.
PCP Percentage of Correct Parts
A limb is considered detected (a correct part) if the distance between the two predicted joint locations and the true limb joint locations is less than half of the limb length (Commonly denoted as PCP@0.5)
Percentage of Detected Joints - PDJ
A detected joint is considered correct if the distance between the predicted and the true joint is within a certain fraction of the torso diameter
Object Keypoint Similarity (OKS) based mAP
Papers
DeepPose (CVPR 14)
L2 Regression
AlexNet Backbone
cascaded regressors
Images are cropped around the predicted joint and fed to the next stage
subsequent pose regressors see higher resolution images and thus learn features for finer scales which ultimately leads to higher precision
Efficient Object Localization Using Convolutional Networks (CVPR 15)
output is a discrete heatmap instead of continuous regression
A heatmap predicts the probability of the joint occurring at each pixel
additional convolutional model for fine-tuning
a graphical model learns typical spatial relationships between joints
Convolutional Pose Machines (CVPR 16)
Tracking
DeepPursuit 2D Approach
3D Perception
Point Cloud
PointNet
Permutation (Order) Invariance
ordering of points doesn't impact the underlying geometry
A shared Multi-layer perceptron maps each point to 64 dimension feature
the mapping is identical for and independent across the n points
Transformation Invariance
rotation
scaling
translation
T-Net
apply an appropriate rigid or affine transformation to achieve pose normalization
applying a geometric transformation simply amounts to matrix multiplying each point with a transformation matrix
no local feature
PointNet++
hierarchical feature learning layer
local features
3D Voxels
Mesh
GraphCNN
Multi-Task Learning
Which Tasks Should Be Learned Together in Multi-task Learning?
Transfer Learning
Metric Learning
Reinforcement Learning
NLP
BERT
Regularization
L1
L2
Dropout
Weight regularization?
Bias regularization?
Data Augmentation
Early Stopping
Optimization
Backpropagation
Optimizer
BGD
batch gradient descent
has to calculate the cost over all training examples in the dataset for every update
SGD
Momentum
converge faster
adds a temporal element (the history of past gradients) to the parameter-update equation
infers the overall direction from past directions, e.g. when updates oscillate up and down the y-axis while the real progress is horizontal (see the update-rule sketch at the end of the Optimizer list)
SGD + Momentum
To speed up SGD
Adaptive Learning Rates
AdaGrad
RMSProp
Adam (Adaptive Moment Estimation)
uses Momentum and Adaptive Learning Rates
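A NumPy sketch of the update rules above: SGD with momentum (accumulating a direction from past gradients) and Adam (momentum plus a per-parameter adaptive learning rate). The hyperparameters are the commonly used defaults; the weights and gradient are made up:

```python
import numpy as np

def sgd_momentum_step(w, grad, v, lr=0.01, beta=0.9):
    """v remembers past directions; damps oscillation, speeds up the consistent axis."""
    v = beta * v + grad
    return w - lr * v, v

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """First moment (momentum) + second moment (adaptive per-parameter learning rate)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)              # bias correction
    v_hat = v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

w = np.array([1.0, -2.0]); grad = np.array([0.3, -0.1])
w, v_mom = sgd_momentum_step(w, grad, np.zeros_like(w))
w, m, v = adam_step(w, grad, np.zeros_like(w), np.zeros_like(w), t=1)
print(w)
```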
Normalization
Batch Norm
At inference time
no batch at inference
use population mean/variance
the law of large numbers
At training time
moving average of the mean/variance (see the sketch after this Batch Norm block)
Before/After activation
Original paper: before activation
seems comparable or better after activation according to many experiments
even Christian Szegedy now likes to perform BatchNorm after the ReLU
why not BN?
small batch size
Recurrent connections in an RNN
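A NumPy sketch of the train/inference behavior described above: batch statistics plus a moving-average (running) estimate during training, and the stored population statistics at inference:

```python
import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               training, momentum=0.1, eps=1e-5):
    """x: (batch, features). Returns normalized x and updated running stats."""
    if training:
        mean, var = x.mean(axis=0), x.var(axis=0)        # batch statistics
        running_mean = (1 - momentum) * running_mean + momentum * mean
        running_var  = (1 - momentum) * running_var  + momentum * var
    else:
        mean, var = running_mean, running_var             # population estimates
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta, running_mean, running_var

x = np.random.default_rng(0).normal(size=(32, 4))
gamma, beta = np.ones(4), np.zeros(4)
rm, rv = np.zeros(4), np.ones(4)
y, rm, rv = batch_norm(x, gamma, beta, rm, rv, training=True)
y_inf, _, _ = batch_norm(x, gamma, beta, rm, rv, training=False)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))
```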
Instance Norm
network should be agnostic to the contrast of the original image
performs well on style transfer when replacing batch normalization
Spectral Normalization
GAN
VITON: An Image-based Virtual Try-on Network
Performance
Distillation
Quantization
Hardware
GPU
Activation Functions
Sigmoid
Real number -> [0, 1]
No longer widely used
Saturates and kills gradients: when the absolute value of the input is large, the gradient in the tails is almost 0, so during backpropagation the chain rule effectively kills all of the nodes before it; for example, if the initial weights are large, all of the neurons may saturate
Not zero centered
Similar to Neuron activation
Tanh
Real number -> [-1, 1]
Zero centered, but still have saturate problem as sigmoid
ReLU (Rectified Linear Unit)
Advantages
Accelerates the convergence of stochastic gradient descent
Computationally simpler and cheaper than the two activations above
Disadvantages
Fragile and can die: if a large gradient flows through, the unit's output may get stuck at 0 and never update again
For example, with a learning rate that is too large, as many as 40% of the units can die
Choosing an appropriate learning rate mitigates this to some extent
ReLU6
the upper bound encouraged their model to learn sparse features earlier
Leaky ReLU
Addresses the dying-ReLU problem by using a small negative slope for x < 0
The consistency of its benefit has not yet been verified
PreLU (Parameterized ReLU)
Maxout
A nonlinear function that generalizes ReLU and Leaky ReLU
The drawback is that it doubles the number of parameters
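A NumPy sketch of the activations above and their gradients, illustrating why sigmoid/tanh saturate (gradients near zero for large |x|), why ReLU units can die (gradient exactly 0 for x < 0), and how Leaky ReLU avoids that:

```python
import numpy as np

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

sigmoid = 1 / (1 + np.exp(-x))
tanh    = np.tanh(x)
relu    = np.maximum(0, x)
relu6   = np.clip(x, 0, 6)
leaky   = np.where(x > 0, x, 0.01 * x)         # small negative slope for x < 0

d_sigmoid = sigmoid * (1 - sigmoid)            # nearly 0 at x = +/-10: saturation
d_tanh    = 1 - tanh**2                        # also saturates, but output is zero-centered
d_relu    = (x > 0).astype(float)              # exactly 0 for x < 0: units can "die"
d_leaky   = np.where(x > 0, 1.0, 0.01)         # never exactly 0

print(d_sigmoid.round(5), d_relu, d_leaky)
```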