Jenny Seidenschwarz
Technische Universität München
Seminar Course AutoML
Munich, 4th of July 2019
Searching for Activation Functions
Activation Functions
• Gradient-preserving property
• Easier to optimize
Figure: Basic structure of a neural network (inputs x₁ … xₙ, weights w₁ⱼ … wₙⱼ, a linear transformation, then the activation function)
State-of-the-art default scalar activation function: ReLU, max(0, x)
Figure: ReLU activation function
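As an illustration (not from the slides), here is a minimal sketch of where the activation function sits in a single dense layer; the layer sizes and weight initialization are arbitrary:

```python
import numpy as np

def relu(x):
    """ReLU activation: max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

# One dense layer: a linear transformation followed by the activation.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # weights: 4 inputs -> 3 units
b = np.zeros(3)               # biases
x = rng.normal(size=4)        # input vector (x1 ... xn with n = 4)

preactivation = x @ W + b     # linear transformation
output = relu(preactivation)  # nonlinearity applied elementwise
print(output)
```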
Research Goal
Find new scalar activation functions …
… using an automated search technique …
… compare them systematically to existing activation functions …
… across multiple challenging datasets!
Automated Search
Challenge: balance the size and expressivity of the search space
→ Simple binary expression tree [1]
→ Selection of unary and binary functions
Search Space
Unary: x, −x, |x|, x², x³, √x, βx, x + β, log(|x| + ε), exp(x), sin(x), cos(x), sinh(x), cosh(x), tanh(x), tan⁻¹(x), sinh⁻¹(x), sinc(x), max(0, x), min(0, x), σ(x), log(1 + exp(x)), exp(−x²), erf(x), β
Binary: x₁ + x₂, x₁·x₂, x₁ − x₂, x₁/(x₂ + ε), max(x₁, x₂), min(x₁, x₂), σ(x₁)·x₂, exp(−β(x₁ − x₂)²), exp(−β|x₁ − x₂|), βx₁ + (1 − β)x₂
Figure: Core unit — a binary function combining two unary functions of the input, adapted from [5]
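To make the core-unit structure concrete, here is a hedged sketch of how a candidate activation function is assembled as binary(unary₁(x), unary₂(x)); only a small subset of the search space above is included, and β is fixed to 1.0:

```python
import math
import random

BETA = 1.0  # treated as a fixed constant in this sketch
UNARY = {
    "x":        lambda x: x,
    "-x":       lambda x: -x,
    "tanh(x)":  math.tanh,
    "max(0,x)": lambda x: max(0.0, x),
    "sigma(x)": lambda x: 1.0 / (1.0 + math.exp(-x)),
}
BINARY = {
    "add":      lambda a, b: a + b,
    "mul":      lambda a, b: a * b,
    "max":      max,
    "weighted": lambda a, b: BETA * a + (1.0 - BETA) * b,  # beta*x1 + (1-beta)*x2
}

def sample_core_unit():
    """Sample one candidate activation: binary(unary1(x), unary2(x))."""
    u1, u2 = random.choice(list(UNARY)), random.choice(list(UNARY))
    b = random.choice(list(BINARY))
    fn = lambda x: BINARY[b](UNARY[u1](x), UNARY[u2](x))
    return f"{b}({u1}, {u2})", fn

name, fn = sample_core_unit()
print(name, "at x = 0.5 ->", fn(0.5))
```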
Search approach
Small search space → exhaustive search
Large search space → reinforcement learning-based search
Search loop: train child networks with the found activation functions → get a list of the best-performing functions → update the search algorithm and find the best → empirical evaluation and experiments
RNN controller [2] with a domain-specific language [1]
Each batch of generated activation functions is evaluated by training a child network (see the sketch after this list):
• ResNet-20
• Image classification on CIFAR-10
• 10k steps
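A hedged sketch of the outer search loop described above; `controller`, `train_child_network`, and the `update` method are hypothetical stand-ins for the RNN controller, the ResNet-20 child trained on CIFAR-10 for ~10k steps (returning validation accuracy as the reward), and the controller update step:

```python
def search(controller, train_child_network, n_rounds, batch_size, top_k=10):
    """Sketch of the search loop: sample, evaluate, update, keep the best."""
    best = []  # (reward, activation_function) pairs
    for _ in range(n_rounds):
        candidates = [controller.sample() for _ in range(batch_size)]
        # Reward each candidate by the accuracy of a small child network
        # trained for a fixed, short budget.
        rewards = [train_child_network(fn) for fn in candidates]
        controller.update(candidates, rewards)  # e.g. a PPO step
        best = sorted(best + list(zip(rewards, candidates)),
                      key=lambda pair: pair[0], reverse=True)[:top_k]
    return best  # best-performing functions go on to empirical evaluation
```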
RNN-Controller
Figure: RNN-Controller architecture [5]
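A hedged sketch of autoregressive sampling from an RNN controller: one softmax prediction per core-unit component (unary 1, unary 2, binary), with each choice fed back as the next input. All sizes, initializations, and the feedback embedding are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N_UNARY, N_BINARY, HIDDEN = 26, 10, 16
W_in  = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))
W_rec = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))
heads = [rng.normal(scale=0.1, size=(N_UNARY, HIDDEN)),   # unary 1
         rng.normal(scale=0.1, size=(N_UNARY, HIDDEN)),   # unary 2
         rng.normal(scale=0.1, size=(N_BINARY, HIDDEN))]  # binary

def sample_core_unit_indices():
    h = np.zeros(HIDDEN)   # recurrent state
    x = np.zeros(HIDDEN)   # start token
    indices = []
    for head in heads:     # one RNN step per core-unit component
        h = np.tanh(W_in @ x + W_rec @ h)
        logits = head @ h
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        idx = int(rng.choice(len(probs), p=probs))
        indices.append(idx)
        x = np.zeros(HIDDEN)
        x[idx % HIDDEN] = 1.0   # crude embedding of the previous choice
    return indices  # indices into the unary/unary/binary function lists

print(sample_core_unit_indices())
```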
RNN-Controller update
Policy gradient methods: policy $\pi(a_t \mid s_t, \theta_c)$, updated via $\Delta\theta \Leftarrow \alpha \nabla\mathcal{L}_{\theta_c}$

RNN controller with REINFORCE:
• Objective function: $\mathcal{L}_{\theta_c} = \mathbb{E}_{\pi_\theta,\tau}[G_t]$, where $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_{k+1}$

RNN controller with PPO:
• Objective function: $L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(\sigma_t G_t,\ \mathrm{clip}(\sigma_t, 1-\varepsilon, 1+\varepsilon)\, G_t\right)\right]$, with $\sigma_t = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$
→ Clipping keeps updates within a "trust region"
→ Sample efficient
Figure: clipping function, adapted from [3]
RNN-Controller update
Policy gradient methods: $\pi(a_t \mid s_t, \theta_c)$, updated via $\Delta\theta \Leftarrow \alpha \nabla\mathcal{L}_{\theta_c}$
Objective gradient for one child network: $\nabla\mathcal{L}_{\theta_c} = \sum_{t=1}^{T} \nabla_{\theta_c} \log \pi(a_t \mid s_t, \theta_c)\,(G - b)$
→ $G$ = accuracy of the trained child network (reward)
→ $b$ = exponential moving average of rewards
Figure: clipping function, adapted from [3]
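A minimal sketch of both update rules from the formulas above; `logprob_grads` is a hypothetical stand-in for the per-action terms $\nabla_{\theta_c} \log \pi(a_t \mid s_t, \theta_c)$, and the baseline is the exponential moving average of rewards:

```python
import numpy as np

def reinforce_update(theta, logprob_grads, reward, baseline, lr=1e-2):
    """One REINFORCE step: theta += lr * sum_t grad_log_pi_t * (G - b)."""
    advantage = reward - baseline
    for grad in logprob_grads:          # sum over the T sampled actions
        theta = theta + lr * advantage * grad
    return theta

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate L^CLIP; the ratio plays the role of sigma_t."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

def update_baseline(baseline, reward, decay=0.9):
    """Exponential moving average of rewards, used as b."""
    return decay * baseline + (1.0 - decay) * reward
```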
Findings on Activation Functions
1. Activation functions built from 1–2 core units perform best
2. The top activation functions always take the raw preactivation x as input to the final binary function
3. Periodic functions (sin, cos, etc.) are used by some top-performing activation functions
4. Activation functions that use division perform poorly
Findings on Activation Functions
The eight best activation functions found:
• x·σ(βx)
• x·(sinh⁻¹(x))²
• min(x, sin(x))
• (tan⁻¹(x))² − x
• max(x, σ(x))
• cos(x) − x
• max(x, tanh(x))
• sinc(x) + x
Figure: best 8 activation functions [5]
Experiments to ensure generalization to deeper networks
Validation of Performance
Figure: Generalization to deeper architectures of the 8 best activation functions found [5]
(a) CIFAR-10 accuracy (b) CIFAR-100 accuracy

Swish: $f(x) = x \cdot \sigma(\beta x)$
• Nonlinearly interpolates between the linear function and ReLU (controlled by β)
• Smooth function
• Non-monotonic function
• Unbounded above and bounded below (like ReLU)
Figure: Swish activation function for different β values (see the sketch below)
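A minimal NumPy sketch of Swish that illustrates the interpolation: β = 0 gives the scaled linear function x/2, while large β approaches ReLU:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    """Swish: f(x) = x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)

x = np.linspace(-4.0, 4.0, 9)
for beta in (0.0, 1.0, 10.0):   # linear (x/2) ... ReLU-like
    print(f"beta={beta}:", np.round(swish(x, beta), 3))
```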
Benchmark of Swish
Benchmarked Swish against ReLU and other baseline activation functions:
• Different models
• Different challenging real-world datasets
• Tested with fixed β = 1 and trainable β
Further Experiments with Swish
CIFAR-10 and CIFAR-100: ResNet-164, Wide ResNet 28-10, and DenseNet 100-12
• Median of 5 runs for comparison
ImageNet: Inception-ResNet-v2, Inception-v4, Inception-v3, MobileNet, and Mobile NASNet-A
• Fixed number of steps, 3 learning rates with RMSProp
• Especially good performance on mobile-sized models; slightly underperforms on Inception-v4
English→German translation: 12-layer Base Transformer
• Two different learning rates, 300K steps with the Adam optimizer
Further Experiments with Swish
Figure: Overview of performance in the experiments [5]
Performance of Swish
Swish – learnable parameter β
Learnable β:
Figure: distribution of trained β values on Mobile NASNet-A [5]
Swish – Challenging Current Beliefs
Non-monotonic bump for x < 0:
Figure: Preactivations for β = 1 on ResNet-32 [5]
Figure: Swish function for different β values, showing the non-monotonic bump
Swish – Challenging Current Beliefs
The derivative does not have ReLU's gradient-preserving property (it is not identically 1 for x > 0):
Figure: Derivative of the Swish function for different β values
Figure: Derivative of ReLU
Conclusion
Main contributions:
• Used a search space as in [1] to find activation functions with an RNN controller [2] that was updated with PPO [3]
• Systematically compared activation functions
• Found a new activation function that consistently outperforms or is on par with ReLU
Critical aspects:
• The search space restricts the results
• The search space was designed after human intuition
• The restriction of training steps and training on small architectures might suppress even better activation functions
Future research:
• Use only two core units, but more unary and binary functions
• Also take non-scalar activation functions into account
References
[1] Bello, I., Zoph, B., Vasudevan, V., & Le, Q. V. (2017). Neural Optimizer Search with Reinforcement Learning.
[2] Zoph, B., & Le, Q. V. (2017). Neural Architecture Search with Reinforcement Learning. ArXiv, abs/1611.01578.
[3] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. ArXiv, abs/1707.06347.
[4] Elfwing, S., Uchibe, E., & Doya, K. (2018). Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning. Neural Networks, 107. 10.1016/j.neunet.2017.12.012.
[5] Ramachandran, P., Zoph, B., & Le, Q. V. (2018). Searching for Activation Functions. ArXiv, abs/1710.05941.
Back-up
Things to note
If you want to use Swish:
• Already implemented in TensorFlow as tf.nn.swish(x)
• When using batch norm: set the scale parameter
• Derivative of Swish: $f'(x) = \beta f(x) + \sigma(\beta x)(1 - \beta f(x))$ (checked in the sketch below)
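For example (assuming TensorFlow 2.x), tf.nn.swish can be checked against the derivative formula above with autodiff:

```python
import tensorflow as tf

x = tf.constant([-2.0, -1.0, 0.0, 1.0, 2.0])

# Gradient of Swish via autodiff ...
with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.nn.swish(x)          # x * sigmoid(x), i.e. beta = 1
grad = tape.gradient(y, x)

# ... versus the closed form f'(x) = beta*f(x) + sigmoid(beta*x)*(1 - beta*f(x))
manual = y + tf.sigmoid(x) * (1.0 - y)   # with beta = 1
print(grad.numpy())
print(manual.numpy())
```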
Experiment Results - CIFAR
Figure: Benchmark experiments comparing Swish to baseline activation functions on CIFAR [5]
(a) CIFAR-10 accuracy (b) CIFAR-100 accuracy
Experiments on ImageNet
Figure: Benchmark experiments comparing Swish to baseline activation functions on ImageNet [5]
(a) Training curves of Mobile NASNet-A on ImageNet. Best viewed in color.
(b) Mobile NASNet-A on ImageNet, with 3 different runs ordered by top-1 accuracy. The additional 2 GELU experiments were still training at the time of submission.
Experiments on ImageNet
Figure: Benchmark experiments comparing Swish to baseline activation functions on ImageNet [5]
(a) Inception-ResNet-v2 on ImageNet with 3 different runs. Note that ELU sometimes has instabilities at the start of training, which accounts for the first result.
(b) MobileNet on ImageNet.
Experiments on ImageNet
Figure: Benchmark experiments comparing Swish to baseline activation functions on ImageNet [5]
(a) Inception-v3 on ImageNet (b) Inception-v4 on ImageNet
Experiments on Machine Translation
Figure: Benchmark experiments comparing Swish to baseline activation functions on WMT English→German translation (BLEU score) [5]