- Minimum Width of Leaky-ReLU Neural Networks for Uniform Universal Approximation The study of universal approximation properties (UAP) for neural networks (NN) has a long history. When the network width is unlimited, only a single hidden layer is sufficient for UAP. In contrast, when the depth is unlimited, the width for UAP needs to be not less than the critical width w^*_{min}=max(d_x,d_y), where d_x and d_y are the dimensions of the input and output, respectively. Recently, cai2022achieve shows that a leaky-ReLU NN with this critical width can achieve UAP for L^p functions on a compact domain K, i.e., the UAP for L^p(K,R^{d_y}). This paper examines a uniform UAP for the function class C(K,R^{d_y}) and gives the exact minimum width of the leaky-ReLU NN as w_{min}=max(d_x+1,d_y)+1_{d_y=d_x+1}, which involves the effects of the output dimensions. To obtain this result, we propose a novel lift-flow-discretization approach that shows that the uniform UAP has a deep connection with topological theory. 4 authors · May 29, 2023
- Some Theoretical Results on Layerwise Effective Dimension Oscillations in Finite Width ReLU Networks We analyze the layerwise effective dimension (rank of the feature matrix) in fully-connected ReLU networks of finite width. Specifically, for a fixed batch of m inputs and random Gaussian weights, we derive closed-form expressions for the expected rank of the \mtimes n hidden activation matrices. Our main result shows that E[EDim(ell)]=m[1-(1-2/pi)^ell]+O(e^{-c m}) so that the rank deficit decays geometrically with ratio 1-2 / pi approx 0.3634. We also prove a sub-Gaussian concentration bound, and identify the "revival" depths at which the expected rank attains local maxima. In particular, these peaks occur at depths ell_k^*approx(k+1/2)pi/log(1/rho) with height approx (1-e^{-pi/2}) m approx 0.79m. We further show that this oscillatory rank behavior is a finite-width phenomenon: under orthogonal weight initialization or strong negative-slope leaky-ReLU, the rank remains (nearly) full. These results provide a precise characterization of how random ReLU layers alternately collapse and partially revive the subspace of input variations, adding nuance to prior work on expressivity of deep networks. 1 authors · Jul 10 1
- SAU: Smooth activation function using convolution with approximate identities Well-known activation functions like ReLU or Leaky ReLU are non-differentiable at the origin. Over the years, many smooth approximations of ReLU have been proposed using various smoothing techniques. We propose new smooth approximations of a non-differentiable activation function by convolving it with approximate identities. In particular, we present smooth approximations of Leaky ReLU and show that they outperform several well-known activation functions in various datasets and models. We call this function Smooth Activation Unit (SAU). Replacing ReLU by SAU, we get 5.12% improvement with ShuffleNet V2 (2.0x) model on CIFAR100 dataset. 4 authors · Sep 27, 2021
1 Mish: A Self Regularized Non-Monotonic Activation Function We propose Mish, a novel self-regularized non-monotonic activation function which can be mathematically defined as: f(x)=xtanh(softplus(x)). As activation functions play a crucial role in the performance and training dynamics in neural networks, we validated experimentally on several well-known benchmarks against the best combinations of architectures and activation functions. We also observe that data augmentation techniques have a favorable effect on benchmarks like ImageNet-1k and MS-COCO across multiple architectures. For example, Mish outperformed Leaky ReLU on YOLOv4 with a CSP-DarkNet-53 backbone on average precision (AP_{50}^{val}) by 2.1% in MS-COCO object detection and ReLU on ResNet-50 on ImageNet-1k in Top-1 accuracy by approx1% while keeping all other network parameters and hyperparameters constant. Furthermore, we explore the mathematical formulation of Mish in relation with the Swish family of functions and propose an intuitive understanding on how the first derivative behavior may be acting as a regularizer helping the optimization of deep neural networks. Code is publicly available at https://github.com/digantamisra98/Mish. 1 authors · Aug 23, 2019
- SplineCam: Exact Visualization and Characterization of Deep Network Geometry and Decision Boundaries Current Deep Network (DN) visualization and interpretability methods rely heavily on data space visualizations such as scoring which dimensions of the data are responsible for their associated prediction or generating new data features or samples that best match a given DN unit or representation. In this paper, we go one step further by developing the first provably exact method for computing the geometry of a DN's mapping - including its decision boundary - over a specified region of the data space. By leveraging the theory of Continuous Piece-Wise Linear (CPWL) spline DNs, SplineCam exactly computes a DNs geometry without resorting to approximations such as sampling or architecture simplification. SplineCam applies to any DN architecture based on CPWL nonlinearities, including (leaky-)ReLU, absolute value, maxout, and max-pooling and can also be applied to regression DNs such as implicit neural representations. Beyond decision boundary visualization and characterization, SplineCam enables one to compare architectures, measure generalizability and sample from the decision boundary on or off the manifold. Project Website: bit.ly/splinecam. 4 authors · Feb 24, 2023
- SmartMixed: A Two-Phase Training Strategy for Adaptive Activation Function Learning in Neural Networks The choice of activation function plays a critical role in neural networks, yet most architectures still rely on fixed, uniform activation functions across all neurons. We introduce SmartMixed, a two-phase training strategy that allows networks to learn optimal per-neuron activation functions while preserving computational efficiency at inference. In the first phase, neurons adaptively select from a pool of candidate activation functions (ReLU, Sigmoid, Tanh, Leaky ReLU, ELU, SELU) using a differentiable hard-mixture mechanism. In the second phase, each neuron's activation function is fixed according to the learned selection, resulting in a computationally efficient network that supports continued training with optimized vectorized operations. We evaluate SmartMixed on the MNIST dataset using feedforward neural networks of varying depths. The analysis shows that neurons in different layers exhibit distinct preferences for activation functions, providing insights into the functional diversity within neural architectures. 1 authors · Oct 25
- QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions We propose a new end-to-end neural acoustic model for automatic speech recognition. The model is composed of multiple blocks with residual connections between them. Each block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalization, and ReLU layers. It is trained with CTC loss. The proposed network achieves near state-of-the-art accuracy on LibriSpeech and Wall Street Journal, while having fewer parameters than all competing models. We also demonstrate that this model can be effectively fine-tuned on new datasets. 9 authors · Oct 22, 2019
1 Gompertz Linear Units: Leveraging Asymmetry for Enhanced Learning Dynamics Activation functions are fundamental elements of deep learning architectures as they significantly influence training dynamics. ReLU, while widely used, is prone to the dying neuron problem, which has been mitigated by variants such as LeakyReLU, PReLU, and ELU that better handle negative neuron outputs. Recently, self-gated activations like GELU and Swish have emerged as state-of-the-art alternatives, leveraging their smoothness to ensure stable gradient flow and prevent neuron inactivity. In this work, we introduce the Gompertz Linear Unit (GoLU), a novel self-gated activation function defined as GoLU(x) = x , Gompertz(x), where Gompertz(x) = e^{-e^{-x}}. The GoLU activation leverages the asymmetry in the Gompertz function to reduce variance in the latent space more effectively compared to GELU and Swish, while preserving robust gradient flow. Extensive experiments across diverse tasks, including Image Classification, Language Modeling, Semantic Segmentation, Object Detection, Instance Segmentation, and Diffusion, highlight GoLU's superior performance relative to state-of-the-art activation functions, establishing GoLU as a robust alternative to existing activation functions. 4 authors · Feb 5