The Alternating Direction Method of Multipliers (ADMM) has been used successfully in many conventional machine learning applications and is considered a promising alternative to Stochastic Gradient Descent (SGD) as a deep learning optimizer. However, several challenges remain in this emerging domain, including

1) The lack of global convergence guarantees,

2) Slow convergence towards solutions, and

3) Cubic time complexity with regard to feature dimensions.

In this blog post, I explain a novel optimization framework for deep learning via ADMM (dlADMM) that addresses these three challenges simultaneously.

The parameters in each layer are updated backward and then forward so that parameter information is exchanged efficiently across layers. The time complexity is reduced from cubic to quadratic in the (latent) feature dimensions via a dedicated algorithm design that solves each subproblem using iterative quadratic approximations and backtracking. Finally, dlADMM comes with the first proof of global convergence for an ADMM-based method on a deep neural network problem under mild conditions. Experiments on benchmark datasets demonstrate that the proposed dlADMM algorithm outperforms most of the comparison methods.

1. The lack of global convergence guarantees. Despite the fact that many empirical experiments have shown that ADMM converges in deep learning applications, the underlying theory governing this convergence behavior remains mysterious. This is because a typical deep learning problem consists of a combination of linear and nonlinear mappings, making the optimization problem highly nonconvex. As a result, traditional proof techniques cannot be applied directly.

2. Slow convergence towards solutions. Although ADMM is a powerful optimization framework that can be applied to large-scale deep learning applications, it usually converges slowly to high-accuracy solutions, even on simple datasets. It is often the case that ADMM becomes trapped at a modest solution and hence performs worse than SGD.

3. Cubic time complexity with regard to feature dimensions. The implementation of ADMM is very time-consuming on real-world datasets. ADMM required more than 7,000 cores to train a neural network with just 300 neurons. This computational bottleneck mainly originates from the matrix inversion required to update the weight parameters. Computing an inverse matrix needs further sub-iterations, and its time complexity is approximately O(n^3), where n is the feature dimension.
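To see why the matrix inversion is the bottleneck, consider a minimal NumPy sketch (the quadratic subproblem below is illustrative, not the paper's exact weight subproblem): the exact update solves an n-by-n linear system at O(n^3) cost, whereas a quadratic-approximation step only needs matrix-vector products at O(n^2) per step.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500                          # feature dimension
A = rng.standard_normal((n, n))
H = A @ A.T + np.eye(n)          # positive-definite matrix from a toy subproblem
b = rng.standard_normal(n)

# Exact update: solving H w = b (equivalent to inverting H) costs O(n^3).
w_exact = np.linalg.solve(H, b)

# Quadratic approximation: gradient steps with step size 1/theta, where
# theta upper-bounds the curvature (spectral norm of H). Each step uses
# only matrix-vector products, so it costs O(n^2).
theta = np.linalg.norm(H, 2)
w = np.zeros(n)
for _ in range(200):
    grad = H @ w - b             # gradient of 0.5 w^T H w - b^T w
    w = w - grad / theta

print(np.linalg.norm(w - w_exact))   # gap to the exact solution
```

Each cheap step moves the iterate closer to the exact solution, so a few inexpensive O(n^2) iterations can replace one expensive O(n^3) solve.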

In order to deal with these difficulties simultaneously, a new framework, the deep learning Alternating Direction Method of Multipliers (dlADMM) algorithm, is devised. Specifically, the dlADMM algorithm updates parameters first in a backward direction and then forward. This update scheme propagates parameter information across the whole network and accelerates convergence. It also avoids matrix inversion by using quadratic approximation and backtracking techniques, reducing the time complexity from O(n^3) to O(n^2).
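The backward-then-forward sweep with backtracking can be sketched as follows. This is a minimal illustration on a toy per-layer objective (0.5 * ||W_l - T_l||^2, which stands in for the paper's actual coupled subproblems), showing only the update order and the backtracking step-size rule:

```python
import numpy as np

rng = np.random.default_rng(1)
L_layers = 3
W = [rng.standard_normal((8, 8)) for _ in range(L_layers)]   # layer parameters
T = [rng.standard_normal((8, 8)) for _ in range(L_layers)]   # toy per-layer targets

def obj(l, Wl):
    # Illustrative per-layer objective; the real dlADMM subproblems couple layers.
    return 0.5 * np.sum((Wl - T[l]) ** 2)

def grad(l, Wl):
    return Wl - T[l]

def backtracking_step(l, Wl, t=1.0, beta=0.5):
    # Shrink the step size until a sufficient-decrease condition holds,
    # i.e. the quadratic approximation upper-bounds the trial objective.
    g = grad(l, Wl)
    while True:
        trial = Wl - t * g
        if obj(l, trial) <= obj(l, Wl) - 0.5 * t * np.sum(g ** 2):
            return trial
        t *= beta

for k in range(50):
    # Backward sweep: last layer down to the first.
    for l in reversed(range(L_layers)):
        W[l] = backtracking_step(l, W[l])
    # Forward sweep: first layer up to the last, so information flows both ways.
    for l in range(L_layers):
        W[l] = backtracking_step(l, W[l])
```

The two sweeps per iteration are what let parameter information from the output layer reach the input layer (and back) within a single iteration, rather than over many.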

Experiment Set-up

Dataset. In this experiment, two benchmark datasets were used for performance evaluation: MNIST and Fashion-MNIST. The MNIST dataset has ten classes of handwritten-digit images. It contains 55,000 training samples and 10,000 test samples with 784 features each, and is provided by the Keras library. Unlike MNIST, the Fashion-MNIST dataset has ten classes of clothing images from the website of Zalando, Europe's largest online fashion platform. The Fashion-MNIST dataset consists of 60,000 training samples and 10,000 test samples with 784 features each.

Experiment Settings. We set up a network architecture that contained two hidden layers with 1,000 hidden units each. The Rectified Linear Unit (ReLU) was used as the activation function for both network structures. The loss function was set to the deterministic cross-entropy loss. ν was set to 10−6. ρ was initialized as 10 and was multiplied by 10 every 100 iterations. The number of iterations was set to 200. In this experiment, one iteration means one epoch.

For fully-connected deep neural networks, SGD and its variants and ADMM are state-of-the-art methods and hence served as comparison methods. For the SGD-based methods, the full dataset was used as a single batch for training. All hyperparameters were chosen based on training-set accuracy.
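For reference, the architecture described in the settings above (784 inputs, two ReLU hidden layers of 1,000 units, 10 output classes, cross-entropy loss) can be written as a pure-NumPy forward pass. This is only a sketch with random weights and a synthetic batch, to make the network shape and loss concrete; it is not the trained model or the dlADMM solver:

```python
import numpy as np

rng = np.random.default_rng(42)

# 784 -> 1000 -> 1000 -> 10, matching the experiment settings.
dims = [784, 1000, 1000, 10]
W = [rng.standard_normal((dims[i], dims[i + 1])) * np.sqrt(2.0 / dims[i])
     for i in range(3)]
b = [np.zeros(dims[i + 1]) for i in range(3)]

def forward(x):
    a = x
    for i in range(2):
        a = np.maximum(0.0, a @ W[i] + b[i])    # ReLU hidden layers
    return a @ W[2] + b[2]                       # linear output (logits)

def cross_entropy(logits, labels):
    # Deterministic cross-entropy loss, as in the experiment settings.
    z = logits - logits.max(axis=1, keepdims=True)           # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

x = rng.standard_normal((32, 784))     # a synthetic stand-in batch
y = rng.integers(0, 10, size=32)
logits = forward(x)
loss = cross_entropy(logits, y)
print(loss)
```

The He-style weight scaling (sqrt(2/fan_in)) is an assumption for the sketch; the paper does not specify the initialization here.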

Results

Fig 1. Convergence curves of dlADMM algorithm for MNIST and Fashion MNIST datasets when ρ = 1: dlADMM algorithm converged.

Fig 2. Divergence curves of the dlADMM algorithm for the MNIST and the Fashion MNIST datasets when ρ = 10−6: dlADMM algorithm diverged.

Convergence

First, we show that the proposed dlADMM algorithm converges when ρ is sufficiently large and diverges when ρ is small, for both the MNIST dataset and the Fashion-MNIST dataset.

The convergence and divergence of the dlADMM algorithm are shown in Figures 1 and 2, when ρ = 1 and ρ = 10−6, respectively. In Figures 1(a) and 2(a), the X axis and Y axis denote the number of iterations and the logarithm of the objective value, respectively. In Figures 1(b) and 2(b), the X axis and Y axis denote the number of iterations and the logarithm of the residual, respectively. In Figure 1, both the objective value and the residual decreased monotonically for the MNIST dataset and the Fashion-MNIST dataset, which validates the theoretical guarantees.

Conclusion

First, dlADMM updates parameters from backward to forward in order to transmit parameter information more efficiently. Second, the time complexity is successfully reduced from O(n^3) to O(n^2) by iterative quadratic approximations and backtracking. Finally, dlADMM is guaranteed to converge to a critical point under mild conditions. Experiments on benchmark datasets demonstrate that the proposed dlADMM algorithm outperforms most of the comparison methods.