Applications of deep learning are growing so rapidly that the field has made great headway in areas ranging from computer vision to natural language processing. In many cases, deep learning has outperformed previous approaches. Artificial neural networks (ANNs), the building blocks of deep learning models, are the driving force behind the subject.
Neural networks can perform tasks efficiently, especially those related to classification. This article elaborates on a research paper by noted computer scientists Geoffrey Hinton and Nicholas Frosst, in which they describe in depth how a neural net can be distilled into a soft decision tree.
Interpretation Through Decision Trees
In the paper Distilling a Neural Network Into a Soft Decision Tree, Hinton and Frosst explain why decision trees, although easier to interpret, tend to generalize worse than neural nets. “Unlike the hidden units in a neural net, a typical node at the lower levels of a decision tree is only used by a very small fraction of the training data so the lower parts of the decision tree tend to overfit unless the size of the training set is exponentially large compared with the depth of the tree.”
Mixture Of Experts And Gradient Descent
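To make a neural net's decisions easier to explain, the decision tree model in Hinton and Frosst’s study is trained with gradient descent and is based on a machine learning technique called ‘mixture of experts’. Each inner node is assigned a learned weight vector (a filter) and a bias, which together give the probability of taking one branch rather than the other, and each leaf is assigned a learned distribution over the classes. This setup forms a soft decision tree model, in which the prediction is taken from the leaf with the highest path probability.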
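The forward pass described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the class name `SoftDecisionTree`, the heap-order node layout, the random initialization and the use of a softmax over per-leaf logits are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class SoftDecisionTree:
    """Hypothetical complete binary tree; inner nodes are stored level by level."""

    def __init__(self, n_features, n_classes, depth, seed=0):
        rng = np.random.default_rng(seed)
        self.depth = depth
        n_inner, n_leaves = 2 ** depth - 1, 2 ** depth
        # One weight vector (filter) and one bias per inner node.
        self.W = rng.normal(scale=0.1, size=(n_inner, n_features))
        self.b = np.zeros(n_inner)
        # One learned class distribution per leaf, parameterized by logits.
        self.leaf_logits = rng.normal(scale=0.1, size=(n_leaves, n_classes))

    def branch_probs(self, X):
        # Probability of taking the right branch at every inner node.
        return sigmoid(X @ self.W.T + self.b)

    def path_probs(self, X):
        # Probability of reaching each leaf: product of branch probabilities on its path.
        p_right = self.branch_probs(X)
        batch = X.shape[0]
        reach = np.ones((batch, 1))  # the root is always reached
        start = 0
        for level in range(self.depth):
            n = 2 ** level
            p = p_right[:, start:start + n]
            # Each node splits its mass between its left and right child.
            reach = np.stack([reach * (1 - p), reach * p], axis=2).reshape(batch, 2 * n)
            start += n
        return reach  # shape: (batch, n_leaves)

    def predict(self, X):
        # Use the class distribution of the most probable leaf.
        leaf_dists = softmax(self.leaf_logits)
        best_leaf = self.path_probs(X).argmax(axis=1)
        return leaf_dists[best_leaf].argmax(axis=1)
```

Because every branch is a probability rather than a hard choice, the leaf probabilities for each input sum to one, and the whole tree is differentiable and can be trained with gradient descent.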
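To avoid very poor solutions while training the model, a penalty term is also included, which encourages each inner node to make equal use of its left and right subtrees. The penalty term is computed from the probability distributions across the nodes: it compares the average probability with which a node sends data down each branch against an even split. A hyperparameter in the term sets the strength of the penalty. Because each node sees a smaller share of the training data as the tree is descended from parent to child nodes, these probability estimates become less accurate at lower levels, so the penalty is weakened with depth.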
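The penalty can be sketched as follows. This is an illustrative reading rather than the paper's code: the function name `balance_penalty`, the level-by-level node layout, the default value of `lam` and the `2 ** -level` decay factor are assumptions made for the example.

```python
import numpy as np

def balance_penalty(p_right, depth, lam=0.1, eps=1e-12):
    """Cross-entropy between an even (0.5, 0.5) split and each node's average split.

    p_right: array of shape (batch, 2**depth - 1) holding the probability of
             taking the right branch at every inner node, stored level by level.
    lam:     hyperparameter controlling the strength of the penalty.
    """
    batch = p_right.shape[0]
    reach = np.ones((batch, 1))  # probability of reaching each node on the current level
    penalty = 0.0
    start = 0
    for level in range(depth):
        n = 2 ** level
        p = p_right[:, start:start + n]
        # Average right-branch probability, weighted by how often the node is reached.
        alpha = (reach * p).sum(axis=0) / (reach.sum(axis=0) + eps)
        alpha = np.clip(alpha, eps, 1 - eps)
        cross_entropy = -0.5 * (np.log(alpha) + np.log(1 - alpha))
        # Assumed decay: deeper nodes see less data, so their penalty is weaker.
        penalty += lam * 2.0 ** (-level) * cross_entropy.sum()
        # Propagate reach probabilities to the next level (left child, right child).
        reach = np.stack([reach * (1 - p), reach * p], axis=2).reshape(batch, 2 * n)
        start += n
    return penalty
```

The penalty is smallest when every node splits its data evenly and grows as a node starts sending almost everything down one branch.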
Training On MNIST Dataset
When trained on the MNIST dataset, a soft decision tree guided by a neural net generalizes better than one learned directly from the data; however, there is still a slight dent in accuracy compared with the net itself. A neural net with two convolutional hidden layers has a test accuracy of about 99 percent, while a soft decision tree trained directly on the data is about 94 percent accurate. When the soft decision tree is instead trained on the predictions of that same neural net, its accuracy rises to about 96 percent.
Decision trees rely on a sequence of decisions rather than on the hierarchical features learned by neural nets. Beyond the first layer or two of a network, it is quite difficult to explain why the network behaves the way it does. Many studies have proposed possible explanations but have not backed them with significant evidence.
If the network has very large layers, the decision tree structure can help with overfitting as well as with comprehending the large network. Ultimately, it is always important to check why certain models perform well or poorly for specific tasks or use cases.
Hinton and Frosst use the neural network itself to train a decision tree. When large amounts of data are available, the neural net can label them to create a much larger training set for the tree, which offsets the poor statistical efficiency of decision trees. An important point here is that the neural net is used to train a ‘soft’ decision tree rather than a conventional one. In a soft decision tree, each branch to a child node is taken with a certain probability instead of through a hard yes-or-no choice. Also, all leaves contribute to the loss during training, each weighted by the probability of reaching it.
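The training signal described above can be sketched as a loss in which every leaf contributes in proportion to its path probability. This is one plausible form, assumed here for illustration: the function names, the temperature parameter used to soften the teacher's outputs and the exact shape of the loss are not taken from the paper.

```python
import numpy as np

def softmax(z, temperature=1.0):
    # Higher temperatures give softer (flatter) target distributions.
    z = np.asarray(z, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_tree_loss(path_probs, leaf_dists, targets, eps=1e-12):
    """Path-probability-weighted cross-entropy between targets and leaf distributions.

    path_probs: (batch, n_leaves)     probability of reaching each leaf
    leaf_dists: (n_leaves, n_classes) class distribution stored at each leaf
    targets:    (batch, n_classes)    soft targets, e.g. a neural net's predictions
    """
    # Cross-entropy of every example's target against every leaf distribution.
    cross_entropy = -targets @ np.log(leaf_dists + eps).T  # (batch, n_leaves)
    # Each leaf contributes in proportion to the probability of reaching it.
    return float((path_probs * cross_entropy).sum(axis=1).mean())

# Hypothetical teacher logits for two examples and three classes.
teacher_logits = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])
soft_targets = softmax(teacher_logits, temperature=2.0)
```

Using the net's predicted probabilities as targets, instead of one-hot labels, passes on information about how similar the classes look to the net, which is the idea behind distillation.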