Table of Contents

Birth of Deep Learning

Chapter Resources

  1. Primary Resource: [Week 1 – Lecture: History, motivation, and evolution of Deep Learning](https://youtu.be/0bMe_vCZo30) from Alfredo Canziani's Deep Learning (with PyTorch)

Core Concepts & Learnings

Technical Concepts & Learnings

History

The inspiration for Deep Learning is the brain.

  1. In 1943, McCulloch & Pitts, in the article "A Logical Calculus of the Ideas Immanent in Nervous Activity", proposed that a network of binary neurons can do logic.
  2. In 1948, Norbert Wiener, in his book "Cybernetics: Or Control and Communication in the Animal and the Machine", proposed and promoted the ideas and concepts of cybernetics, optimal filtering, feedback, autopoiesis, and auto-organization, among others.
  3. In 1949, Donald O. Hebb, in his book "The Organization of Behavior", proposed a theory that said:

    Let us assume that the persistence or repetition of a reverberatory activity (or "trace") tends to induce lasting cellular changes that add to its stability. ... When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.

  4. In 1957, Frank Rosenblatt, in his technical report "The Perceptron: A Perceiving and Recognizing Automaton", introduced the perceptron. A more detailed book, "Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms", was published in 1962.
  5. In the 1960s Hubel & Wiesel: The classic experiments are fundamental to our understanding of how neurons along the visual pathway extract increasingly complex information from the pattern of light cast on the retina to construct an image. For one, they showed that there is a topographical map in the visual cortex that represents the visual field, where nearby cells process information from nearby visual fields. Moreover, their work determined that neurons in the visual cortex are arranged in a precise architecture. Cells with similar functions are organized into columns, tiny computational machines that relay information to a higher region of the brain, where a visual image is formed. (Description taken from Knowing Neurons' article "Hubel and Wiesel & the Neural Basis of Visual Perception") | "Receptive fields, binocular interaction and functional architecture in the cat's visual cortex" by D. H. Hubel and T. N. Wiesel

The history of Deep Learning starts in the 1940s, but the field went through a dark period from approximately 1968 to 1984, during which most research on the subject was dropped. This was due to the understanding that neural architectures could only perform a limited set of operations (mostly pattern recognition) with little scope for wider use. Research in Japan, however, continued through these years.

In 1985, the understanding of back-propagation reinvigorated research in the field of deep learning and brought about a very important change in how neurons were used. Neurons were no longer step functions that are either on or off; they became continuous functions that produce a range of values.

This shift can also be attributed to the fact that in the 1960s, multiplying floating-point numbers was an extremely slow operation. So slow, in fact, that it motivated researchers to avoid multiplication and thereby led to addition-based (on/off) neurons and neural networks.

There was another dark period from the mid-1990s to about 2010. Interest was then reinvigorated by the effectiveness of neural networks in speech recognition. After this, the computer vision community picked up neural networks around 2013, and natural language processing followed around 2016. The next fields expected to pick up neural networks are robotics and control systems.

Types of Learning

Supervised Learning

About 90% of applications of Deep Learning, and also a majority of applications of Machine Learning in general, fall under the category of Supervised Learning. The core idea is that the programmer does not encode the knowledge directly in code; instead, learning is supervised by comparing the model's output against the correct answer. The training mechanism tweaks the parameters (weights) of the model every time it produces an incorrect answer and does nothing when the answer is correct. This is repeated until most answers are correct and the model has converged.
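
This error-driven loop can be sketched with the classic perceptron rule, which follows exactly this scheme: tweak the weights on a mistake, do nothing otherwise. The example is a minimal illustration, not from the lecture; the dataset (logical AND) and the specific update rule are assumptions.

```python
import numpy as np

# Toy dataset: logical AND, with labels in {-1, +1}.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])

w = np.zeros(2)   # weights, initially zero
b = 0.0           # bias (threshold term)

# Perceptron rule: adjust the weights only when the answer is wrong,
# do nothing when it is right -- the supervision loop described above.
for epoch in range(20):
    errors = 0
    for xi, yi in zip(X, y):
        pred = 1 if xi @ w + b > 0 else -1
        if pred != yi:               # incorrect answer -> tweak weights
            w += yi * xi
            b += yi
            errors += 1
    if errors == 0:                  # all answers correct -> converged
        break

print(w, b)  # a separating line for AND
```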

Unsupervised Learning

The correct answer may not be available in some cases, whether because such data cannot be collected or because the correct answer is ambiguously defined. In such cases, traditional supervision of the model training is not possible. Model training of this kind is called unsupervised learning.

Standard Paradigm of Pattern Recognition

The standard model of pattern recognition consists of two steps:

  1. Feature extraction: Input is provided to a feature extractor, which identifies, extracts, and quantifies relevant, useful characteristics of the input. This could be identifying beats in a voice pattern, edges in an image, and so on. The output is generally a vector of features.
  2. Trainable classifier: The feature vector is provided to a classifier which (in the case of neural networks) computes a weighted sum. Classification is done by comparing the value of the weighted sum to a threshold.

This model has a significant drawback. The quality of the feature extractor plays a very important role in the whole process, and typically the feature extractor needs to be engineered by the programmer. Feature extraction techniques also tend to be domain-specific and not reusable in another domain (or, in many cases, for other problems within the same domain).
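
A minimal sketch of this two-step pipeline, with a hand-engineered feature extractor for 1-D signals and a weighted-sum-plus-threshold classifier. The features, weights, and threshold here are purely illustrative assumptions, not from the source.

```python
import numpy as np

def extract_features(signal):
    """Hand-engineered extractor (the part the programmer must design):
    two illustrative features of a 1-D signal -- its mean level and its
    total amount of change (a crude 'edge' measure)."""
    signal = np.asarray(signal, dtype=float)
    return np.array([signal.mean(), np.abs(np.diff(signal)).sum()])

def classify(features, weights, threshold):
    """Trainable classifier: weighted sum compared to a threshold."""
    return 1 if features @ weights > threshold else 0

# Hypothetical weights/threshold, tuned here to flag 'jumpy' signals.
weights, threshold = np.array([0.0, 1.0]), 2.0

flat  = [1, 1, 1, 1, 1]
jumpy = [0, 3, 0, 3, 0]
print(classify(extract_features(flat),  weights, threshold))  # 0
print(classify(extract_features(jumpy), weights, threshold))  # 1
```

Note how all the domain knowledge lives in `extract_features`: reusing this pipeline on another problem means re-engineering that function by hand, which is exactly the drawback described above.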

The same paradigm is used to this day by 'traditional' machine learning pipelines.

Deep Learning and Back-Propagation

Deep learning takes the standard paradigm of pattern recognition and converts the feature extractor into multiple layers of modules with tunable parameters (weights) and inherent non-linearity. Non-linearity is important because the composition of two linear operations is itself another linear operation; thus, two linear layers can always be reduced to a single linear layer, and stacking them would gain nothing.

If we have input $X_0$, then a linear layer (layer 1) can be expressed as a matrix operation $X_1 = A X_0$. If we have another linear layer described by the equation $X_2 = B X_1$, then the composition of the two layers can simply be described as $X_2 = B A X_0$. This can be reduced to $X_2 = C X_0$, where the matrix $C$ is defined by $C = B A$. This reduction is not possible if layer 1 is a non-linear layer described by the equation $X_1 = \sigma \left( A X_0 \right)$, where $\sigma$ is an arbitrary non-linear function. Such a function is called an activation function in deep learning terminology.
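
This reduction, and its failure under a non-linearity, can be checked numerically. The matrices below are arbitrary illustrative choices, and ReLU stands in for $\sigma$:

```python
import numpy as np

# Layer 1: X1 = A X0, layer 2: X2 = B X1 -- both linear.
A = np.array([[1.0, -1.0],
              [0.0,  1.0]])
B = np.array([[1.0,  1.0]])
x0 = np.array([1.0, 2.0])

# The composition collapses to a single matrix C = B A ...
x2_stacked = B @ (A @ x0)
C = B @ A
x2_single = C @ x0
print(np.allclose(x2_stacked, x2_single))  # True

# ... but inserting a non-linearity (here ReLU) breaks the reduction.
relu = lambda v: np.maximum(v, 0.0)
x2_nonlinear = B @ relu(A @ x0)
print(np.allclose(x2_nonlinear, x2_single))  # False
```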

Gradient Descent

Any supervised learning can be envisioned as an optimization problem against the 'supervisor'. In this case, the supervisor is a function that provides a quantity describing the discrepancy or difference between the produced output and the actual expected value. This is called an objective function. Training the model simply means changing its parameters until this objective function is optimized (minimized).

The most common general mechanism for tuning the parameters in this type of optimization is iterative: we find the slope (or gradient), calculated as the change in the value of the objective function over a small change in a tunable parameter, and then take a step in the direction opposite to the steepest gradient. We repeat this until no descending direction remains.

This technique is called Gradient Descent (Stochastic Gradient Descent when the gradient is estimated on a single sample or a small batch rather than the full dataset). The equation describing the update is:

$$W_i \leftarrow W_i - \eta \frac{\partial L \left( W, X \right)}{\partial W_i}$$
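
A minimal sketch of this procedure on a toy one-parameter objective, estimating the slope with a small finite difference as described above. The objective, learning rate, and stopping tolerance are illustrative assumptions:

```python
# Gradient descent on a toy objective L(w) = (w - 3)^2,
# whose minimizer is w = 3.

def L(w):
    return (w - 3.0) ** 2

w = 0.0        # initial parameter
eta = 0.1      # learning rate (step size)
h = 1e-6       # small change used to estimate the slope

for _ in range(200):
    grad = (L(w + h) - L(w - h)) / (2 * h)  # dL/dw, numerically
    w = w - eta * grad                      # step against the gradient
    if abs(grad) < 1e-8:                    # no slope left -> stop
        break

print(round(w, 4))  # 3.0, the minimizer of L
```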

Back-Propagation

Back-propagation is a practical application of the chain rule of calculus. It gives us the ability to apply gradient descent across multiple layers. It is described by two sets of equations:

  1. Equation for input gradients:

$$\frac{\partial C}{\partial X_{i-1}} = \frac{\partial C}{\partial X_{i}} \frac{\partial X_{i}}{\partial X_{i-1}} = \frac{\partial C}{\partial X_{i}} \frac{\partial \sigma_{i} \left( X_{i-1} w_i \right)}{\partial X_{i-1}}$$

  2. Equation for weight gradients:

$$\frac{\partial C}{\partial w_{i}} = \frac{\partial C}{\partial X_{i}} \frac{\partial X_{i}}{\partial w_{i}} = \frac{\partial C}{\partial X_{i}} \frac{\partial \sigma_{i} \left( X_{i-1} w_i \right)}{\partial w_{i}}$$
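
These two equations can be exercised on a tiny scalar network $X_i = \sigma(X_{i-1} w_i)$. The choice of $\sigma = \tanh$, the squared-error cost, and all the numbers below are assumptions for illustration; the chain-rule gradients are checked against finite differences:

```python
import numpy as np

# Tiny scalar network: X1 = sigma(X0 * w1), X2 = sigma(X1 * w2),
# cost C = (X2 - y)^2, with sigma = tanh (an assumed activation).
sigma = np.tanh
dsigma = lambda z: 1.0 - np.tanh(z) ** 2   # derivative of tanh

x0, y = 0.5, 0.2
w1, w2 = 0.8, -0.3

# Forward pass.
z1 = x0 * w1; x1 = sigma(z1)
z2 = x1 * w2; x2 = sigma(z2)
C = (x2 - y) ** 2

# Backward pass: apply the chain rule layer by layer.
dC_dx2 = 2 * (x2 - y)
# Weight gradient for layer 2:   dC/dw2 = dC/dX2 * dsigma(z2) * X1
dC_dw2 = dC_dx2 * dsigma(z2) * x1
# Input gradient through layer 2: dC/dX1 = dC/dX2 * dsigma(z2) * w2
dC_dx1 = dC_dx2 * dsigma(z2) * w2
# Weight gradient for layer 1:   dC/dw1 = dC/dX1 * dsigma(z1) * X0
dC_dw1 = dC_dx1 * dsigma(z1) * x0

# Check against a finite-difference estimate of the same gradients.
def cost(w1_, w2_):
    return (sigma(sigma(x0 * w1_) * w2_) - y) ** 2

h = 1e-6
num_dw1 = (cost(w1 + h, w2) - cost(w1 - h, w2)) / (2 * h)
num_dw2 = (cost(w1, w2 + h) - cost(w1, w2 - h)) / (2 * h)
print(np.allclose([dC_dw1, dC_dw2], [num_dw1, num_dw2]))  # True
```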

Pattern Recognition in Deep Learning

Let us take the example of an image with a resolution of 256 × 256 pixels. This corresponds to 256 × 256 = 65,536 input nodes. If we now add one layer of 1000 neurons, the number of weights connecting the input layer to this layer will be 65,536 × 1000 = 65,536,000. With additional layers, this number keeps compounding. Storing and then training such a large number of weights soon becomes impractical for three reasons:

  1. More weights mean each training cycle (iteration) takes longer.
  2. More weights mean more time (more iterations) to find the final weights.
  3. More weights mean more data is needed to prevent overfitting; as a general rule of thumb, the amount of training data should grow at least in proportion to the number of weights.

Given this situation, in order to better understand how to manage such a large number of weights, we go back to the brain for inspiration. We look at the research of Hubel & Wiesel (1962) and other abstractions and research by Fukushima (1982; and 1988, "Neocognitron: A Hierarchical Neural Network Capable of Visual Pattern Recognition"), LeCun (1989 and 1998), Riesenhuber (1999), and Thorpe & Fabre-Thorpe (2001), among others. We learn that vision is a multi-stage process.

  1. We have a few layers of neurons in our retinas. These sit in front of the photoreceptors and pre-process the signal. One key role they play is compressing the information from hundreds of millions of neurons down to about one million fibres.

    Side Note: This is important because otherwise we would have a huge optic-fibre bundle coming out of our eyes, which would prevent our faces from being so small and would also prevent our eyeballs from moving. The side effect is that our vision is limited, as the neurons sit in front of the sensors and partially block the light. We also have a blind spot, because the fibres have to punch a hole in the retina to get from the front, past the sensors, to the brain behind them.

    Side Note 2: Invertebrates do not have the same arrangement and their nerves are behind the eyes.

  2. Based on the article "Seeking Categories in the Brain" by Simon J. Thorpe, Michele Fabre-Thorpe: Monkeys can categorize complex visual stimuli very quickly, with reaction times that average 250 to 260 ms but that can be as short as 180 ms. Depicted is a plausible route between the retina and the muscles of the hand during a categorization task. Information from the retina is relayed by the lateral geniculate nucleus of the thalamus (LGN) before reaching V1, the primary visual cortex. From there, processing continues in areas V2 and V4 of the ventral visual pathway before reaching visual areas in the posterior and anterior inferior temporal cortex (PIT and AIT), which contain neurons that respond specifically to certain objects. The inferior temporal cortex projects to a variety of areas, including the prefrontal cortex (PFC), which contains the visually responsive neurons that categorize objects (1). To reach the muscles in the hand, signals probably need to pass via the pre-motor cortex (PMC) and primary motor cortex (MC) before reaching the motor neurons of the spinal cord. For each processing stage, two numbers (in milliseconds) are given: The first is an estimate of the latency of the earliest neuronal responses to a flashed stimulus, whereas the second provides a more typical average latency.

  3. Expanding on the previous points, we see from the timing that fast recognition happens without any recurrence.
  4. Further understanding tells us that:
    1. Individual neurons react only to a small part of the visual field. The neurons are arranged in a retinotopic way.
    2. A group of neurons that all react to the same part of the visual field react to different patterns; for example, one neuron reacts to a vertical edge while another reacts to a slightly tilted edge.
    3. This can be represented as a block of neurons that react to different patterns in a small section of the visual field, with the block replicated to span the complete visual field.
    4. These learnings lead us to the ideas of convolution and pooling networks.
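
The idea of a small block of weights replicated across the visual field (convolution), followed by pooling, can be sketched as follows. The input image, the vertical-edge kernel, and the pooling size are illustrative assumptions:

```python
import numpy as np

# An 8x8 'visual field' whose right half is bright: a vertical edge.
image = np.zeros((8, 8))
image[:, 4:] = 1.0
# One small 3x3 block of weights that responds to vertical edges,
# replicated (shared) across every position of the visual field.
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

def conv2d_valid(img, k):
    """Slide the same small weight block over every position."""
    kh, kw = k.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * k)
    return out

def max_pool2(x):
    """2x2 max pooling: keep the strongest response in each block."""
    h, w = x.shape[0] // 2, x.shape[1] // 2
    return x[:2*h, :2*w].reshape(h, 2, w, 2).max(axis=(1, 3))

fmap = conv2d_valid(image, kernel)
pooled = max_pool2(fmap)
print(int(pooled.max()))  # 3: the strongest response sits on the edge
# Parameter comparison: the shared filter has only 9 weights, versus
# 64 * 36 = 2304 weights for a dense layer mapping the 8x8 input to
# the same 6x6 output.
```

Weight sharing is what tames the parameter counts discussed above: the number of weights depends on the filter size, not on the size of the visual field.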

To Do

  1. Gradient Descent
  2. Back Propagation
  3. Publication Reference
  4. Publication review, abstraction and understanding
  5. Retinotopy
