Artificial Intelligence is a very “hot” field in Computer
Science right now. The main aim is to teach a computer to think and act smart.
Machine Learning is a field that gives computers the ability to learn new
“concepts” on their own. To explain why this is important, let’s consider a task
which is trivial for humans. If I show you a picture of a cat, you can
instantly recognize that it’s a cat. But it turns out that this is a difficult
task for a computer to perform. So the task is, given a set of images, find out
which category (car, dog, cat, …) each image belongs to. This task is difficult for a computer
because:

A computer sees an image as a bunch of numbers (as shown
below) and we want the computer to give us some estimate of what it thinks
about the given image. This task is called image classification.

(image from http://cs231n.github.io/ : stanford course on visual recognition)

Other issues:

As the illumination changes, meaning the image gets
brighter or duller, those numbers shown above change. There might also be other objects hiding some details of the object
in the image, as shown below. (Plot twist: it’s a dog)

(image from http://cs231n.github.io/ : stanford course on visual recognition)

Another problem is that objects in the same category can look very
different. If you look closely at the images shown below, you can convince yourself that both of
them belong to the “cat” category. So you can get a general idea why the task is
difficult for a computer: not because the cats shown here look odd, but
because these images, even though they belong to the same category, have different
representations when seen by a computer (basically, the numbers above are different for all the images).
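To make the "bunch of numbers" idea concrete, here is a minimal sketch. The 2x2 grayscale image and the `brighten` helper are made up for illustration; real images are far larger and usually have three color channels.

```python
# A made-up 2x2 grayscale "image": each number is a pixel brightness (0-255).
image = [
    [12, 200],
    [90, 45],
]

def brighten(img, amount):
    """Return a copy with every pixel raised by `amount`, capped at 255."""
    return [[min(255, pixel + amount) for pixel in row] for row in img]

def count_changed_pixels(a, b):
    """Count positions where the two images hold different numbers."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))

brighter = brighten(image, 30)
print(brighter)                               # [[42, 230], [120, 75]]
print(count_changed_pixels(image, brighter))  # 4
```

To a human both versions show the same picture, but from the computer's point of view every single number changed.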

**Machine Learning Background**

A few concepts are needed before moving to Deep Networks.

There is no straightforward way to just do this; it’s not
like arranging a few numbers, say 2,1,4,3 as 1,2,3,4. Instead there are
“concepts” we want the computer to learn on its own.

But you might be thinking, how can we make a computer learn
this by itself? To understand this process, imagine you are teaching a kid how
to distinguish between an Oreo and an orange, and the kid is the stupidest kid
you have ever met. You tell him that if the item has two chocolate wafers with
a white creme filling in between, it is likely to be an Oreo, and in the same way
you can describe an orange. Basically you are providing “features”, i.e. {“two
chocolate wafers”, “white creme filling”} for Oreo and {“Orange Color”,
“Circular”} for orange, to help him understand the difference.

But you would be thinking, Jay, you just said that computers
see an image as numbers, so how will a computer get to know if there are two
chocolate wafers and white creme? You are absolutely correct. But it is important to first introduce some fancy terms which people normally use.

**Linear Hypothesis:** Consider the above oreo-orange example. Imagine if I were to give you two numbers describing an oreo and two numbers describing an orange. These two numbers are the “features” which I mentioned earlier, but represented such that a computer understands them (remember the image is a bunch of numbers, and we somehow selected two numbers which describe each category). In practice, we think of all the numbers representing the image as features, but consider two numbers for now. As we have two numbers, I can plot them, treating the first number as the ‘x’ coordinate and the second number as the ‘y’ coordinate. Assuming no two oreos/oranges are the same, we will end up with different values for the two numbers. Think of it as the oreos being deformed at different positions and the oranges having different sizes.

As shown above, if we plot the oreos and oranges, we see that
we can separate them using a straight line, i.e. anything below the red line is
an oreo and anything above is an orange. This is called “Linear Classification” and
the function doing this is a linear hypothesis. And a piece of computer code which
does this classification is called a classifier.

Just to express the notion more precisely, the linear
hypothesis comes up with a function like $h(x) = w_0 + w_1 x_1 + w_2 x_2$ (just a fancy way of writing a straight line), where $\{x_1, x_2\}$ are the features and $\{w_0, w_1, w_2\}$ are the parameters we wish to learn from our dataset.
The two most used classifiers are Support Vector Machine (SVM)
and Softmax Classifier.
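As a rough sketch of the linear hypothesis above: the weights below are hypothetical values picked by hand (nothing is learnt here), and the rule "h(x) >= 0 means orange" is just one possible convention for deciding which side of the line a point falls on.

```python
def linear_hypothesis(x1, x2, w0, w1, w2):
    """h(x) = w0 + w1*x1 + w2*x2"""
    return w0 + w1 * x1 + w2 * x2

def classify(x1, x2, weights):
    """Points with h(x) >= 0 lie on or above the line; call those oranges."""
    w0, w1, w2 = weights
    return "orange" if linear_hypothesis(x1, x2, w0, w1, w2) >= 0 else "oreo"

# Hypothetical weights describing the line x2 = x1, i.e. h(x) = -x1 + x2.
weights = (0.0, -1.0, 1.0)

print(classify(1.0, 3.0, weights))  # orange (the point is above the line)
print(classify(3.0, 1.0, weights))  # oreo   (the point is below the line)
```

Learning, in this picture, means finding good values for `weights` automatically instead of choosing them by hand.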

**Non-Linear Hypothesis:** You might have figured this out by now, life is not fair. In practice, we would want something like a curve to separate categories.

The red curve separates the two classes.

In order to learn a non-linear function, we need to
introduce non-linearity, which gives us this kind of a curve. One example of a
hypothesis can be a function like $h(x) = w_0 + w_1 x_1^2 + w_2 x_1 x_2 + w_3 x_2^2$, where $\{x_1, x_2\}$ are the features and $\{w_0, w_1, w_2, w_3\}$ are the parameters we wish to learn from our dataset. This is a very simple non-linear hypothesis which just contains second order terms.

**Learning Parameters:** Before describing the process, introduction of two more terms is required:

**Score Function**: This maps the raw data (the numbers in our image) to a score of a category i.e. score for it being an oreo/orange.

**Loss Function**: This function describes our dissatisfaction with the output. For example, if we were expecting the score of the orange category to be 1 (we know that the item is an orange), but it turned out to be just 0.2, we are kind of pissed off with this result. The loss function measures how dissatisfied we are with the output. So we want to improve the values of the parameters so that this score increases and the loss goes down. This is the Learning part.
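Here is a small sketch of the two terms. The text does not commit to a particular loss, so the squared error used below is an assumption chosen for simplicity, and the features, weights and bias are made-up numbers that happen to produce the score of 0.2 from the example.

```python
def score(features, weights, bias):
    """Score function: weighted sum of the raw numbers plus a bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def squared_loss(predicted, expected):
    """Loss function (assumed squared error): grows as the score drifts from the target."""
    return (expected - predicted) ** 2

# Hypothetical orange: we expect a score of 1, the current parameters give about 0.2.
orange_score = score(features=[0.5, 0.1], weights=[0.2, 1.0], bias=0.0)
print(orange_score)
print(squared_loss(orange_score, 1.0))  # large dissatisfaction
print(squared_loss(0.9, 1.0))           # a better score gives a smaller loss
```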

So to learn the parameters in our hypothesis, we try to find
values of the parameters $w_i$ that minimize our dissatisfaction with the output, or in other words minimize the loss function. We do this over all the examples in our dataset. One algorithm which does this is called **gradient descent**. If you don’t know much about Machine Learning, just keep this in mind: everything eventually comes down to gradient descent.
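A minimal sketch of gradient descent on a toy problem follows. Everything specific in it is an assumption for illustration: the four data points (generated from y = 2x + 1), the mean squared error loss, the learning rate of 0.05 and the 2000 steps.

```python
# Toy dataset: (feature, expected score) pairs generated from y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

def mean_loss(w0, w1):
    """Average squared error of h(x) = w0 + w1*x over the dataset."""
    return sum((w0 + w1 * x - y) ** 2 for x, y in data) / len(data)

def gradient(w0, w1):
    """Partial derivatives of mean_loss with respect to w0 and w1."""
    n = len(data)
    g0 = sum(2 * (w0 + w1 * x - y) for x, y in data) / n
    g1 = sum(2 * (w0 + w1 * x - y) * x for x, y in data) / n
    return g0, g1

w0, w1 = 0.0, 0.0        # start from an arbitrary guess
learning_rate = 0.05     # hypothetical step size
for step in range(2000):
    g0, g1 = gradient(w0, w1)
    w0 -= learning_rate * g0   # step against the gradient
    w1 -= learning_rate * g1

print(round(w0, 3), round(w1, 3))  # 1.0 2.0
```

Each step nudges the parameters in the direction that reduces the loss, and after enough steps they settle near the values that generated the data.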

**Neural Networks:** To learn the non-linear hypothesis, people usually model the problem using neural networks. Some research suggests that our brain may rely on one general learning algorithm, so we can also think about just using one model that does everything, instead of having a lot of components. Our brain has a large number of computational units called neurons. You might have seen the image below in high school.

(image from http://cs231n.github.io/ : stanford course on visual recognition)

Basically, dendrites take in the inputs, and the neuron performs
computation on the input and sends a spike (output) along the axon if the input
signal excites (activates) the neuron.

Now let’s take this biological process and map it to a
computational one. We have features $x_i$ as input and $w_i$ are the parameters (also referred to as weights). A neuron is activated if its weighted input satisfies a condition set by something called an activation function. One example of such an activation function might be $w_1 x_1 + w_2 x_2 + w_3 x_3 + b \geq 0.5$, i.e. if all inputs along with their weights contribute a quantity of at least 0.5, then the neuron will fire/send output, otherwise it will not send any output. The ‘b’ in the previous equation is equivalent to $w_0$ and is known as the bias, which is another parameter.
(image from http://cs231n.github.io/ : stanford course on visual recognition)
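The threshold rule above can be sketched in a few lines. The function name `neuron_fires` and the input and weight values are made up for illustration; only the rule "weighted sum plus bias is at least 0.5" comes from the text.

```python
def neuron_fires(inputs, weights, bias, threshold=0.5):
    """Return True if w1*x1 + w2*x2 + w3*x3 + b reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return total >= threshold

weights = [0.4, 0.9, 0.3]   # hypothetical weights
bias = 0.0

print(neuron_fires([1.0, 0.0, 1.0], weights, bias))  # True  (0.4 + 0.3 = 0.7)
print(neuron_fires([0.0, 0.0, 1.0], weights, bias))  # False (only 0.3)
```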

Common activation functions are the sigmoid function, tanh and ReLU
(Rectified Linear Unit). Sigmoid and tanh have drawbacks (their gradients become very small for large inputs, which slows down learning), so ReLU is
the new standard now. ReLU's job is to check whether the output of the neuron is more than 0: if so, it lets the output pass, otherwise the neuron has zero output. It is basically a max function, max(0, X), where X is the output of the neuron.

The activation functions discussed above provide the
non-linearity we need for our hypothesis.
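For reference, the three activation functions named above can be written with nothing but the standard `math` module. This is a plain sketch of the textbook definitions, not code from any particular library.

```python
import math

def sigmoid(x):
    """Squashes any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Squashes any number into the range (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """max(0, x): passes positive values through, zeroes out the rest."""
    return max(0.0, x)

for value in (-2.0, 0.0, 3.0):
    print(value, sigmoid(value), tanh(value), relu(value))
```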

(image from http://cs231n.github.io/ : stanford course on visual recognition)

The image above shows a sample neural network. Neural
Networks are organized in layers, with each layer containing multiple neurons.
It has one input layer which contains our input features, an output layer
which predicts the category (oreo/orange), and any layer in the middle is
called a hidden layer. There can be multiple hidden layers. Each neuron (the
small circles within each layer) performs the operation we discussed above. Note
that each neuron is connected to all the neurons in the previous layer, but not
to the neurons within its own layer. So now you can think of a neural network as the hidden layer
taking input features, performing complex operations on them and then
generating output for each output category. The output of one neuron goes as
input to the neurons in the next layer. Assuming the pictured network has 3 inputs,
4 hidden neurons and 2 outputs, it has 3*4 + 4*2 = 20 weights + 4 + 2 = 6
biases (one per neuron) = 26 total parameters.
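The layer-by-layer computation can be sketched as below. The layer sizes (3 inputs, 4 hidden neurons, 2 outputs) are an assumption about the pictured network, the weights are random placeholders rather than learnt values, and ReLU in the hidden layer is one possible choice of activation.

```python
import random

def relu(x):
    return max(0.0, x)

def dense_layer(inputs, weights, biases, activation=None):
    """Each neuron: weighted sum of ALL inputs from the previous layer plus its own bias."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(total) if activation else total)
    return outputs

def init_layer(n_in, n_out, rng):
    """Random placeholder weights (one row per neuron) and zero biases."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

rng = random.Random(0)
hidden_w, hidden_b = init_layer(3, 4, rng)   # 3 inputs -> 4 hidden neurons
out_w, out_b = init_layer(4, 2, rng)         # 4 hidden -> 2 category scores

features = [0.5, -1.2, 3.0]                  # made-up input features
hidden = dense_layer(features, hidden_w, hidden_b, activation=relu)
scores = dense_layer(hidden, out_w, out_b)   # one raw score per category

n_weights = sum(len(row) for row in hidden_w + out_w)
n_biases = len(hidden_b) + len(out_b)
print(scores)
print(n_weights, n_biases, n_weights + n_biases)  # 20 6 26
```

Counting one bias per neuron, a network of this shape has 20 weights and 6 biases.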

These parameters are learnt in the following way:

There are two passes through the network, the forward pass and
backpropagation.

Forward Propagation – basically computes the score
for each category (remember the score function).

Backpropagation – to minimize the loss function, we need to
perform gradient descent. In order to perform gradient descent, we need to
compute gradients using backpropagation. Just remember that this step is required;
there is no need to go into the details now.
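For readers who do want a peek, here is a minimal sketch of the idea on a single neuron. The sigmoid activation, the squared loss and all the numbers are assumptions made for illustration; a real network repeats the same chain-rule step layer by layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, target):
    """Forward pass: score -> sigmoid -> squared loss."""
    return (sigmoid(w * x + b) - target) ** 2

def backprop(w, b, x, target):
    """Backward pass: apply the chain rule from the loss back to w and b."""
    z = w * x + b
    a = sigmoid(z)
    dloss_da = 2 * (a - target)      # how the loss changes with the activation
    da_dz = a * (1 - a)              # derivative of the sigmoid
    dloss_dz = dloss_da * da_dz
    return dloss_dz * x, dloss_dz    # dL/dw, dL/db

w, b, x, target = 0.3, -0.1, 2.0, 1.0
grad_w, grad_b = backprop(w, b, x, target)

# Sanity check: compare against a numerical estimate of the same slope.
eps = 1e-6
numeric_w = (loss(w + eps, b, x, target) - loss(w - eps, b, x, target)) / (2 * eps)
print(grad_w, numeric_w)
```

The two printed numbers should agree closely, which is a common way to check that a hand-written gradient is correct.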

So if you stack up one more hidden layer, one can expect
the neural network to learn even more complex non-linear functions. But after
adding more than 5-6 hidden layers, this kind of network does not improve a lot;
this is not the case with deep networks, as discussed later.

**Back to image classification:**

In computer vision, we consider each pixel value to be a
feature. Imagine you are given a 200x200 image. Each pixel has three
values: Red, Green and Blue (RGB). So we have a total of 200*200*3 = 120,000
weights for each neuron in the first hidden layer. And since the hidden layer will
have many neurons, you can see how this network would have millions of
parameters in the first hidden layer alone. This architecture is very wasteful and
has a lot of parameters, which would quickly lead to “overfitting”. In short, do
not use this architecture.
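The arithmetic above is easy to check. In this sketch the hidden layer size of 100 neurons is a hypothetical number picked only to show how quickly the count grows.

```python
def dense_parameters(n_inputs, n_neurons):
    """Weights (one per input per neuron) plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons

inputs = 200 * 200 * 3          # one feature per pixel value
hidden_neurons = 100            # hypothetical layer size

print(inputs)                                    # 120000
print(dense_parameters(inputs, hidden_neurons))  # 12000100
```

Even a modest hidden layer already needs about 12 million parameters for a single 200x200 color image.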

**Convolutional Neural Networks(CNNs)**

CNNs are designed especially for images. These networks have
the same forward and backward propagation as the normal neural nets, but are
less wasteful. These networks have a special type of layer called a
Convolution layer, in which each neuron looks at a different window/local
region in the image. All the neurons that apply the same filter share the same weights, so you can see
how parameter sharing reduces the total number of parameters. A convolutional layer is shown below, where the
pink layer is the input volume: a 32x32 image with 3 channels – Red, Green
and Blue.

The features learnt by these networks are very good compared to the normal neural nets: the first few layers learn generic features (e.g. edges), and as we go deeper the layers learn features more specific to the training dataset.

(image from http://cs231n.github.io/ : stanford course on visual recognition)
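To illustrate the sliding window and the shared weights, here is a minimal single-channel sketch. The 4x4 image, the 2x2 `edge_kernel` and the 5x5 filter size in the parameter count are all made up for illustration; real convolution layers work on multi-channel volumes with many filters.

```python
def convolve2d(image, kernel):
    """Slide one shared square kernel over every window of a single-channel image.

    As in most CNN code, the kernel is not flipped (strictly a cross-correlation).
    """
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = sum(
                kernel[di][dj] * image[i + di][j + dj]
                for di in range(k) for dj in range(k)
            )
            row.append(total)
        output.append(row)
    return output

# Dark left half, bright right half: a vertical edge in the middle.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
edge_kernel = [[-1, 1], [-1, 1]]   # hypothetical vertical-edge detector

print(convolve2d(image, edge_kernel))  # [[0, 18, 0], [0, 18, 0], [0, 18, 0]]

# Parameter sharing on a 32x32x3 input, assuming a single 5x5 filter:
shared_parameters = 5 * 5 * 3 + 1      # one small filter plus its bias
fully_connected_weights = 32 * 32 * 3  # weights for ONE fully connected neuron
print(shared_parameters, fully_connected_weights)  # 76 3072
```

The same four kernel weights are reused at every position, and the output lights up only where the edge is.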

**Architecture**: CNNs have different types of layers – the Convolutional layer as discussed above; ReLU, which adds non-linearity; the Pooling layer, which downsamples the output to reduce the number of parameters needed in later layers; and the fully connected layer, which is the same as the ones in an ordinary neural network, where each neuron is connected to all the neurons in the previous layer.
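As a sketch of what downsampling means, the snippet below implements 2x2 max pooling, one common kind of pooling. The choice of max pooling, the window size and the example numbers are assumptions; the text above does not specify them.

```python
def max_pool_2x2(feature_map):
    """Keep only the largest value in each non-overlapping 2x2 window."""
    pooled = []
    for i in range(0, len(feature_map) - 1, 2):
        row = []
        for j in range(0, len(feature_map[0]) - 1, 2):
            row.append(max(
                feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1],
            ))
        pooled.append(row)
    return pooled

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
print(max_pool_2x2(feature_map))  # [[4, 2], [2, 8]]
```

The 4x4 input shrinks to 2x2, so any layer that comes after it has four times fewer values to process.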

**Rise of Deep Learning:**

A few reasons are:

1) With networks like CNNs, it is observed that adding more layers tends to improve performance

2) Deep Networks learn features which are very good and hence they give better performance

3) With computing power delivered through GPUs it has become easier to train deeper networks

Andrew Ng explains basics of the rise of deep learning through this graph below:

So now everyone is trying to train networks which are very
deep to gain more accuracy. With Residual learning researchers have been able
to train a network which is 152 layers deep.

It turns out that deep networks are able to learn better
features to differentiate between categories, and that is the reason why
everyone is talking about deep nets.

Here are the numbers of parameters used in some famous deep learning
architectures:

AlexNet – First work which popularized CNNs in computer vision.
It has about 60 M parameters

VGGNet – about 140 M

ResNet – Residual Networks. As mentioned above, they were able to train 152
layers, which is 8 times deeper than VGGNets. Parameter count does not grow in step with depth, though: the 152-layer ResNet has roughly 60 M parameters, fewer than VGGNet.

Hence, you need a large amount of data to estimate these parameters. Just imagine finding tens of millions of numbers, which also end up minimizing our loss function!

Problems with just stacking layers:

It is observed that if one just keeps on stacking one layer after another, beyond a certain depth the training error increases with the total number of layers. The deeper network has to learn more complex functions and becomes harder to optimize. This problem is addressed by the authors of ResNets.