Saturday, June 25, 2016

Rise of Deep Nets

Artificial Intelligence is a very “hot” field in Computer Science right now. Its main aim is to teach a computer to think and act smart. Machine Learning is a field that gives computers the ability to learn new “concepts” on their own. To explain why this is important, let’s consider a task which is trivial for humans. If I show you a picture of a cat, you can instantly recognize that it’s a cat. But it turns out that this is a difficult task for a computer to perform. So the task is: given a set of images, find out which category (car, dog, cat, …) each image belongs to. This task is difficult for a computer because:
A computer sees an image as a bunch of numbers (as shown below), and we want the computer to give us some estimate of what it thinks about the given image. This task is called image classification.

                 (image from : stanford course on visual recognition)
Other issues:
As the illumination changes, meaning if the image is brighter or duller, those numbers shown above change. There might also be objects hiding some details of the object in the image, as shown below. (Plot twist: it’s a dog)

                   (image from : stanford course on visual recognition)

Another problem is, if you look very closely at the images shown below, you can convince yourself that both of them belong to the “cat” category. So you can get a general idea of why this task is difficult for a computer: these images, even though they belong to the same category, have different representations when seen by a computer (basically, the numbers above are different for each image).

Machine Learning Background

A few concepts are needed before moving on to Deep Networks.

There is no straightforward way to just do this; it’s not like sorting a few numbers, say 2, 1, 4, 3, into 1, 2, 3, 4. Instead, there are “concepts” we want the computer to learn on its own.

But you might be thinking, how can we make a computer learn this by itself? To understand this process, imagine you are teaching a kid how to distinguish between an Oreo and an orange, and the kid is the slowest learner you have ever met. You tell him that if the item has two chocolate wafers with a white creme filling in between, it is likely to be an Oreo, and in the same way you can describe an orange. Basically, you are providing “features” i.e. {“two chocolate wafers”, “white creme filling”} for Oreo and {“Orange Color”, “Circular”} for orange to help him understand the difference.

But you would be thinking: Jay, you just said that computers see an image as numbers, so how will a computer know if there are two chocolate wafers and a white creme filling? You are absolutely correct. But it is important to first introduce some fancy terms which people normally use.

Linear Hypothesis: Consider the above oreo-orange example. Imagine if I were to give you two numbers describing an oreo and two numbers describing an orange. These two numbers are the “features” which I mentioned earlier, but represented such that a computer understands them (remember the image is a bunch of numbers, and we somehow selected two numbers which describe each category). In practice, we think of all the numbers representing the image as features, but consider just two numbers for now. As we have two numbers, I can plot them, taking the first number as the ‘x’ coordinate and the second number as the ‘y’ coordinate. Assuming no two oreos/oranges are the same, we will end up with different values for the two numbers. Think of it as the oreos being deformed at different positions and the oranges having different sizes.

As shown above, if we plot the oreos and oranges, we see that we can separate them using a straight line, i.e. anything below the red line is an oreo and anything above is an orange. This is called “Linear Classification”, and the function doing this is a linear hypothesis. A piece of computer code which performs this classification is called a classifier.

Just to express the notion more precisely, the linear hypothesis comes up with a function like h(x) = w0 + w1x1 + w2x2 (just a fancy way of writing a straight line), where {x1, x2} are the features and {w0, w1, w2} are the parameters we wish to learn from our dataset.
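To make this concrete, here is a minimal sketch of the linear hypothesis h(x) = w0 + w1x1 + w2x2 in code. The weights, feature values, and the decision rule (score below 0 means oreo, otherwise orange) are all made up for illustration; a real classifier would learn the weights from data.

```python
# Sketch of the linear hypothesis h(x) = w0 + w1*x1 + w2*x2.
# Weights and features here are illustrative, not learned.

def linear_hypothesis(w0, w1, w2, x1, x2):
    """Score for one example with two features."""
    return w0 + w1 * x1 + w2 * x2

# Arbitrary rule for this sketch: score below 0 -> oreo, otherwise orange.
score = linear_hypothesis(-1.0, 0.5, 0.5, 3.0, 1.0)  # 0.5*3 + 0.5*1 - 1 = 1.0
label = "orange" if score >= 0 else "oreo"
print(score, label)
```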
Two most used classifiers are Support Vector Machine (SVM) and Softmax Classifier.

Non-Linear Hypothesis: You might have figured this out by now, life is not fair. In practice, we would want something like a curve to separate categories.

The red curve separates the two classes.

In order to learn a non-linear function, we need to introduce non-linearity, which gives us this kind of a curve. One example of such a hypothesis is a function like h(x) = w0 + w1x1^2 + w2x1x2 + w3x2^2, where {x1, x2} are the features and {w0, w1, w2, w3} are the parameters we wish to learn from our dataset. This is a very simple non-linear hypothesis which contains just second-order terms.
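The second-order hypothesis above can be sketched in a few lines. The weights below are chosen by hand purely for illustration; with w = (-1, 1, 0, 1) the decision boundary h(x) = 0 happens to be the unit circle x1^2 + x2^2 = 1, so points inside the circle score negative and points outside score positive.

```python
# Sketch of h(x) = w0 + w1*x1^2 + w2*x1*x2 + w3*x2^2 with hand-picked weights.

def quadratic_hypothesis(w, x1, x2):
    w0, w1, w2, w3 = w
    return w0 + w1 * x1**2 + w2 * x1 * x2 + w3 * x2**2

w = (-1, 1, 0, 1)  # boundary h(x) = 0 is the circle x1^2 + x2^2 = 1
print(quadratic_hypothesis(w, 0.5, 0.5))  # -0.5, inside the circle
print(quadratic_hypothesis(w, 2.0, 0.0))  # 3.0, outside the circle
```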

Learning Parameters: Before describing the process, introduction of two more terms is required:

Score Function: This maps the raw data (the numbers in our image) to a score of a category i.e. score for it being an oreo/orange.

Loss Function: This function describes our dissatisfaction with the output. For example, if we were expecting the score of the orange category to be 1 (we know that the item is an orange), but it turned out to be just 0.2, we are kind of pissed off with this result. This function measures how dissatisfied we are with the output. So we want to make the values of the parameters better so that this score improves. This is the Learning part.

So to learn the parameters in our hypothesis, we try to find values of the parameters wi that minimize our dissatisfaction with the output, or in other words minimize the loss function. We do this over all the examples in our dataset. One algorithm which does this is called gradient descent. If you don’t know much about Machine Learning, just keep this in mind: everything eventually comes down to gradient descent.
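Here is a toy run of gradient descent, just to show the mechanic: a one-parameter hypothesis h(x) = w*x is fit to made-up data (where the true w is 2) by repeatedly stepping against the gradient of a squared-error loss. The data, learning rate, and iteration count are all illustrative choices.

```python
# Toy gradient descent: find w so that h(x) = w*x fits the data,
# minimizing the mean squared-error loss. Data is made up; true w is 2.

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (feature, target) pairs

def loss(w):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def gradient(w):
    # Derivative of the loss with respect to w.
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

w = 0.0
for _ in range(200):
    w -= 0.05 * gradient(w)  # step against the gradient

print(round(w, 3))  # converges to 2.0
```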

Neural Networks: To learn the non-linear hypothesis, people usually model the problem using neural networks. Some research suggests that our brain may use one single learning algorithm, so we can also think about just using one model that does everything, instead of having a lot of components. Our brain has a large number of computational units called neurons. You might have seen the image below in high school.
                    (image from : stanford course on visual recognition)

Basically, dendrites take in the inputs, the neuron performs a computation on the input and sends a spike (output) along the axon if the input signal excites (activates) the neuron.

Now let’s take this biological process and map it to a computational one. We have features xi as input, and wi are the parameters (also referred to as weights). A neuron is activated if the output of the neuron satisfies something called an activation function. One example of such an activation function might be w1x1 + w2x2 + w3x3 + b ≥ 0.5, i.e. if all the inputs along with their weights contribute a quantity with value more than 0.5, then the neuron will fire/send output; otherwise it will not send any output. The ‘b’ in the previous equation is equivalent to w0 and is known as the bias, which is another parameter.
                    (image from : stanford course on visual recognition)
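The threshold rule above can be written as a tiny function. The weights, inputs, and bias below are made-up values used only to show the two possible outcomes.

```python
# One artificial neuron with the threshold rule from the text:
# fire if w1*x1 + w2*x2 + w3*x3 + b >= 0.5. Values are illustrative.

def neuron_fires(weights, inputs, bias, threshold=0.5):
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return total >= threshold

print(neuron_fires([0.4, 0.3, 0.2], [1.0, 1.0, 0.0], 0.0))  # True, 0.7 >= 0.5
print(neuron_fires([0.4, 0.3, 0.2], [0.0, 0.0, 1.0], 0.0))  # False, 0.2 < 0.5
```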

Common activation functions are the sigmoid function, tanh, and ReLU (Rectified Linear Unit). Sigmoid and tanh have drawbacks, so ReLU is the new standard now. ReLU’s job is to check if the output of the neuron is more than 0: if so, let the output pass; otherwise the neuron has zero output. It is basically a max function, max(0, X), where X is the output of the neuron.
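ReLU really is just the max function described above; in code it is a one-liner:

```python
# ReLU: pass positive outputs through unchanged, otherwise output zero.

def relu(x):
    return max(0.0, x)

print(relu(3.5))   # 3.5
print(relu(-2.0))  # 0.0
```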

The activation functions discussed above provide the non-linearity we need for our hypothesis.
                (image from : stanford course on visual recognition)

The image above shows a sample neural network. Neural networks are organized in layers, with each layer containing multiple neurons. There is one input layer which contains our input features, an output layer which predicts the category (oreo/orange), and any layer in the middle is called a hidden layer. There can be multiple hidden layers. Each neuron (the small circles within each layer) performs the operation we discussed above. Note that each neuron is connected to all the neurons in the previous layer, but not to neurons within its own layer. So you can think of the neural network as the hidden layer taking the input features, performing complex operations on them, and then generating an output for each output category. The output of one neuron goes as input to the neurons in the next layer. This small network has 20 weights + 2 biases = 22 total parameters.
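As a rough sketch of this layered computation, here is a tiny fully connected network with made-up sizes (2 input features, 3 hidden ReLU neurons, 2 output scores) and random weights standing in for learned ones. Each layer does exactly the per-neuron operation described above: a weighted sum plus a bias, optionally passed through an activation.

```python
# Forward pass through a tiny 2 -> 3 -> 2 fully connected network.
# Weights are random placeholders; a real network would learn them.
import random

random.seed(0)

def relu(x):
    return max(0.0, x)

def layer(inputs, weights, biases, activation=None):
    outs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b
        outs.append(activation(z) if activation else z)
    return outs

x = [0.5, -1.2]  # input features
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
b1 = [0.0, 0.0, 0.0]
W2 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
b2 = [0.0, 0.0]

hidden = layer(x, W1, b1, relu)   # hidden layer adds the non-linearity
scores = layer(hidden, W2, b2)    # one raw score per category (oreo/orange)
print(len(scores))  # 2
```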

These parameters are learnt in the following way:

There are two passes through the network: the forward pass and backpropagation.

Forward Propagation – basically computes the category score for each category (remember the score function).

Backpropagation – to minimize the loss function, we need to perform gradient descent, and in order to perform gradient descent, we need to compute gradients using backpropagation. Just remember that this step is required; there is no need to go into the details now.

So if you stack up one more hidden layer, one can expect the neural network to learn even more complex non-linear functions. But after adding more than 5-6 hidden layers, the network does not improve a lot. This is not the case with deep networks, as discussed later.

Back to image classification:

In computer vision, we consider each pixel value to be a feature. Imagine you are given a 200x200 image. Each pixel has three values: Red, Green and Blue (RGB). So we have a total of 200*200*3 = 120,000 weights for each neuron in the first hidden layer. And the hidden layer will have a lot more neurons, so you can see how this network would have millions of parameters in the first hidden layer alone. This architecture is very wasteful, and having so many parameters would quickly lead to “overfitting”. In short, do not use this architecture.
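The arithmetic above is worth seeing explicitly. The hidden layer size of 1000 below is a made-up example, just to show how fast the weight count blows up:

```python
# Parameter count for a fully connected first layer on a 200x200 RGB image.
height, width, channels = 200, 200, 3
weights_per_neuron = height * width * channels
print(weights_per_neuron)  # 120000 weights for a single neuron

# With, say, 1000 hidden neurons (an illustrative size), the first layer
# alone already needs 120 million weights.
hidden_neurons = 1000
print(weights_per_neuron * hidden_neurons)  # 120000000
```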

Convolutional Neural Networks(CNNs)

CNNs are designed especially for images. These networks have the same forward and backward propagation as normal neural nets, but are less wasteful. These networks have a special type of layer called a Convolution layer, in which each neuron looks at a different window/local region in the image. All these neurons share the same weights, so you can see how parameter sharing reduces the total number of parameters. A convolutional layer is shown below, where the pink layer is the input volume: a 32x32 image with 3 channels – Red, Green and Blue.
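Parameter sharing can be sketched with a plain sliding-window convolution: one small 3x3 filter (9 weights) is reused at every window position of a single-channel image, so the weight count no longer depends on the image size. The image and kernel values below are made up for illustration.

```python
# Minimal 2D convolution sketch: one 3x3 kernel slides over the image,
# reusing the same 9 weights at every window (parameter sharing).

def convolve2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

image = [[1, 2, 3, 0],
         [4, 5, 6, 0],
         [7, 8, 9, 0],
         [0, 0, 0, 0]]
kernel = [[1, 0, 0],
          [0, 1, 0],
          [0, 0, 1]]  # 9 shared weights, however large the image gets

result = convolve2d(image, kernel)
print(result[0][0])  # 1 + 5 + 9 = 15
```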

The features learnt by these networks are very good compared to those of normal neural nets: the first few layers learn generic features (e.g. edges), and as we go deeper, the layers learn features pertaining to the training dataset.

                    (image from : stanford course on visual recognition)

Architecture: CNNs have different types of layers – the Convolutional layer as discussed above; ReLU, which adds non-linearity; the Pooling layer, which downsamples the output to reduce the number of parameters; and the Fully Connected layer – the same as in an ordinary neural network, where each neuron is connected to all the neurons in the previous layer.
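The pooling layer's downsampling is easy to sketch. Here is 2x2 max pooling with stride 2 (a common choice): each output value keeps only the largest of four neighbours, halving the width and height. The input values are made up for illustration.

```python
# 2x2 max pooling with stride 2: keep the max of each 2x2 block,
# halving the spatial dimensions of the input.

def max_pool_2x2(image):
    out = []
    for i in range(0, len(image) - 1, 2):
        row = []
        for j in range(0, len(image[0]) - 1, 2):
            row.append(max(image[i][j], image[i][j + 1],
                           image[i + 1][j], image[i + 1][j + 1]))
        out.append(row)
    return out

print(max_pool_2x2([[1, 3, 2, 0],
                    [4, 2, 1, 1],
                    [0, 0, 5, 6],
                    [1, 2, 7, 8]]))  # [[4, 2], [2, 8]]
```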

Rise of Deep Learning:

A few reasons are:
1) With networks like CNNs, it is observed that the more layers you add, the better your performance is
2) Deep networks learn features which are very good, and hence they give better performance
3) With the computing power delivered through GPUs, it has become easier to train deeper networks

Andrew Ng explains basics of the rise of deep learning through this graph below:

So now everyone is trying to train networks which are very deep to gain more accuracy. With Residual learning researchers have been able to train a network which is 152 layers deep.
It turns out that deep networks are able to learn better features to differentiate between categories, and that is the reason why everyone is talking about deep nets.

Here are #parameters used in some famous deep learning architectures:

AlexNet – the first work which popularized CNNs in computer vision. It has around 60 M parameters
VGGNet – around 140 M parameters
ResNet – Residual Networks, as mentioned above, were trained with 152 layers, which is 8 times deeper than VGGNet. Interestingly, despite being much deeper, their architecture lets them get away with fewer parameters than VGGNet

Hence, you need a large amount of data to estimate these parameters. Just imagine finding 140 million numbers which also end up minimizing our loss function!

Problems with just stacking layers:
It is observed that if one just keeps on stacking layer after layer, the training error increases with the total number of layers, as the network has to learn more and more complex functions. This problem is addressed by the authors of ResNets.