https://www.youtube.com/watch?v=u4alGiomYP4
Looks like neural networks always have a non-linear activation function.
This video explains number recognition, which is a classification problem. Maybe multinomial logistic regression!
Softmax is one function that works well on classification problems; it turns raw scores into class probabilities.
NumPy is a numerical library for Python.
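To make softmax concrete, a minimal NumPy sketch (my own toy example, not from the video):

import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability, then exponentiate and
    # normalize so the outputs are positive and sum to 1.
    shifted = scores - np.max(scores)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Example: raw scores for the 10 digit classes; the largest score
# gets the highest probability.
scores = np.array([1.0, 2.0, 0.5, 3.0, 0.1, 0.0, 1.5, 0.2, 0.3, 0.4])
probs = softmax(scores)
print(probs, probs.sum())  # the probabilities sum to 1.0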
Monday, March 27, 2017
Sunday, March 26, 2017
Deep Learning from Nvidia blog!
Ref: https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-core-concepts/
---------
Part 1 focuses on introducing the main concepts of deep learning.
Part 2 provides historical background and delves into the training procedures, algorithms and practical tricks that are used in training for deep learning.
Part 3 covers sequence learning, including recurrent neural networks, LSTMs, and encoder-decoder systems for neural machine translation.
Part 4 covers reinforcement learning.
Feature engineering used to be an essential part of machine learning; it made the learning process easier.
And it was a difficult process. Every dataset needed different feature engineering: if we changed the dataset, we had to come up with new features all over again.
Feature engineering remains the single most effective technique to do well on difficult tasks.
Know the meaning of regression: simply put, it means a statistical relationship between input and output.
Deep Learning learns the features. So, feature engineering is replaced by feature learning.
Imagine how difficult it would be if we had to spell out all the features of a cat, a building, or a car.
Anyway, even humans are not taught that way. Our parents never told us, "since it has paws, eyes, a tail, and ears, you call it a cat". If we were taught that way, we would have called a tiger a cat! (Which is technically right, though.)
We saw a cat a hundred times and figured out on our own that it is a cat. The same is expected from a machine doing deep learning.
One major difference: we were able to recognize a cat from just tens or a hundred images, whereas a machine needs thousands or millions to get to that understanding!
And of course, the power we consume. Machines consume an exorbitant amount of power to run such algorithms!
We underestimate human capability so much, indeed! But what can we say; we have evolved these skills over millions of years, and computers have just started. And I would say they have already become pretty advanced in a matter of a few decades!
One more reason for the increased interest in deep learning in recent years (I didn't understand one or two terms in it): it seems we now have better activation functions that keep the gradients from becoming too small for the deeper layers to learn. Previously, these activation functions were absent, and networks suffered from vanishing gradient issues.
Source:
"While hierarchical feature learning was used before the field deep learning existed, these architectures suffered from major problems such as the vanishing gradient problem where the gradients became too small to provide a learning signal for very deep layers, thus making these architectures perform poorly when compared to shallow learning algorithms (such as support vector machines).
The term deep learning originated from new methods and strategies designed to generate these deep hierarchies of non-linear features by overcoming the problems with vanishing gradients so that we can train architectures with dozens of layers of non-linear hierarchical features. In the early 2010s, it was shown that combining GPUs with activation functions that offered better gradient flow was sufficient to train deep architectures without major difficulties. From here the interest in deep learning grew steadily."

Non-linear hierarchical features: What does this mean?
LSTM is another reason for the success story of deep learning. It deals with dependencies over time. I don't know if it helps in image recognition, but maybe in videos, where the time factor is involved. What it does is correlate the output with hundreds of older inputs and outputs.
This is very unlike earlier practice, where only up to about 10 previous inputs were used. The technique was introduced by two researchers (Hochreiter and Schmidhuber) in 1997, but it picked up only in recent years.
Maybe LSTMs and activation functions are related. I am just guessing!!
A perceptron is similar to a neuron in our brain.
The one app that used AI and got famous? Any guesses? It's Prisma. Very few people wouldn't have heard of it.
It turns normal pictures into art-like photos. It was one of the biggest buzzes of the last year!
In this context, one can see a deep learning algorithm as multiple feature learning stages, which then pass their features into a logistic regression that classifies an input.
Logistic regression is a simple and well-known algorithm. It doesn't have any hidden layers and can work on very little data.
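To make that concrete, a minimal sketch of logistic regression in NumPy (the weights here are made-up numbers, purely for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(x, w, b):
    # No hidden layers: one weighted sum fed straight into a sigmoid,
    # which outputs the probability of the positive class.
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # one sample with 3 features
w = np.array([0.8, 0.1, -0.4])   # made-up learned weights
print(logistic_regression(x, w, b=0.2))  # a probability in (0, 1)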
So, now into artificial neural networks! What happens there:
1) Take the input
2) Apply a transformation with a weighted sum over all the inputs.
3) Turn it into an intermediate state with a non-linear function.
This gives you the feature. This whole thing happens in a layer, and the transforming function is called a unit.
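A minimal sketch of one such layer in NumPy (the sizes, names, and the choice of ReLU are my own assumptions, just to illustrate the three steps):

import numpy as np

def layer(x, W, b):
    # Step 2: a transformation with a weighted sum over all inputs
    # (a matrix multiplication plus a bias).
    z = W @ x + b
    # Step 3: a non-linear function (ReLU here) turns the weighted sum
    # into the intermediate state, i.e. the learned features.
    return np.maximum(0.0, z)

# Step 1: take the input. Here, 4 inputs mapped to 3 features.
rng = np.random.default_rng(0)
x = rng.normal(size=4)        # the input
W = rng.normal(size=(3, 4))   # weights, adjusted during training
b = np.zeros(3)               # biases
print(layer(x, W, b))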
Like any other learning, the neural network gets better by looking at the error (or cost) and modifying the weights, probably at the second step.
So, the word convolution that we hear so often is involved in the 2nd step, as a transforming function. Similar to this is pooling. And also units, which are mostly the activation functions we were talking about earlier.
Units are often compared to neurons, but the word "neuron" is misleading. People compare units to how our brains work, but recently people started realising that biological neurons are more like entire multi-layer perceptrons than like a single unit. Hence it is better to avoid the word neuron and prefer the word unit, or even perceptron. My guess is that perceptrons are the traditional approach to deep learning, without good activation functions.
Weighted data: the input matrix multiplied by the weight matrix.
The difference between an activation function and a unit is that a unit can have multiple activation functions, like an LSTM, or an even more complex structure. (Doesn't that just mean a more complex activation function?)
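To see what "multiple activation functions inside one unit" means, here is a minimal sketch of a standard LSTM cell in NumPy (variable names and shapes are my own; real implementations differ in details):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h_prev, c_prev, W, U, b):
    # One unit, several activation functions: three sigmoid gates
    # plus two tanh transformations.
    n = len(h_prev)
    z = W @ x + U @ h_prev + b      # all four pre-activations stacked
    f = sigmoid(z[0*n:1*n])         # forget gate
    i = sigmoid(z[1*n:2*n])         # input gate
    o = sigmoid(z[2*n:3*n])         # output gate
    g = np.tanh(z[3*n:4*n])         # candidate cell state
    c = f * c_prev + i * g          # new cell state (the long-term memory)
    h = o * np.tanh(c)              # new hidden state (the output)
    return h, c

n, d = 3, 4                         # hidden size 3, input size 4
rng = np.random.default_rng(2)
h, c = lstm_cell(rng.normal(size=d), np.zeros(n), np.zeros(n),
                 rng.normal(size=(4*n, d)), rng.normal(size=(4*n, n)),
                 np.zeros(4*n))
print(h, c)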
Is it necessary that activation functions are always non-linear? At first I thought not, since we just saw a rectified linear function. But despite the name, ReLU is actually non-linear: it is linear only piecewise, with a kink at zero.
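A quick NumPy check (my own toy numbers): a linear function f must satisfy f(a + b) = f(a) + f(b), and ReLU fails this.

import numpy as np

def relu(v):
    # Rectified linear: zero for negative inputs, identity for positive.
    return np.maximum(0.0, v)

a, b = -2.0, 3.0
print(relu(a + b))           # 1.0
print(relu(a) + relu(b))     # 3.0, so relu(a + b) != relu(a) + relu(b)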
The difference between linear and non-linear functions: with non-linear functions we create new relations between the input parameters. It seems using non-linear functions creates increasingly complex features in deep learning.
The big point is that a chain of layers (even thousands of them) with linear functions is equivalent to a single layer, because a chain of matrix multiplications can be represented by a single matrix multiplication.
That's why non-linear functions are so important in deep learning.
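A quick NumPy demonstration of this collapse (the sizes are arbitrary, my own toy example):

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=5)

# Three purely linear "layers": no activation functions in between.
W1 = rng.normal(size=(5, 5))
W2 = rng.normal(size=(5, 5))
W3 = rng.normal(size=(5, 5))

chained = W3 @ (W2 @ (W1 @ x))     # pass x through the three layers

# The same chain collapses into one matrix, i.e. a single layer.
W_single = W3 @ W2 @ W1
collapsed = W_single @ x

print(np.allclose(chained, collapsed))  # True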
A layer is usually uniform, meaning it applies only one type of activation function, so that it can be easily compared to other parts of the network.
The first and last layers are called input and output layers and the ones in the middle are hidden layers.
Hidden doesn't mean it is hidden from the developer as well; just that it is neither the input nor the output.