# General Sequence Learning using Recurrent Neural Networks

Hi everyone, this is Alec and I’m going to

be talking to you today about using recurrent neural networks for text analysis.

To get an understanding of why this is a potential tool to use, it’s good to look at a bit of

the history of how text analysis has been done, particularly from a machine learning perspective.

So in machine learning we’re typically used to vector representations, so we know how to

deal with numbers. For categories we would use a one-hot vectorization

model. But when we move to trying to understand and

classify and regress sequences for instance, it becomes much less clear because our tools

are typically based on vector approaches. The way this is typically dealt with is by

computing some hard-coded feature transformations, for instance, using TF-IDF vectorizers, some

sort of compression model like LSA, and then plugging a linear model, such as a support vector

machine or a softmax classifier, on top of that.
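To make the hard-coded feature transformation concrete, here is a minimal sketch of TF-IDF weighting in plain Python. It assumes whitespace tokenization and the textbook weight `count * log(N / document_frequency)`; library vectorizers typically use smoothed and normalized variants, so treat the exact numbers as illustrative. The function name `tfidf_vectors` and the example documents are made up for this sketch.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # Term count scaled by inverse document frequency (unsmoothed).
        vectors.append([
            counts[tok] * math.log(n_docs / doc_freq[tok])
            for tok in vocabulary
        ])
    return vocabulary, vectors


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stocks fell on the nasdaq",
]
vocab, vectors = tfidf_vectors(docs)
```

Each document becomes a fixed-length vector over the vocabulary, which is what a linear model expects. Words that appear in every document, like "the", get a weight of zero, and all word order is thrown away.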

The purpose of this talk today is to look at what happens if we cut out those techniques and instead

replace them with an RNN. To get an understanding of why this might

be an advantage, note that structure is hard. N-grams are the typical way of preserving some

structure. Here we take our sentence, for instance,

‘the cat sat on the mat’, and we re-represent it as the occurrence of any individual word

or combinations of words. These combinations of words begin to give us

a way to see a little bit of structure. By ‘structure’ I mean preserving the ordering

of words. The problem with this is that once we have bigrams

or trigrams, the number of possible combinations of two or three words quickly becomes huge.

You can easily have 10 million plus features and this begins to get cumbersome, requires

lots of memory and slows things down in and of itself.
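A small sketch of what that looks like for the example sentence, in plain Python. The helper name `ngrams` and the 10,000-word vocabulary size are assumptions chosen for illustration; the point is only how fast the space of possible features grows with n.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the cat sat on the mat".split()

# Bag of unigrams plus bigrams: counts of words and of adjacent word pairs.
features = Counter(ngrams(tokens, 1) + ngrams(tokens, 2))

# Upper bound on distinct features for a hypothetical 10,000-word vocabulary.
vocab_size = 10_000
max_unigrams = vocab_size
max_bigrams = vocab_size ** 2
max_trigrams = vocab_size ** 3
```

Only a fraction of those combinations ever occurs in real text, but the observed count still climbs into the millions on large corpora, which is where the memory and speed costs come from.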

Structure, although it’s difficult, is also very important.

For certain tasks such as humor or sarcasm, just noting that the word ‘cat’

appeared or the word ‘dog’ appeared isn’t going to cut it.

And that’s what a lot of our models today do.

It helps to understand, though, why many models today are based on this and are quite successful:

n-grams can get you a long way on many tasks. Specific words are often very strong indicators:

‘useless’ in the case of negative sentiment and ‘fantastic’ in the case of positive

sentiment. If you’re, for instance, trying to classify

whether a document is about the stock market or is a recipe…you don’t see the phrase

‘green tea’ come up very much in a stock market conversation and you don’t see the

word ‘NASDAQ’ come up very much in a recipe. You can quickly separate things in those kinds

of tasks. It’s often a question of knowing what’s

right for your task at hand. If you’re trying to get at a more qualitative

understanding of what’s going on in a body of text this is where structure may be very

important. Whereas if you’re just trying to separate

out something that may be very indicative on a word level, then in many ways a bag of

words model can be quite strong. How an RNN works.

To understand its potential advantages over a bag of words model what an RNN does is it

reads through a sequence iteratively which is really nice because it’s how people do

it as well. It’s able to preserve some of the structure

of the sequence. It goes through each word and updates its

hidden representation based on that word and the input from the previous hidden state.

At time zero where we have no previous hidden state we feed in either a bunch of zeroes

or we treat it as another parameter to be learned in [INAUDIBLE] representation.

It just continues to do this all the way through the sequence.
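As a rough sketch of that update in symbols (the names $W$, $U$, $b$ and the choice of $\tanh$ as the activation are assumptions for illustration, not notation taken from the slides):

$$
h_t = \tanh\left(W x_t + U h_{t-1} + b\right), \qquad h_0 = \mathbf{0}
$$

Here $x_t$ is the input for the word at step $t$ and $h_{t-1}$ is the previous hidden state. Starting from zeros is one of the two options just mentioned; the other is to treat $h_0$ as a parameter to be learned.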

At each time step we have a 512-dimensional (in this case we had 512 hidden units) vector

representation of our sequence. It’s a way of taking the sequence of words

and, one time step at a time, converting it into a fixed-length representation.

As a bit of notation, arrows would be projections (dot products) and boxes would represent activations, vectors of values.

For instance, the activation of each hidden unit would be this box.

It just proceeds iteratively through. It’s important to note these projections

are largely shared across all time steps. This projection, this arrow, is shared for

all inputs across all time steps, and these hidden-to-hidden connections are preserved

as well across the sequence. This is what makes learning tractable in these

models. At the end of iterating through the sequence

we’ve got now a learned representation of the sequence, a vector form of our sequence

which then can be used by slapping on a traditional classifier.
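Here is a minimal sketch of that whole pipeline in plain Python: read the words one at a time, update a hidden state with the same weights at every step, then put a logistic classifier on the final state. Everything specific is an assumption for illustration: the tiny sizes, the `tanh` activation, the random untrained weights, and the names `encode` and `classify`. It shows the data flow only, not a trained model.

```python
import math
import random

random.seed(0)

EMBED_DIM, HIDDEN_DIM = 4, 8  # tiny illustrative sizes (the talk uses 512 hidden units)
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}


def random_matrix(rows, cols, scale=0.1):
    return [[random.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


# Parameters. The same W_in and W_hh are reused at every time step.
word_vectors = random_matrix(len(vocab), EMBED_DIM)  # one input vector per word
W_in = random_matrix(HIDDEN_DIM, EMBED_DIM)          # input -> hidden projection
W_hh = random_matrix(HIDDEN_DIM, HIDDEN_DIM)         # hidden -> hidden projection
w_out = [random.uniform(-0.1, 0.1) for _ in range(HIDDEN_DIM)]  # classifier weights


def encode(tokens):
    """Run the recurrence over the tokens and return the final hidden state."""
    hidden = [0.0] * HIDDEN_DIM  # no previous state at time zero, so start from zeros
    for token in tokens:
        x = word_vectors[vocab[token]]
        pre = [a + b for a, b in zip(matvec(W_in, x), matvec(W_hh, hidden))]
        hidden = [math.tanh(p) for p in pre]  # element-wise activation
    return hidden


def classify(tokens):
    """Logistic classifier on top of the fixed-length sequence representation."""
    score = sum(w * h for w, h in zip(w_out, encode(tokens)))
    return 1.0 / (1.0 + math.exp(-score))


shorter = encode("the cat sat".split())
longer = encode("the cat sat on the mat".split())
probability = classify("the cat sat on the mat".split())
```

Sentences of different lengths both come out as vectors of `HIDDEN_DIM` values, which is what lets a standard classifier sit on top. In a real model all of these weights would be trained jointly by gradient descent.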

In this toy example what we’ve done is we’ve read in an input sentence.

For instance, we’re trying to teach the model how to classify the subject of a sentence.

You can also stack them. Just as your RNN can go through an input sequence

and return its internal representation of that sequence, you can then train another

RNN on top of it or you can jointly train both. The structure is actually quite flexible. One final note is on the way that we do this

original input-to-hidden feed-forward step: the input is typically represented either as a traditional

one-hot vector, which really doesn’t get us too much advantage, but what’s really exciting

is we can represent this as what’s called an ‘embedding matrix’. These words, ‘the cat sat on the mat’,

would be represented as indexes into a matrix. ‘The’ would be represented as index 100. When we read through the sequence we would

look up the row, row 100, and we would return as the input being fed into the RNN the learned

representation in that embedding matrix. Let’s say we had 128 dimensions to be learned

as an input representation for our words. It would be equivalent to a 128 by, let’s say,

10,000 matrix if we have 10,000 words. We would then feed this in as input. That’s really cool because we can treat

it as a way to learn representations of our words. We’ll look later in this presentation at what

those actually look like. They give the model a lot of power. The big thing in the literature is RNNs have

a reputation for being very difficult to learn. They are often known to be unstable: simple

RNNs trained with generic stochastic gradient descent are actually very unstable and difficult to

learn. What has happened in the research literature

over the last few years is there are a bunch of tricks that have been developed that help

them be much more stable, much more powerful and much more reliable, effectively. To get an understanding of these we’re going

to go quickly through all these various tricks. The first of these is gating units. To understand what a gating unit is we first

need to look a little bit more into detail how a simple RNN works. What happens is we have our hidden state from

our previous time step. Again, at the original time step this can

just be zeroes or parameters and we receive input at time step T. We take input from the hidden state, h at

T minus one, and we take the input at T. We just add them together, for instance via

a dot product projection, and then we apply an element-wise activation function like [INAUDIBLE]

for instance. Then we have a new hidden state. At the next time step we receive more input,

we add it together and apply another element-wise activation function and this process

continues forward. The problem that can arise

with this is that information is always being updated at each time step. As a result it becomes difficult for information

to persist through a model like this. You can think of this as a form of exponential

decay. If we have a value, let’s say here, of one

and through this process we effectively end up multiplying that value by a reasonable

0.05, what happens over the course of several time steps is that value will exponentially

decay, for instance, to zero. Information has difficulty spreading through

a structure like this. There have been various changes called ‘gating

units’ to make this work better. What a gating unit does is instead of having

the hidden state at a new time step be a direct operation of the previous time step, it adds

in a variety of gating units that effectively transform the information in a more structured

way. One of these is called the ‘gated recurrent

unit’ introduced recently. What it does is it uses two types of gates. It uses a reset gate and an update gate. This is the reset gate and this is the update

gate. What the reset gate does is it takes an input

from a previous time step, both the hidden representation and the input, and it computes

a…there should be an element-wise sigmoid squash in here, and what it does is it basically

computes how much of the previous time step’s information should continue along this route. A reset value could be anywhere between zero

and one and what it does is it multiplies the previous hidden states by those reset

values. What this does is it allows a model to adaptively

forget information. For instance, you can imagine for sentiment

analysis that some of our information might be only relevant on the sentence level and

once you see a period your model can then clear some of its information because it knows

the sentence is over and then be able to use it again. Once we have the previous time step’s information

effectively gated by a reset gate, we then update and get our potential new hidden state,

h~t. What we then do is we use the update gate

to, instead of just using h~t, which would be a somewhat similar model to the previous example,

average it effectively with the previous hidden state based on the update gate. We take the output of h~t, multiply it by

one minus Z, where Z again computes values between zero and one for each unit, multiply the previous hidden state by Z and

add them together. This effectively [INAUDIBLE] If ‘Z’ is zero what it would do is it

would take entirely the new updated h~t and none of the previous hidden state. This would be equivalent in many ways to our

previous simple recurrent unit. In that case we would just have the new value

come through. Sure it would be gated by the reset gate

but again the new hidden state would be a completely new update compared to the previous

hidden state. Whereas if Z is one, this value here is

one and this value here is zero, so then our new hidden state is just a copy effectively

of the previous hidden state. Z values near one are ways of effectively

propagating information over longer steps of time. You can think of it as the easiest way to

remember something is just not to change it. A value can be spread over very long periods

of time if we just have Z values near one because that way we don’t have all this noise

of updating our hidden states. We just lock it and let that value persist. That’s a way to, for instance, on the sentence

level keep context from the previous sentence around if we had Z values that were locked

on. Again, like in our previous model, this can

be expanded to another time step just so you see how information flows. Again, there are all these calculations involved

in updating our hidden states and our gating values but the information at its core really

flows through this upper loop of gated values interacting with previous hidden states. Gating is essential. That was enough of an example of all the theoretical

reasons why these better designed gates might help propagate information better. But empirically it’s also very important. For sentiment analysis of longer sequences

of text, for instance, a paragraph or so, a few hundred words for instance, a simple

RNN has difficulty learning at all. You can see that it initially climbs downhill

a little bit but all it’s actually doing here is just predicting the average sentiment,

for instance, 0.5. Whereas a gated recurrent neural network

is able to quickly learn and continuously learn. Again, you can’t use simple recurrent units

for these more complex tasks especially when you have longer sequences of 100 plus words

or tokens. They just don’t work well because information

is hard to keep over longer sequences of time in those kinds of models. Now that we’ve talked about gating there’s

another question which is what kind of gating do you use. There are two types of models that have been

proposed. Gated recurrent units by Cho, recently from

the University of Montreal, which are used for machine translation and speech recognition

tasks. Then there’s also the more traditional

long short term memory. This has been around much longer and has been

used in far more papers. Various modifications to the classic architecture

exist but for text analysis GRU seems to be quite nice in general. It seems to be simpler, faster and optimizes

quicker at least on the sentiment analysis dataset. Because it only has two gates compared to

LSTM’s four it’s also a little bit faster. If you have a larger dataset and you don’t

mind waiting a little bit longer, LSTM may be better in the long run especially with

larger datasets because it has additional complexity with more gates. But again it seems like GRU does quite well

in these kinds of problems generally. I tend to favor it myself but you can try

both. The library we’ll be introducing later in

this talk supports both. The next question is exploding gradients. Exploding gradients are a training dynamics

phenomenon that happens in recurrent neural networks where the values that we’re trying

to update [INAUDIBLE] at each step of our training algorithm can become very large and

very unstable. This is one of the sources of the reputation

of RNNs being hard to train. Typically you would see small values, for

instance the norm of your gradient would be around one and just bouncing around and then

sometimes you’d see huge spikes. Those spikes can be quite damaging because

a traditional learning update would then rapidly change your values and this could result in

unstable oscillations and your whole model explodes. In 2012 there was a great paper that proposed

simply clipping the norm of the gradient. If the norm of the gradient exceeded a set value, for

instance 15, it would just be reset and scaled to that value. This was a common way of making RNNs much

more stable. Interestingly though, at least on text analysis

for sentiment, we don’t seem to see this problem with modern optimizers. It seems that the gradient decays pretty cleanly

and becomes quite stable over the course of learning. There’s another way of making recurrent neural

networks better and this is by using better gating functions. There was an interesting paper this year at

NIPS the basic idea of which was let’s make our gates steeper so they change more rapidly

from being a value of zero to a value of one. What this means is a traditional sigmoid would

change pretty smoothly between negative five and five. But when you randomly initialize one of these

numbers at the beginning of training typically your values would lie around the average,

0.5, for instance. You wouldn’t see much dynamics here. If we make our gate steeper what that means

is our gates begin to rapidly switch between zero and one much more easily, particularly

near the beginning of learning. What this seems to suggest is that models

that have used these steeper gating units tend to learn a bit faster because they begin

to learn how to use these gates quicker. This is another quick easy technique to add. Again, the library we’ll be introducing

later in this talk supports this to help make learning better in these models. Another technique is orthogonal initialization. Andrew Saxe last year did some great work

on initialization. When we begin training these models we don’t

know the values of these parameters to use in these dot products, for instance; the weight

matrices effectively. What the research literature typically does

is initialize [INAUDIBLE] for instance, random Gaussian or random uniform noise. What this research showed is that using random

orthogonal matrices worked much better. It’s in line with some previous other work

that has also noted various forms of similar initializations worked well for RNNs. Now we want to understand how we train these

models. There are a variety of techniques that can

be used. This is a visualization of the training dynamics

of various algorithms on toy datasets where we’re trying to classify these red dots

from these blue dots. We only have a linear model so all it can

do is learn effectively a line separating these two. It can’t do it perfectly because there’s always

going to be values separating this. What we see is that the traditional most basic

optimizer is stochastic gradient descent whereas there are these various other improvements

and techniques. The main point of this example is to demonstrate

that you effectively should not use SGD. SGD very early on in training can look quite

similar but once the norm of your gradients becomes smaller in the later stages of optimization

you want some sort of dynamism in your learning algorithm, whereas SGD, once it gets out of

the very steep earlier areas of learning, tends to slow down. This is particularly a problem oftentimes

in the space of text analysis because we have very sparse updates on words, for instance. There are rare words that you only see once

every thousand or 100,000 words and those words are very difficult to learn in a traditional

SGD framework. Whereas these various techniques like momentum

and Nesterov accelerated gradient what they do is effectively average together multiple

updates and accumulate those averages. They’re a form of smoothing out this stochastic

noise and accelerating directions of continuous updates. There’s another family of acceleration methods,

the Ada family, that effectively scale the learning rate, the amount by which we update a parameter

given a gradient, by some dynamics, some heuristics describing the local gradient. In the case of Adagrad what we do is we accumulate

the norm of the gradient updates seen so far with respect to a parameter and we scale our

learning rate. It’s a form of learning rate where we can

see that early on it learns quite quickly and later on it begins to slow down as it

reaches in this case near [INAUDIBLE]. Adadelta and RMSprop do something a little

bit like that but make it dynamic. It’s based on the local history instead

of the global history of the gradients for a parameter. There are a variety of optimizers and one

recently introduced called ‘Adam’ combines the early optimization speed that we saw in

that earlier example of Adagrad with the better later convergence of various other methods

like Adadelta and RMSprop. This looks quite good for text analysis with

RNNs. We can see that Adam gets off to a very early

learning start just like Adagrad. These results…actually there’s a slight

bug in my code for this so take them with a grain of salt but they still look good and

it’s a bug in the code so it might still be okay. That might actually explain one of the reasons

why we saw slightly worse generalization performance. It would train quite well but we would see

its performance on held out data might not have been as good for Adam because it learned

so much more quickly. We’re still looking into reasons why this

happens but in general, modern optimizers are essential on these kinds of problems. This just gives you a background on all the

various techniques for making RNNs more efficient in training and it can add quite a lot. Early on in learning we can see that Adam

and all these other techniques added together so this would be just a standard gated

RNN. Again, if we had a simple RNN on here it would

look pretty linear. If we add gradient clipping to make it more

stable so we can use a slightly larger learning rate it begins to learn faster. If we add orthogonal initialization we can

see again that it began to learn faster and learn better. Finally, if we add Adam we see another

huge gain over traditional SGD. These add up. We can see that Adam and all these other techniques

are able to reach lower effective minima and are at least faster. Up to 10x faster. Admittedly these techniques add a little bit

of computation time so it might only be for instance 7.5x faster on a wall clock compared

to efficiency per rate update. This is interesting because now RNNs can actually

overfit quite a lot. As they continue to fit to training data for

instance their performance on test data might plateau. We continue to improve on the training dataset

we’re given but this is called ‘overfitting’ where our RNN is effectively optimizing for

the details of the training data that aren’t true of new data. To combat this one of the techniques that

is used is called ‘early stopping’, which is: on each iteration through our dataset we will record

the training and validation scores of these models and we will stop once

we notice that our validation performance has stopped improving. Oftentimes this is going to occur in your

first or second iteration through the dataset with all these various techniques together. That’s good news because oftentimes models

in this space can take ten, 50 or 100 iterations through your training data to converge. It seems in the case of RNNs we often overfit

after one or two epochs through the data. To understand and get a better sense of how

these models can do we’re going to compare them to a much more standard technique in

the literature. We’re going to use the fantastic machine

learning library scikit-learn, and we’re going to use a standard linear model approach, a

traditional approach to text analysis. This would be using a TF-IDF vectorizer and

a linear model such as logistic regression. This is by no means meant to be the best model. In many cases, naive Bayes SVM is actually

better than logistic regression for classification for instance, but this is just a very easily

accessible, very easily comparable technique. To be fair, we’re going to use bigrams which

is a way of getting a little bit of structure into our data. Again, this way we could see ‘not good’

instead of just seeing the tokens for ‘not’ and ‘good’ occurring. We can get a little bit of structure which

might be useful in sentiment analysis. We’re going to use grid search to evaluate

potential [INAUDIBLE] for these linear models. We’re going to look at two. One is minimum

document frequency, which is a way of controlling for the size of our input to our linear model. This would take tokens or words that appear

in fewer than a given number of documents and would ignore them. If we see, for instance, the word ‘dinosaur’

and we’ve only seen it once in our dataset we’re going to ignore it effectively. Also we’re going to look at the regularization

coefficient, which is a way of preventing overfitting for these models. What we’re doing is grid search so we’re

looking at potential values for both of these. We’re not just explaining a potential performance

improvement based on poorly fitted parameters. Because these linear models tend to be faster

we are able to more effectively search over potential parameters. This is a fair way to give the linear model

a potential advantage because they’re much faster so we can much more quickly search through

multiple values. Our second model we’re going to be looking

at is one of these recurrent neural networks. Admittedly, this is our own personal research:

take every result with a grain of salt. I’m using whatever I’ve tried that worked. The general message though is that using a

modern optimizer such as Adam, a gated recurrent unit, steeper sigmoid gates and orthogonal

initialization are good defaults. A medium-size model that can work quite well

is a 256 dimensional embedding and a 512 dimensional hidden representation. Then we put on whatever output we need: logistic

regression for binary sentiment classification, linear regression for predicting real values,

etc. It’s quite flexible because the RNN in its

core is a way of taking these sequences of values and converting them into a vector. Once we’ve got that vector we can put whatever

traditional model we want on top of it so long as it’s differentiable and open to gradient

based training. How does this work on datasets? What we see quickly here is that our linear

model does incredibly well for smaller datasets. When we have for instance only 1,000 or 10,000

training examples we see that the linear model outperforms the RNN by 50%, for instance. But what we notice that’s interesting is that as

our datasets get bigger the RNN tends to scale better into larger

dataset sizes. Because the RNN is admittedly a much more

complex model and operates on the sequences themselves ideally with more training data

it can learn a much better way to do the task at hand. Whereas your linear model because it’s operating

on unstructured bag of words and is just a linear model might eventually hit a wall where

it’s not able to do any better. You can imagine certain situations that you

just aren’t going to be able to classify the sentiment, positiveness or negativeness of

a text when it uses double negation, for instance. That’s one example with sentiment analysis. What’s also interesting is we see this replicated

for instance for predicting the helpfulness of a customer review. This is interesting because this is a much

more qualitative thing. Sentiment is as well but how helpful a user’s

review of a product is, is an even more abstract concept. We see again that, as before, with small amounts

of data the linear model, in this case linear regression since we’re predicting real values, does much

better but it doesn’t seem to scale and make use of more data as effectively as an RNN. This is interesting. We can see that RNNs seem to have poor generalization

properties with small amounts of data but they seem to be doing better when we have

large amounts of data. At one million labeled examples we can often

be between zero and 30% better than the equivalent linear model. Again these are just these examples with logistic

regression and linear regression but that crossover seems to be robust and somewhere

between 100,000 and a million examples but it is dependent on the dataset. There’s only one unfortunate caveat to this

approach which is it’s quite slow. For a million paragraph size text examples

to converge that linear model takes about 30 minutes on a single CPU core. For an RNN if we use a high-powered graphics

card such as the GTX 980 it takes about two hours. That’s not too bad. Our RNN on a proper high-end graphics card

is only about four times slower at a million examples to converge than the linear model. Again this is on a basic CPU core. But if we train our RNN on just that CPU core

it takes five days. This is unfortunate because this means our

RNN is about 250 times slower on a CPU than the linear model and that’s just not going to cut it. This effectively is why we use GPUs in this

research. Here’s the cool part of the presentation. Again, an RNN when it’s being fed an input

sequence takes in the sequence and effectively learns a representation for each word. Each word starts as an identifier,

some value like ‘the’ is token 100, and gets replaced with a vector representation

that is learned by our model. These visualizations we’ll be showing you

are what happen when you look at what representations are learned by those models. What we’re going to do is use an algorithm

called ‘t-SNE’ to visualize these embeddings that our RNN learns. What we’ve done to make it a little clearer

is show the representations learned from training on only binary sentiment analysis. We’re trying to predict whether a given

customer review, for instance, likes a product or doesn’t like a product. What we’ve done is we’ve visualized these

representations in two dimensions using t-SNE and we’ve colored each word by the average

sentiment of the reviews it appears in. What we see is a kind of axis. Again, it doesn’t correspond to any actual

aligned axis because it’s t-SNE. But we see this continuum between very negative

words and very positive words. This isn’t too surprising. A model trained on sentiment analysis learns

to separate out negative and positive words. That’s what you’d expect to happen. We can take a little look at these very positive

and very negative clusters and see that it’s grouped into very understandable words like

‘useless, waste, poorly, disappointed’ as negative. You can see some interesting stuff where again

this visualization tries to group similar things close together. We can see that it’s actually identified

even though it is a very negative grouping, it’s also identified ‘returned, returning,

returns, return’ all together as well. That’s interesting because it seems to know

that ‘returned’ and return related words are very negative unsurprisingly if you find

them in a review. But it’s also separated them out slightly

from other more generic words. Then on the positive side we also see very

unsurprising indicators of happy sentiments. So ‘fantastic, wonderful, and pleased’. But what’s even more interesting about this

model is that we see other forms of grouping and structure being learned. We see that it pulls out for instance, quantities

of time; weeks, months, hours, minutes. We also see that it pulls out qualifiers like

‘really, absolutely, extremely, totally’. Again, qualifiers are interesting because

they are by themselves neutral. They don’t necessarily indicate positive or

negative sentiment; instead they modify it. You can have ‘extremely good’ and ‘extremely

bad’. You see that being pulled out together. You also see product nouns, for instance things

that products could be, things that are products like movies, books, stories, items, devices

are also grouped together. Additionally, punctuation is grouped together. This is indicative potentially of our model

learning to use these kinds of data which again implies that our model may actually

be learning to use some of the structure present in the data. Grouping punctuation together and learning

similar representations for it implies that it’s finding some use for it. We would expect again punctuation to be quite

useful for segmenting out and separating out meanings and notions. Quantities of time are interesting. They are slightly negatively associated which

is understandable when you talk about, ‘this product took months to show up’ or, ‘it

worked for a total of an hour’. Again, grouping them all together implies

some use of it and the same thing with qualifiers. We have no true evidence at least in this

picture of these words being used but by learning similar representations and by having them

grouped together it implies it’s finding a use for them. We can extrapolate from there that it may

in fact be learning to use these words in natural ways for sentiment analysis. Again, this is learned purely from zero and

one binary indicator variables. This is a bit like seeing a sequence of numbers,

1,000, 2,000, 3046, five and then realizing that tokens five and one thousand are exclamation

point and period. They’re similar to tokens 2,000 and 7,000

which are comma and colon. This is a very strong result and very interesting

to see this kind of similarity being learned by our model. This is cool but how can we actually use these

models? We’re also presenting today a basic library

to allow developers to use these recurrent neural networks for text analysis. It’s called Passage and it’s a tiny library

built on top of the great Theano machine learning framework [INAUDIBLE] math library. It’s incredibly alpha; we’re working on it

but it has a variety of features. We’re going to walk through now an example

of how to use Passage. This is Passage. It’s clonable via GitHub and it has a variety

of tools to make this useful. This is a little example we’re going to walk

through and explain real quickly, showing how we can use Passage to analyze text. We need to import the components that are

necessary. One of these is the tokenizer which is a way

of taking strings of text and separating them out into the individual tokens which would

be words and punctuation, for instance. A tokenizer can just be instantiated. It has a variety of parameters but has sensible

defaults. We emulate a scikit-learn style interface: we can call fit_transform on a body of training

text, which would again be a list of strings, for instance, and that returns a list

of training tokens that Passage can use natively to train RNN models. Additionally, we’re going to import the various

layers of a model. We have that embedding matrix we talked about,

the gated recurrent unit, and a dense output classifier. The way that we compose these into a training

model is by stacking them together in a list. Our input is one of these embedding matrices

and we’re going to set it to have 128 dimensions. We need to know how many of these features

to learn, how many of these tokens there are, and we’re going to just pull that out of how

many our tokenizer decided we needed. Then we’re going to use one of these gated

recurrent layers, in this case setting its size to 128. These sizes are somewhat smaller than you would

use for actual models, and you can see better performance from larger models, for instance,

but these are small enough to run on a CPU and not take forever. They’ll still take quite a while though. Finally, we have our dense output unit which,

if we were doing binary sentiment classification, detecting whether a string of text is negative

or positive, would be one unit because we’d be predicting one

value. We would use a sigmoid activation as a way

of quickly separating out negative and positive values. Then to make this model we instantiate it

through the RNN model class, which is importable from passage.models. We give it the layers we want to build our

custom architecture out of and we tell it what cost function we want to optimize. The cost function is effectively the function

that lets us train this model. It’s just a way of telling the model how well

it did on a given example. For binary classification we use binary [INAUDIBLE]

in this example. To train this model we just call a fit interface

which takes in training tokens which are made from training text and also takes in the training

labels we want to predict given those training texts. Then once that model has been trained…It

should be noted this only performs one iteration through your dataset. As mentioned earlier, you may want to train

for multiple iterations if, for instance, your model hasn’t converged, and you may want

to measure your performance on held-out data to know when to stop training a model if it

begins to overfit. Right now we’ve left that part to you but

we will be extending this to have interfaces to automatically do this. Finally, if you want to have your model then

predict on new data you can just call model.predict on tokenizer.transform of

test text, and this will return the model’s predictions for the new data. That’s an example of how to use Passage. To summarize, RNNs are now a potentially competitive

tool in certain situations for text analysis. Admittedly there are a lot of disclaimers

there, but a general trend seems to be emerging: if you

have a large dataset, for instance a million-plus examples, and you have a GPU, they can look quite

good. They potentially can outperform linear models

and might not take all that much longer. But if you have a smaller dataset and don’t

have a GPU, it can be very difficult to justify them, despite how cool these models might seem compared

to linear models. They are a lot slower, they have a lot of

complexity, a lot of different parts and a lot of different architectures you can change

and they seem to have poor generalization results with small datasets. Thanks for listening. If you have any questions you can let me know

at Alec at Indico dot I-O. Also if you’d like to see a more general introduction

to machine learning and deep learning in Python I have another video that you can check out

in the upper right, introducing how, as a Python developer, you can use the awesome Theano

library to implement these algorithms yourself. Additionally if you’d like to check out and

learn more about Indico feel free to visit our website at Indico dot I-O where we have

various tools like Passage available for developers to use for machine learning. Thanks.
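The walkthrough above describes a tokenizer with a fit_transform interface feeding a stack of an embedding matrix, a gated recurrent layer, and a single sigmoid output unit scored with a binary cost. The following is a minimal pure-Python sketch of that data flow, not Passage's implementation. `ToyTokenizer`, `ToyGRUClassifier`, and the helper functions are hypothetical names invented for illustration. The weights are random and untrained, biases are omitted, and the gate blending shown is one common GRU convention. Only the forward pass is shown, with a small size of 8 instead of 128 to keep it quick.

```python
import math
import random
import re


class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer with a scikit-learn style interface."""

    def __init__(self):
        self.vocab = {"<unk>": 0}  # id 0 is reserved for unseen tokens

    def _split(self, text):
        # Words and individual punctuation marks become separate tokens.
        return re.findall(r"\w+|[^\w\s]", text.lower())

    def fit_transform(self, texts):
        for text in texts:
            for tok in self._split(text):
                self.vocab.setdefault(tok, len(self.vocab))
        return self.transform(texts)

    def transform(self, texts):
        return [[self.vocab.get(t, 0) for t in self._split(x)] for x in texts]

    @property
    def n_features(self):
        return len(self.vocab)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class ToyGRUClassifier:
    """Embedding -> gated recurrent layer -> one sigmoid unit (forward pass only)."""

    def __init__(self, n_features, size=8, seed=0):
        rng = random.Random(seed)
        self.size = size
        self.embed = rand_matrix(n_features, size, rng)  # one row per token id
        # Input (W) and recurrent (U) weights for update gate z, reset gate r, candidate h.
        self.W = {k: rand_matrix(size, size, rng) for k in "zrh"}
        self.U = {k: rand_matrix(size, size, rng) for k in "zrh"}
        self.out = [rng.uniform(-0.1, 0.1) for _ in range(size)]

    def predict_one(self, token_ids):
        h = [0.0] * self.size
        for tid in token_ids:
            x = self.embed[tid]  # row lookup, equivalent to a one-hot vector times the matrix
            z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
            r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
            rh = [ri * hi for ri, hi in zip(r, h)]
            cand = [math.tanh(a + b) for a, b in zip(matvec(self.W["h"], x), matvec(self.U["h"], rh))]
            # Blend the old state and the candidate state using the update gate.
            h = [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]
        return sigmoid(sum(w * hi for w, hi in zip(self.out, h)))


def binary_cross_entropy(p, y):
    # Cost for one example: low when the predicted probability p matches label y.
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


train_text = ["This product took months to show up!", "Works great, love it."]
train_labels = [0, 1]

tokenizer = ToyTokenizer()
train_tokens = tokenizer.fit_transform(train_text)
model = ToyGRUClassifier(n_features=tokenizer.n_features, size=8)
probs = [model.predict_one(seq) for seq in train_tokens]
losses = [binary_cross_entropy(p, y) for p, y in zip(probs, train_labels)]
```

Because nothing has been trained here, the predicted probabilities sit close to 0.5. A real library would adjust the embedding and recurrent weights to reduce the cost over the training set.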
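The talk notes that a single fit call makes only one pass through the dataset, and that deciding when to stop by watching held-out data is left to the user. One way to do that can be sketched as below, assuming the model exposes some way to train for one pass and to score held-out data. `fit_with_early_stopping`, `ScriptedModel`, and the `patience` parameter are hypothetical names for illustration, not part of Passage, and the loss values are made up purely to exercise the loop.

```python
def fit_with_early_stopping(train_one_epoch, validation_loss, max_epochs=20, patience=2):
    """Train one pass at a time; stop once held-out loss has not improved for `patience` passes."""
    best_loss = float("inf")
    best_epoch = -1
    history = []
    for epoch in range(max_epochs):
        train_one_epoch()          # e.g. one call to a fit method that does a single pass
        loss = validation_loss()   # cost measured on data the model never trained on
        history.append(loss)
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break                  # held-out loss keeps rising: likely overfitting
    return best_epoch, history


class ScriptedModel:
    """Stand-in model whose held-out loss follows a fixed, made-up curve."""

    def __init__(self, losses):
        self.losses = losses
        self.epochs_trained = 0

    def train_one_epoch(self):
        self.epochs_trained += 1

    def validation_loss(self):
        return self.losses[self.epochs_trained - 1]


# Illustrative values only: loss improves for three passes, then starts to climb.
model = ScriptedModel([0.70, 0.55, 0.48, 0.50, 0.53, 0.60])
best_epoch, history = fit_with_early_stopping(
    model.train_one_epoch, model.validation_loss, max_epochs=6, patience=2
)
```

In practice you would also keep a copy of the model's parameters from the best pass, since the final pass is by construction worse on held-out data than the best one.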
