General Sequence Learning using Recurrent Neural Networks

Hi everyone, this is Alec and I’m going to
be talking to you today about using recurrent neural networks for text analysis.
To get an understanding of why this is a potential tool to use, it’s good to look at a bit of
a history of how text analysis has been done, particularly from a machine learning perspective.
So in machine learning typically we’re used to vector representation so we know how to
deal with numbers. For categories we would use a one-hot vectorization
model. But when we move to trying to understand and
classify and regress sequences for instance, it becomes much less clear because our tools
are typically based on vector approaches. The way this is typically dealt with is by
computing some hard-coded feature transformations, for instance, using TF-IDF vectorizers, some
sort of compression model like LSA and then plugging a linear model, such as a support vector
machine or a softmax classifier, on top of that.
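To make the idea of a hard-coded feature transformation concrete, here is a minimal sketch of TF-IDF in plain Python. The function name `tfidf_vectors` and the smoothed IDF formula are assumptions chosen for illustration (one common variant), not the code used in the talk. It shows that two sentences with the same words in a different order end up with the same vector.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn whitespace-tokenized documents into fixed-length TF-IDF vectors."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (an assumed, commonly used variant).
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # One slot per vocabulary word: term count times IDF weight.
        vectors.append([counts[tok] * idf[tok] for tok in vocab])
    return vocab, vectors


docs = ["the cat sat on the mat", "the mat sat on the cat"]
vocab, vectors = tfidf_vectors(docs)
print(vocab)                     # ['cat', 'mat', 'on', 'sat', 'the']
print(vectors[0] == vectors[1])  # True: the word order is lost
```

A linear model would then be trained on these vectors; the vectors themselves never see the ordering of the words.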
The purpose of this talk today is what happens if we cut out those techniques and instead
replace them with an RNN. To get an understanding of why this might
be an advantage, structure is hard. Ngrams are the typical way of preserving some
structure. This would be we take our sentence, for instance,
‘the cat sat on the mat’, and we re-represent it as the occurrence of any individual word
or combinations of words. These combinations of words begin to give us
a way to see a little bit of structure. By ‘structure’ I mean preserving the ordering
of words. The problem with this is that once we have bigrams
or trigrams, the number of possible combinations of two or three words quickly becomes huge.
You can easily have 10 million plus features and this begins to get cumbersome, requires
lots of memory and slows things down in and of itself.
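As a small sketch of what this looks like, the snippet below pulls unigrams and bigrams out of the example sentence and then works out how many distinct bigrams and trigrams are possible. The helper name `ngrams` and the 10,000-word vocabulary are assumptions for illustration only; real datasets contain far fewer distinct combinations than the upper bound, but the counts still grow very quickly.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the cat sat on the mat".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(bigrams)
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]

# Upper bound on distinct features for a hypothetical 10,000-word vocabulary.
vocab_size = 10_000
possible_bigrams = vocab_size ** 2
possible_trigrams = vocab_size ** 3
print(possible_bigrams)   # 100000000
print(possible_trigrams)  # 1000000000000
```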
Structure, although it’s difficult, is also very important.
For certain tasks such as humor or sarcasm, looking only at whether the word ‘cat’
appeared or the word ‘dog’ appeared isn’t going to cut it.
And that’s what a lot of our models today do.
To understand, though, why many models today are based on this and are quite successful:
ngrams can get you a long way on many tasks. Specific words are often very strong indicators:
‘useless’ in the case of negative sentiment and ‘fantastic’ in the case of positive
sentiment. If you’re, for instance, trying to classify
whether a document is about the stock market or is a recipe, you don’t see the phrase
‘green tea’ come up very much in a stock market conversation and you don’t see the
word ‘NASDAQ’ come up very much in a recipe. You can quickly separate things in those kinds
of tasks. It’s often a question of knowing what’s
right for your task at hand. If you’re trying to get at a more qualitative
understanding of what’s going on in a body of text this is where structure may be very
important. Whereas if you’re just trying to separate
out something that may be very indicative on a word level, then in many ways a bag of
words model can be quite strong. How an RNN works.
To understand its potential advantages over a bag of words model what an RNN does is it
reads through a sequence iteratively which is really nice because it’s how people do
it as well. It’s able to preserve some of the structure
of the sequence. It goes through each word and updates its
hidden representation based on that word and the input from the previous hidden state.
At time zero where we have no previous hidden state we feed in either a bunch of zeroes
or we treat it as another parameter to be learned in [INAUDIBLE] representation.
It just continues to do this all the way through the sequence.
At each time step we have a 512-dimensional (in this case we had 512 hidden units) vector
representation of our sequence. It’s a way of taking the sequence of words
and, time step by time step, converting it into a fixed-length representation.
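The loop described here can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the implementation from the talk: the names `rnn_forward`, `W_xh`, `W_hh` and `b_h` are made up, tanh is assumed as the activation, the hidden state starts at zeros, and the sizes are tiny (4 hidden units instead of 512) so it runs instantly.

```python
import math
import random


def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a simple RNN over a sequence and return the final hidden state.

    inputs: list of input vectors, one per time step
    W_xh:   input-to-hidden weights, indexed [hidden][input]
    W_hh:   hidden-to-hidden weights, indexed [hidden][hidden]
    b_h:    hidden bias, one value per hidden unit
    """
    n_hidden = len(b_h)
    h = [0.0] * n_hidden  # no previous state at time zero, so start from zeros
    for x in inputs:
        new_h = []
        for j in range(n_hidden):
            total = b_h[j]
            total += sum(W_xh[j][i] * x[i] for i in range(len(x)))
            total += sum(W_hh[j][k] * h[k] for k in range(n_hidden))
            new_h.append(math.tanh(total))  # assumed activation function
        h = new_h  # the same weights are reused at the next time step
    return h


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
n_in, n_hidden = 3, 4
W_xh = rand_matrix(n_hidden, n_in)
W_hh = rand_matrix(n_hidden, n_hidden)
b_h = [0.0] * n_hidden

short_seq = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
long_seq = short_seq + [[0.0, 0.0, 1.0]] * 5
h_short = rnn_forward(short_seq, W_xh, W_hh, b_h)
h_long = rnn_forward(long_seq, W_xh, W_hh, b_h)
print(len(h_short), len(h_long))  # 4 4
```

A sequence of two steps and a sequence of seven steps both come out as a vector of the same length, which is the point: the hidden state is the fixed-length representation.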
As a bit of notation, arrows would be projections (dot products) and boxes would represent activations, the vectors of values.
For instance, the activation of each hidden unit would be this box.
It just proceeds iteratively through. It’s important to note these projections
are shared across all time steps. This projection, this arrow, is shared for
all inputs across all time steps, and these hidden-to-hidden connections are shared
as well across the sequence. This is what makes learning tractable in these
models. At the end of iterating through the sequence
we’ve got now a learned representation of the sequence, a vector form of our sequence
which then can be used by slapping on a traditional classifier.
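As a rough sketch of that last step, here is a logistic-regression output placed on top of a fixed-length sequence vector. The vector `h_final` is a stand-in for an RNN's last hidden state, and the weights `w` and bias `b` are hypothetical values chosen for illustration; in practice they would be learned jointly with the RNN.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def classify(h, w, b):
    """Logistic-regression output on top of a fixed-length sequence vector h."""
    return sigmoid(sum(w_j * h_j for w_j, h_j in zip(w, h)) + b)


h_final = [0.2, -0.7, 0.5, 0.1]  # stand-in for the RNN's last hidden state
w = [1.0, -1.0, 0.5, 0.0]        # hypothetical output weights
b = 0.0                          # hypothetical output bias
p = classify(h_final, w, b)
print(0.0 < p < 1.0)  # True: the output can be read as a probability
```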
In this toy example what we’ve done is we’ve read in an input sentence.
For instance, we’re trying to teach the model how to classify the subject of a sentence.
You can also stack them. Just as your RNN can go through an input sequence
and return its internal representation of that sequence, you can then train another
RNN on top of it or you can jointly train both. The structure is actually quite flexible. One final note is about the way that we do this
original input-to-hidden feed forward. The input is typically represented either as a traditional
one-hot vector, which really doesn’t get us too much advantage, but what’s really exciting
is we can represent this as what’s called an ‘embedding matrix’. These words, ‘the cat sat on the mat’,
would be represented as indexes into a matrix. ‘The’ would be represented as index 100. When we read through the sequence we would
look up the row, row 100, and we would return as the input being fed into the RNN the learned
representation in that embedding matrix. Let’s say we had 128 dimensions to be learned
as an input representation for our words. It would be equivalent to a 128 by, let’s say,
10,000 matrix if we have 10,000 words. We would then feed this in as input. That’s really cool because we can treat
it as a way to learn representations of our words. We’ll look later in this presentation at what
those actually look like. They give the model a lot of power. The big thing in the literature is RNNs have
a reputation for being very difficult to learn. They are often known to be unstable: simple
RNNs trained with generic stochastic gradient descent are actually very unstable and difficult to
learn. What has happened in the research literature
over the last few years is there are a bunch of tricks that have been developed that help
them be much more stable, much more powerful and much more reliable, effectively. To get an understanding of these we’re going
to go quickly through all these various tricks. The first of these is gating units. To understand what a gating unit is we first
need to look in a little more detail at how a simple RNN works. What happens is we have our hidden state from
our previous time step. Again, at the original time step this can
just be zeroes or parameters and we receive input at time step t. We take input from the hidden state h at
time t minus one and we take the input at time t. We just add them together for instance via
a dot product projection and then we apply an element wise activation function like [INAUDIBLE]
for instance. Then we have a new hidden state. At the next time step we receive more input,
we add it together and apply another element wise activation function and this process
continues forward. To understand what the problem with this can be:
information is always being updated at each time step. As a result it becomes difficult for information
to persist through a model like this. You can think of this as a form of exponential
decay. If we have a value, let’s say here, of one
and through this process we effectively end up multiplying that value by something like
0.05 at each time step, what happens over the course of several time steps is that the value will exponentially
decay, for instance, to zero. Information has difficulty spreading through
a structure like this. There have been various changes called ‘gating
units’ to make this work better. What a gating unit does is instead of having
the hidden state at a new time step be a direct operation of the previous time step, it adds
in a variety of gating units that effectively transform the information in a more structured
way. One of these is called the ‘gated recurrent
unit’ introduced recently. What it does is it uses two types of gates. It uses a reset gate and a dynamics gate, which the GRU paper calls the update gate, Z. This is the reset gate and this is the dynamics
gate. What the reset gate does is it takes an input
from a previous time step, both the hidden representation and the input, and it computes
a…there should be element wise sigmoid squash in here, and what it does is it basically
computes how much of the previous time steps’ information should continue along this route. A reset value could be anywhere between zero
and one and what it does is it multiplies the previous hidden states by those reset
values. What this does is it allows a model to adaptively
forget information. For instance, you can imagine for sentiment
analysis that some of our information might be only relevant on the sentence level and
once you see a period your model can then clear some of its information because it knows
the sentence is over and then be able to use it again. Once we have the previous time step’s information
effectively gated by a reset gate, we then update and get our potential new hidden state,
h~t. What we then do is we use the dynamics gate:
instead of just using h~t, which would be somewhat similar to the model in the previous example,
we average it effectively with the previous hidden state based on the dynamics gate. We weight h~t by
one minus Z and the previous hidden state by Z, where Z again computes values between zero and one for each unit, and we
add them together. This effectively [INAUDIBLE] If ‘Z’ is zero what it would do is it
would take entirely the new updated h~t and none of the previous hidden state. This would be equivalent in many ways to our
previous simple recurrent unit. In that case we would just have the new value
come through. Sure, it would be gated by the reset gate
but again the new hidden state would be a completely new update compared to the previous
hidden state. Whereas if Z is one, this value here is
one and this value here is zero, so our new hidden state is just a copy effectively
of the previous hidden state. Z values of near one are ways of effectively
propagating information over longer steps of time. You can think of it as the easiest way to
remember something is just not to change it. A value can be spread over very long periods
of time if we just have Z values near one because that way we don’t have all this noise
of updating our hidden states. We just lock it and let that value persist. That’s a way to, for instance, on the sentence
level keep context from the previous sentence around if we had Z values that were locked
on. Again, like in our previous model, this can
be expanded to another time step just so you see how information flows. Again, there are all these calculations involved
in updating our hidden states and our gating values but the information at its core really
flows through this upper loop of gated values interacting with previous hidden states. Gating is essential. That was enough of an example of all the theoretical
reasons why these better designed gates might help propagate information better. But empirically it’s also very important. For sentiment analysis of longer sequences
of text, for instance, a paragraph or so, a few hundred words, a simple
RNN has difficulty learning at all. You can see that it initially climbs downhill
a little bit but all it’s actually doing here is just predicting the average sentiment,
for instance, 0.5. Whereas a gated recurrent neural network
is able to learn quickly and keep learning. Again, you can’t use simple recurrent units
for these more complex tasks especially when you have longer sequences of 100 plus words
or tokens. They just don’t work well because information
is hard to keep over longer sequences of time in those kinds of models. Now that we’ve talked about gating there’s
another question which is what kind of gating do you use. There are two types of models that have been
proposed. Gated recurrent units by Cho, recently from
the University of Montreal, which are used for machine translation and speech recognition
tasks. Then there’s also the more traditional
long short-term memory (LSTM). This has been around much longer and has been
used in far more papers. Various modifications to the classic architecture
exist but for text analysis GRU seems to be quite nice in general. It seems to be simpler, faster and optimizes
quicker at least on the sentiment analysis dataset. Because it only has two gates compared to
LSTM’s four it’s also a little bit faster. If you have a larger dataset and you don’t
mind waiting a little bit longer, LSTM may be better in the long run especially with
larger datasets because it has additional complexity with more gates. But again it seems like GRU does quite well
in these kinds of problems generally. I tend to favor it myself but you can try
both. The library we’ll be introducing later in
this talk supports both. The next topic is exploding gradients. Exploding gradients are a training dynamics
phenomenon that happens in recurrent neural networks where the values that we’re trying
to update [INAUDIBLE] at each step of our training algorithm can become very large and
very unstable. This is one of the sources of the reputation
of RNNs being hard to train. Typically you would see small values, for
instance the norm of your gradient would be around one and just bouncing around and then
sometimes you’d see huge spikes. Those spikes can be quite damaging because
a traditional learning update would then rapidly change your values and this could result in
unstable oscillations and your whole model explodes. In 2012 there was a great paper that proposed
simply clipping the norm of the gradient. If the norm of the gradient exceeded a set value, for
instance 15, the gradient would just be rescaled so that its norm equals that value. This was a common way of making RNNs much
more stable. Interestingly though, at least on text analysis
for sentiment, we don’t seem to see this problem with modern optimizers. It seems that the gradient decays pretty cleanly
and becomes quite stable over the course of learning. There’s another way of making recurrent neural
networks better and this is by using better gating functions. There was an interesting paper this year at
NIPS the basic idea of which was let’s make our gates steeper so they change more rapidly
from being a value of zero to a value of one. What this means is a traditional sigmoid would
change pretty smoothly between negative five and five. But when you randomly initialize one of these
numbers at the beginning of training typically your values would lie around the average,
0.5, for instance. You wouldn’t see much dynamics here. If we make our gate steeper what that means
is our gates begin to rapidly switch between zero and one much more easily, particularly
near the beginning of learning. What this seems to suggest is that models
that have used these steeper gating units tend to learn a bit faster because they begin
to learn how to use these gates quicker. This is another quick easy technique to add. Again, the library we’ll be introducing
later in this talk supports it to help make learning better in these models. Another technique is orthogonal initialization. Andrew Saxe last year did some great work
on initialization. When we begin training these models we don’t
know the values of these parameters to use in these dot products, for instance; the weight
matrices effectively. What the research literature typically does
is initialize [INAUDIBLE] for instance, random Gaussian or random uniform noise. What this research showed is that using random
orthogonal matrices worked much better. It’s in line with some previous other work
that has also noted various forms of similar initializations worked well for RNNs. Now we want to understand how we train these
models. There are a variety of techniques that can
be used. This is a visualization of the training dynamics
of various algorithms on toy datasets where we’re trying to classify these red dots
from these blue dots. We only have a linear model so all it can
do is learn effectively a line separating these two. It can’t do it perfectly because there’s always
going to be values separating this. What we see is that the traditional most basic
optimizer is stochastic gradient descent whereas there are these various other improvements
and techniques. The main point of this example is, effectively, to demonstrate
that you should not use SGD. SGD very early on in training can look quite
similar but once the norm of your gradients becomes smaller in the later stages of optimization
you want some sort of dynamism in your learning algorithm, whereas SGD once it gets out of
the very steep earlier areas of learning tends to slow down. This is particularly a problem oftentimes
in the space of text analysis because we have very sparse updates on words, for instance. There are rare words that you only see once
every thousand or 100,000 words and those words are very difficult to learn in a traditional
SGD framework. Whereas these various techniques like momentum
and Nesterov accelerated gradient, what they do is effectively average together multiple
updates and accumulate those averages. They’re a form of smoothing out this stochastic
noise and accelerating directions of continuous updates. There’s another family of acceleration methods,
the ada family, that effectively scale the learning rate, the amount by which we update a parameter
given a gradient, by some dynamics, some heuristics describing the local gradient. In the case of adagrad what we do is we accumulate
the norm of the gradient updates seen so far with respect to a parameter and we scale our
learning rate. It’s a form of learning rate where we can
see that early on it learns quite quickly and later on it begins to slow down as it
reaches in this case near [INAUDIBLE]. Adadelta and RMS prop do something a little
bit like that but make it dynamic. It’s based on the local history instead
of the global history of the gradients for a parameter. There are a variety of optimizers and one
recently introduced called ‘Adam’ combines the early optimization speed that we saw in
that earlier example of adagrad with the better later convergence of various other methods
like adadelta and RMS prop. This looks quite good for text analysis with
RNNs. We can see that Adam gets off to a very early
learning start just like adagrad. These results…actually there’s a slight
bug in my code for this so take them with a grain of salt but they still look good and
it’s a bug in the code so it might still be okay. That might actually explain one of the reasons
why we saw slightly worse generalization performance. It would train quite well but we would see
its performance on held out data might not have been as good for Adam because it learned
so much more quickly. We’re still looking into reasons why this
happens but in general, modern optimizers are essential on these kinds of problems. This just gives you a background on all the
various techniques for making RNNs more efficient in training and it can add quite a lot. Early on in learning we can see that Adam
and all these other techniques add together. So this would be just a standard gated
RNN. Again, if we had a simple RNN on here it would
look pretty linear. If we add gradient clipping to make it more
stable so we can use a slightly larger learning rate it begins to learn faster. If we add orthogonal initialization we can
see again that it begins to learn faster and learn better. Finally, if we add Adam we see another
huge gain over traditional SGD. These add up. We can see that Adam and all these other techniques
are able to reach lower effective minima and are at least faster. Up to 10x faster. Admittedly these techniques add a little bit
of computation time so it might only be for instance 7.5x faster in wall-clock time compared
to the efficiency per update. This is interesting because now RNNs can actually
overfit quite a lot. As they continue to fit to training data, for
instance, their performance on test data might plateau. We continue to improve on the training dataset
we’re given but this is called ‘overfitting’ where our RNN is effectively optimizing for
the details of the training data that aren’t true of new data. To combat this one of the techniques that
is used is called ‘early stopping’: each iteration through our dataset we record
the training and validation scores of these models and we stop once
we notice that our validation performance has stopped improving. Oftentimes this is going to occur in your
first or second iteration through the dataset with all these various techniques together. That’s good news because oftentimes models
in this space can take ten, 50 or 100 iterations through your training data to converge. It seems in the case of RNNs we often overfit
after one or two epochs through the data. To understand and get a better sense of how
these models can do we’re going to compare them to a much more standard technique in
the literature. We’re going to use the fantastic machine
learning library scikit-learn (sklearn), and we’re going to use a standard linear model approach, a
traditional approach to text analysis. This would be using a TF-IDF vectorizer and
a linear model such as logistic regression. This is by no means meant to be the best model. In many cases, naïve Bayes SVM is actually
better than logistic regression for classification for instance, but this is just a very easily
accessible, very easily comparable technique. To be fair, we’re going to use bigrams which
is a way of getting a little bit of structure into our data. Again, this way we could see ‘not good’
instead of just seeing the tokens for ‘not’ and ‘good’ occurring. We can get a little bit of structure which
might be useful in sentiment analysis. We’re going to use grid search to evaluate
potential hyperparameters for these linear models. We’re going to look at two. The first is minimum
document frequency, which is a way of controlling for the size of the input to our linear model. This would take tokens or words that appear
in fewer than a given number of documents and would ignore them. If we see, for instance, the word ‘dinosaur’
and we’ve only seen it once in our dataset we’re going to ignore it effectively. Also we’re going to look at the regularization
coefficient which is a way of preventing overfitting for the linear models. What we’re doing is grid search so we’re
looking at potential values for both of these. We’re not just explaining a potential performance
improvement based on poorly fitted parameters. Because these linear models tend to be faster
we are able to more effectively search over potential parameters. This is a fair way to give the linear model
a potential advantage because they’re much faster so we can much more quickly search through
multiple values. Our second model we’re going to be looking
at is one of these recurrent neural networks. Admittedly, this is our own personal research:
take every result with a grain of salt. I’m using whatever I’ve tried that worked. The general message though is that using a
modern optimizer such as Adam, a gated recurrent unit, steeper sigmoid gates and orthogonal
initialization are good defaults. A medium-size model that can work quite well
is a 256 dimensional embedding and a 512 dimensional hidden representation. Then we put on whatever output we need: logistic
regression for binary sentiment classification, linear regression for predicting real values,
etc. It’s quite flexible because the RNN in its
core is a way of taking these sequences of values and converting them into a vector. Once we’ve got that vector we can put whatever
traditional model we want on top of it so long as it’s differentiable and open to gradient
based training. How does this work on datasets? What we see quickly here is that our linear
regression model does incredibly well for smaller datasets. When we have for instance only 1,000 or 10,000
training examples we see that the linear model outperforms the RNN by 50%, for instance. But what we notice that’s interesting is that as
our datasets get bigger the RNN tends to scale better into larger
dataset sizes. Because the RNN is admittedly a much more
complex model and operates on the sequences themselves ideally with more training data
it can learn a much better way to do the task at hand. Whereas your linear model because it’s operating
on unstructured bag of words and is just a linear model might eventually hit a wall where
it’s not able to do any better. You can imagine certain situations that you
just aren’t going to be able to classify the sentiment, positiveness or negativeness of
a text when it uses double negation, for instance. That’s one example with sentiment analysis. What’s also interesting is we see this replicated
for instance for predicting the helpfulness of a customer review. This is interesting because this is a much
more qualitative thing. Sentiment is as well but how helpful a user’s
review of a product is gets at an even more abstract concept. We see again that as before with small amounts
of data the linear model, in this case linear regression since we’re predicting real values, does much
better but it doesn’t seem to scale and make use of more data as effectively as an RNN. This is interesting. We can see that RNNs seem to have poor generalization
properties with small amounts of data but they seem to be doing better when we have
large amounts of data. At one million labeled examples we can often
be between zero and 30% better than the equivalent linear model. Again these are just these examples with logistic
regression and linear regression but that crossover seems to be robust and somewhere
between 100,000 and a million examples but it is dependent on the dataset. There’s only one unfortunate caveat to this
approach which is it’s quite slow. For a million paragraph size text examples
to converge that linear model takes about 30 minutes on a single CPU core. For an RNN if we use a high-powered graphics
card such as the GTX 980 it takes about two hours. That’s not too bad. Our RNN on a proper high-end graphics card
is only about four times slower at a million examples to converge than the linear model. Again this is on a basic CPU core. But if we train our RNN on just that CPU core
it takes five days. This is unfortunate because this means our
RNN on a CPU is about 250 times slower than the linear model and that’s just not going to cut it. This effectively is why we use GPUs in this
research. Here’s the cool part of the presentation. Again, an RNN when it’s being fed an input
sequence takes in the sequence and effectively learns a representation for each word. Each word goes from its identifier,
some value like ‘the’ being token 100, to a vector representation
that is learned by our model. These visualizations we’ll be showing you
are what happen when you look at what representations are learned by those models. What we’re going to do is use an algorithm
called ‘t-SNE’ to visualize these embeddings that our RNN learns. To make it a little clearer,
these are the representations learned from training on only binary sentiment analysis. We’re trying to predict whether a given
customer review, for instance, likes a product or doesn’t like a product. What we’ve done is we’ve visualized these
representations in two dimensions using t-SNE and we’ve colored each word by the average
sentiment of the reviews it appears in. What we see is a kind of axis. Again, it doesn’t correspond to any actual
aligned axis because it’s t-SNE. But we see this continuum between very negative
words and very positive words. This isn’t too surprising. A model trained on sentiment analysis learns
to separate out negative and positive words. That’s what you’d expect to happen. We can take a little look at these very positive
and very negative clusters and see that it’s grouped into very understandable words like
‘useless, waste, poorly, disappointed’ as negative. You can see some interesting stuff where again
this visualization tries to group similar things close together. We can see that it’s actually identified
even though it is a very negative grouping, it’s also identified ‘returned, returning,
returns, return’ all together as well. That’s interesting because it seems to know
that ‘returned’ and return related words are very negative unsurprisingly if you find
them in a review. But it’s also separated them out slightly
from other more generic words. Then on the positive side we also see very
unsurprising indicators of happy sentiments. So ‘fantastic, wonderful, and pleased’. But what’s even more interesting about this
model is that we see other forms of grouping and structure being learned. We see that it pulls out for instance, quantities
of time; weeks, months, hours, minutes. We also see that it pulls out qualifiers like
‘really, absolutely, extremely, totally’. Again, qualifiers are interesting because
they are by themselves neutral. They don’t necessarily indicate positive or
negative sentiment; instead they modify it. You can have ‘extremely good’ and ‘extremely
bad’. You see that being pulled out together. You also see product nouns, for instance things
that products could be, things that are products like movies, books, stories, items, devices
are also grouped together. Additionally, punctuation is grouped together. This is indicative potentially of our model
learning to use these kinds of data which again implies that our model may actually
be learning to use some of the structure present in the data. Grouping punctuation together and learning
similar representations for it implies that the model is finding some use for it. We would expect again punctuation to be quite
useful for segmenting out and separating out meanings and notions. Quantities of time are interesting. They are slightly negatively associated which
is understandable when you talk about, ‘this product took months to show up’ or, ‘it
worked for a total of an hour’. Again, grouping them all together implies
some use of it and the same thing with qualifiers. We have no true evidence at least in this
picture of these words being used but by learning similar representations and by having them
grouped together it implies it’s finding a use for them. We can extrapolate from there that it may
in fact be learning to use these words in natural ways for sentiment analysis. Again, this is learned purely from zero and
one binary indicator variables. This is a bit like seeing a sequence of numbers,
1,000, 2,000, 3046, five and then realizing that tokens five and one thousand are exclamation
point and period. They’re similar to tokens 2,000 and 7,000
which are comma and colon. This is a very strong result and very interesting
to see this kind of similarity being learned by our model. This is cool but how can we actually use these
models? We’re also presenting today a basic library
to allow developers to use these recurrent neural networks for text analysis. It’s called Passage and it’s a tiny library
built on top of the great Theano machine learning framework [INAUDIBLE] math library. It’s incredibly alpha; we’re working on it
but it has a variety of features. We’re going to walk through now an example
of how to use Passage. This is Passage. It’s clonable via GitHub and it has a variety
of tools to make this useful. This is a little example we’re going to walk
through and explain real quickly on how we can use Passage to do analysis of text. We need to import the components that are
necessary. One of these is the tokenizer which is a way
of taking strings of text and separating them out into the individual tokens which would
be words and punctuation, for instance. A tokenizer can just be instantiated. It has a variety of parameters but has sensible
defaults. What we do is emulate a scikit-learn-style interface. We can call fit_transform on a body of training
text which would be again a list of strings for instance and that would return a list
of these training tokens which can be used natively by Passage to train RNN models. Additionally, we’re going to import the various
layers of a model. We have that embedding matrix we talked about,
the gated recurrent unit, and a dense output classifier. The way that we compose these into a training
model is by stacking them together in a list. Our input is one of these embedding matrices
and we’re going to set it to have 128 dimensions. We need to know how many of these features
to learn, how many of these tokens there are, and we’re going to just pull that out of how
many our tokenizer decided we needed. Then we’re going to use one of these gated
recurrent layers, in this case setting its size to 128. The sizes are sometimes smaller than you would
use for actual models and you can see better performance from larger models, for instance,
but these are small enough to be run on a CPU and not take forever. They’ll still take quite a while though. Finally, we have our dense output layer which,
if we were doing binary sentiment classification, detecting whether a string of text is negative
or positive, would be one unit because we’d be predicting one
value. We would use a sigmoid activation as a way
of quickly separating out negative and positive values. Then to make this model we instantiate it
through the model class, RNN, which is importable from passage.models. We give it the layers we want to build our
custom architecture out of and we tell it what cost function we want to optimize. The cost function is the effective function
that lets us train this model. It’s just a way of telling the model how good
was this, how well did you do on this example, effectively. For binary classification we use binary [INAUDIBLE]
in this example. To train this model we just call the fit interface
which takes in training tokens which are made from training text and also takes in the training
labels we want to predict given those training texts. Then once that model has been trained… It
should be noted this only trains for one iteration through your dataset. As mentioned earlier, you may want to train
for multiple iterations if for instance your model hasn’t converged and you may want
to measure your performance on hold out data to know when to stop training a model if it
begins to overfit. Right now we’ve left that part to you but
we will be extending this to have interfaces to automatically do this. Finally, if you want to have your model then
predict on new data you can just call model.predict on tokenizer.transform of
test text, and this will return the model’s predictions for the new data. That’s an example of how to use Passage. To summarize, RNNs are now a potentially competitive
tool in certain situations for text analysis. Admittedly there are a lot of disclaimers
there but a general trend seems to be emerging, which is that if you
have a large, for instance, million-plus example dataset and you have a GPU they can look quite
good. They potentially can outperform linear models
and might not take all that much longer. But if you have a smaller dataset and don’t
have a GPU it can be very difficult to justify despite how cool these models might seem compared
to linear models. They are a lot slower, they have a lot of
complexity, a lot of different parts and a lot of different architectures you can change
and they seem to have poor generalization results with small datasets. Thanks for listening. If you have any questions you can let me know
at Alec at Indico dot I-O. Also if you’d like to see a more general introduction
to machine learning and deep learning in Python, I have another video that you can check out
in the upper right, which introduces, for a Python developer, how to use the awesome Theano
library to implement these algorithms yourself. Additionally if you’d like to check out and
learn more about Indico feel free to visit our website at Indico dot I-O where we have
various tools like Passage available for developers to use for machine learning. Thanks.
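
The step the talk leaves to the reader, training for several iterations and watching held-out data to decide when to stop, might be sketched roughly as follows. This is only an illustration in plain Python and none of it is Passage's API: TinyLogisticModel, fit_one_epoch, loss, fit_with_early_stopping and the patience parameter are hypothetical names, and the stand-in model is a bag-of-tokens logistic regression scored with binary cross-entropy (an assumed, common choice for binary labels), not an RNN.

```python
import math


class TinyLogisticModel:
    """Hypothetical stand-in for an RNN: one weight per token id plus a bias."""

    def __init__(self, n_features, lr=0.5):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def _prob(self, tokens):
        # Sigmoid of the summed token weights: probability of the positive class.
        z = self.b + sum(self.w[t] for t in tokens)
        return 1.0 / (1.0 + math.exp(-z))

    def fit_one_epoch(self, X, y):
        # One pass of stochastic gradient descent over the training set.
        for tokens, label in zip(X, y):
            err = self._prob(tokens) - label
            self.b -= self.lr * err
            for t in tokens:
                self.w[t] -= self.lr * err

    def loss(self, X, y):
        # Mean binary cross-entropy, clipped to avoid log(0).
        eps = 1e-12
        total = 0.0
        for tokens, label in zip(X, y):
            p = min(max(self._prob(tokens), eps), 1.0 - eps)
            total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
        return total / len(X)


def fit_with_early_stopping(model, train, held_out, max_epochs=50, patience=3):
    """Train for several epochs; stop once held-out loss stops improving."""
    X_tr, y_tr = train
    X_ho, y_ho = held_out
    best_loss, best_epoch, history = float("inf"), 0, []
    for epoch in range(1, max_epochs + 1):
        model.fit_one_epoch(X_tr, y_tr)
        ho_loss = model.loss(X_ho, y_ho)
        history.append(ho_loss)
        if ho_loss < best_loss:
            best_loss, best_epoch = ho_loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement on held-out data for `patience` epochs
    return best_epoch, history


# Toy data: token ids 0 and 1 mark positive texts, 2 and 3 negative ones.
train = ([[0, 1], [2, 3], [0], [2]], [1, 0, 1, 0])
held_out = ([[1], [3]], [1, 0])
model = TinyLogisticModel(n_features=4)
best_epoch, history = fit_with_early_stopping(model, train, held_out, max_epochs=5)
print(best_epoch, [round(h, 3) for h in history])
```

The loop only reports which epoch looked best on the held-out data; in practice you would probably also keep a copy of the model from that epoch, since later epochs may already have begun to overfit.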

16 thoughts on “General Sequence Learning using Recurrent Neural Networks”

  • February 21, 2015 at 6:40 am

    How did you regularize the RNN?

  • April 7, 2015 at 8:52 pm

    Alec, how on earth are you so succinct?  Your presentations are phenomenal!  I learn so much so quickly, and I eagerly await future videos from you and indico.

  • April 20, 2015 at 1:38 pm

    nice speech

  • May 25, 2015 at 10:02 pm

    Fantastic explanation of RNN use for text classification. Thanks!

  • June 18, 2015 at 6:06 am

    clear and useful tutorial, thanks!

  • October 27, 2015 at 7:08 pm

    Thank you, but in minute 3:00, the subtitle says "R&N" shouldn't be "RNN"?

  • November 13, 2015 at 2:18 am

    REALLY GREAT and comprehensive tutorial! thanks!

  • November 24, 2015 at 8:40 pm

    Very useful tutorial, many thanks.

  • December 23, 2015 at 8:40 am

    very nice tutorial, more clear than all the tutorials I've seen on the web, thanks!

  • March 2, 2016 at 6:47 am

    Great explanation! Although, it should be at least mentioned that minimizing training error does not necessarily converge towards real objective minima (aka generalization). The vertical axis in his plots (at least at 20:13) show "Training Loss", it might be that authors actually showed loss decreasing on a held out set.

  • March 27, 2016 at 1:38 am

    Is the TSNE on the word embeddings?

  • May 5, 2016 at 7:26 am

    I saw your video, it's clear and so helpful, thank you. I wish you could propose code in MATLAB for the RNN 🙂

  • May 16, 2016 at 9:59 pm

    Awesome explanation. These type of videos are highly needed as a starter before delving into more detailed papers. Create more of them on Deep NLP.

  • August 9, 2016 at 4:56 am

    Very clear visualization about how RNN works, thanks a lot!

  • August 9, 2016 at 7:41 am

    Good video as the starting to learn RNN. For the visualization of the words colored by average sentiment. What do you mean by average sentiment of one word? We can obtain the vector representation of each word after the training procedure for sentiment classification, but how to get the sentiment of each word for one sentence? Thanks!!

  • April 26, 2017 at 2:18 am

    I paused after watching half, and this is very clearly explained. Well done

