RNNs tutorial¶
In [1]:
# we assume that we have the dynet module in your path.
import dynet as dy
An LSTM/RNN overview:¶
An (1-layer) RNN can be thought of as a sequence of cells, \(h_1,...,h_k\), where \(h_i\) indicates the time dimenstion.
Each cell \(h_i\) has an input \(x_i\) and an output \(r_i\). In addition to \(x_i\), cell \(h_i\) receives as input also \(r_{i-1}\).
In a deep (multi-layer) RNN, we don’t have a sequence, but a grid. That is we have several layers of sequences:
- \(h_1^3,...,h_k^3\)
- \(h_1^2,...,h_k^2\)
- \(h_1^1,...h_k^1\),
Let \(r_i^j\) be the output of cell \(h_i^j\). Then:
The input to \(h_i^1\) is \(x_i\) and \(r_{i-1}^1\).
The input to \(h_i^2\) is \(r_i^1\) and \(r_{i-1}^2\), and so on.
The LSTM (RNN) Interface¶
RNN / LSTM / GRU follow the same interface. We have a “builder” which is in charge of creating definining the parameters for the sequence.
In [2]:
pc = dy.ParameterCollection()
NUM_LAYERS=2
INPUT_DIM=50
HIDDEN_DIM=10
builder = dy.LSTMBuilder(NUM_LAYERS, INPUT_DIM, HIDDEN_DIM, pc)
# or:
# builder = dy.SimpleRNNBuilder(NUM_LAYERS, INPUT_DIM, HIDDEN_DIM, pc)
Note that when we create the builder, it adds the internal RNN
parameters to the ParameterCollection
. We do not need to care about
them, but they will be optimized together with the rest of the network’s
parameters.
In [3]:
s0 = builder.initial_state()
In [4]:
x1 = dy.vecInput(INPUT_DIM)
In [5]:
s1=s0.add_input(x1)
y1 = s1.output()
# here, we add x1 to the RNN, and the output we get from the top is y (a HIDEN_DIM-dim vector)
In [6]:
y1.npvalue().shape
Out[6]:
(10,)
In [7]:
s2=s1.add_input(x1) # we can add another input
y2=s2.output()
If our LSTM/RNN was one layer deep, y2 would be equal to the hidden state. However, since it is 2 layers deep, y2 is only the hidden state (= output) of the last layer.
If we were to want access to the all the hidden state (the output of
both the first and the last layers), we could use the .h()
method,
which returns a list of expressions, one for each layer:
In [8]:
print(s2.h())
(expression 47/0, expression 62/0)
The same interface that we saw until now for the LSTM, holds also for the Simple RNN:
In [9]:
# create a simple rnn builder
rnnbuilder=dy.SimpleRNNBuilder(NUM_LAYERS, INPUT_DIM, HIDDEN_DIM, pc)
# initialize a new graph, and a new sequence
rs0 = rnnbuilder.initial_state()
# add inputs
rs1 = rs0.add_input(x1)
ry1 = rs1.output()
print("all layers:", s1.h())
all layers: (expression 19/0, expression 32/0)
In [10]:
print(s1.s())
(expression 17/0, expression 30/0, expression 19/0, expression 32/0)
To summarize, when calling .add_input(x)
on an RNNState
what
happens is that the state creates a new RNN/LSTM column, passing it: 1.
the state of the current RNN column 2. the input x
The state is then returned, and we can call it’s output()
method to
get the output y
, which is the output at the top of the column. We
can access the outputs of all the layers (not only the last one) using
the .h()
method of the state.
``.s()`` The internal state of the RNN may be more involved than
just the outputs \(h\). This is the case for the LSTM, that keeps an
extra “memory” cell, that is used when calculating \(h\), and which
is also passed to the next column. To access the entire hidden state, we
use the .s()
method.
The output of .s()
differs by the type of RNN being used. For the
simple-RNN, it is the same as .h()
. For the LSTM, it is more
involved.
In [11]:
rnn_h = rs1.h()
rnn_s = rs1.s()
print("RNN h:", rnn_h)
print("RNN s:", rnn_s)
lstm_h = s1.h()
lstm_s = s1.s()
print("LSTM h:", lstm_h)
print("LSTM s:", lstm_s
)
RNN h: (expression 70/0, expression 72/0)
RNN s: (expression 70/0, expression 72/0)
LSTM h: (expression 19/0, expression 32/0)
LSTM s: (expression 17/0, expression 30/0, expression 19/0, expression 32/0)
As we can see, the LSTM has two extra state expressions (one for each hidden layer) before the outputs h.
Extra options in the RNN/LSTM interface¶
Stack LSTM The RNN’s are shaped as a stack: we can remove the top
and continue from the previous state. This is done either by remembering
the previous state and continuing it with a new .add_input()
, or
using we can access the previous state of a given state using the
.prev()
method of state.
Initializing a new sequence with a given state When we call
builder.initial_state()
, we are assuming the state has random /0
initialization. If we want, we can specify a list of expressions that
will serve as the initial state. The expected format is the same as the
results of a call to .final_s()
. TODO: this is not supported yet.
In [12]:
s2=s1.add_input(x1)
s3=s2.add_input(x1)
s4=s3.add_input(x1)
# let's continue s3 with a new input.
s5=s3.add_input(x1)
# we now have two different sequences:
# s0,s1,s2,s3,s4
# s0,s1,s2,s3,s5
# the two sequences share parameters.
assert(s5.prev() == s3)
assert(s4.prev() == s3)
s6=s3.prev().add_input(x1)
# we now have an additional sequence:
# s0,s1,s2,s6
In [13]:
s6.h()
Out[13]:
(expression 207/0, expression 222/0)
In [14]:
s6.s()
Out[14]:
(expression 205/0, expression 220/0, expression 207/0, expression 222/0)
Aside: memory efficient transduction¶
The RNNState
interface is convenient, and allows for incremental
input construction. However, sometimes we know the sequence of inputs in
advance, and care only about the sequence of output expressions. In this
case, we can use the add_inputs(xs)
method, where xs
is a list
of Expression.
In [15]:
state = rnnbuilder.initial_state()
xs = [x1,x1,x1]
states = state.add_inputs(xs)
outputs = [s.output() for s in states]
hs = [s.h() for s in states]
print(outputs, hs)
[expression 226/0, expression 230/0, expression 234/0] [(expression 224/0, expression 226/0), (expression 228/0, expression 230/0), (expression 232/0, expression 234/0)]
This is convenient.
What if we do not care about .s()
and .h()
, and do not need to
access the previous vectors? In such cases we can use the
transduce(xs)
method instead of add_inputs(xs)
. transduce
takes in a sequence of Expression
s, and returns a sequence of
Expression
s. As a consequence of not returning RNNState
s,
trnasduce
is much more memory efficient than add_inputs
or a
series of calls to add_input
.
In [16]:
state = rnnbuilder.initial_state()
xs = [x1,x1,x1]
outputs = state.transduce(xs)
print(outputs)
[expression 238/0, expression 242/0, expression 246/0]
Character-level LSTM¶
Now that we know the basics of RNNs, let’s build a character-level LSTM language-model. We have a sequence LSTM that, at each step, gets as input a character, and needs to predict the next character.
In [17]:
import random
from collections import defaultdict
from itertools import count
import sys
LAYERS = 1
INPUT_DIM = 50
HIDDEN_DIM = 50
characters = list("abcdefghijklmnopqrstuvwxyz ")
characters.append("<EOS>")
int2char = list(characters)
char2int = {c:i for i,c in enumerate(characters)}
VOCAB_SIZE = len(characters)
In [18]:
pc = dy.ParameterCollection()
srnn = dy.SimpleRNNBuilder(LAYERS, INPUT_DIM, HIDDEN_DIM, pc)
lstm = dy.LSTMBuilder(LAYERS, INPUT_DIM, HIDDEN_DIM, pc)
# add parameters for the hidden->output part for both lstm and srnn
params_lstm = {}
params_srnn = {}
for params in [params_lstm, params_srnn]:
params["lookup"] = pc.add_lookup_parameters((VOCAB_SIZE, INPUT_DIM))
params["R"] = pc.add_parameters((VOCAB_SIZE, HIDDEN_DIM))
params["bias"] = pc.add_parameters((VOCAB_SIZE))
# return compute loss of RNN for one sentence
def do_one_sentence(rnn, params, sentence):
# setup the sentence
dy.renew_cg()
s0 = rnn.initial_state()
R = params["R"]
bias = params["bias"]
lookup = params["lookup"]
sentence = ["<EOS>"] + list(sentence) + ["<EOS>"]
sentence = [char2int[c] for c in sentence]
s = s0
loss = []
for char,next_char in zip(sentence,sentence[1:]):
s = s.add_input(lookup[char])
probs = dy.softmax(R*s.output() + bias)
loss.append( -dy.log(dy.pick(probs,next_char)) )
loss = dy.esum(loss)
return loss
# generate from model:
def generate(rnn, params):
def sample(probs):
rnd = random.random()
for i,p in enumerate(probs):
rnd -= p
if rnd <= 0: break
return i
# setup the sentence
dy.renew_cg()
s0 = rnn.initial_state()
R = params["R"]
bias = params["bias"]
lookup = params["lookup"]
s = s0.add_input(lookup[char2int["<EOS>"]])
out=[]
while True:
probs = dy.softmax(R*s.output() + bias)
probs = probs.vec_value()
next_char = sample(probs)
out.append(int2char[next_char])
if out[-1] == "<EOS>": break
s = s.add_input(lookup[next_char])
return "".join(out[:-1]) # strip the <EOS>
# train, and generate every 5 samples
def train(rnn, params, sentence):
trainer = dy.SimpleSGDTrainer(pc)
for i in range(200):
loss = do_one_sentence(rnn, params, sentence)
loss_value = loss.value()
loss.backward()
trainer.update()
if i % 5 == 0:
print("%.10f" % loss_value, end="\t")
print(generate(rnn, params))
Notice that: 1. We pass the same rnn-builder to do_one_sentence
over
and over again. We must re-use the same rnn-builder, as this is where
the shared parameters are kept. 2. We dy.renew_cg()
before each
sentence – because we want to have a new graph (new network) for this
sentence. The parameters will be shared through the model and the shared
rnn-builder.
In [19]:
sentence = "a quick brown fox jumped over the lazy dog"
train(srnn, params_srnn, sentence)
149.1241455078 czdz
106.8910675049 vvaleamkxnmgiehjqziydnqw wvdyqvocv
70.4573440552
45.8460693359 k upcel romw ufec dbo tzhw
23.2288417816 p g
10.3553504944 d slick boxwn xox
3.4879603386 a quick browncfox jumped oher dhe lazy dog
1.2615398169 a quick brown fox jumped over the lazy dog
0.8019771576 q quick brown fox jumped kver the lazy dog
0.5865874290 a quick brown fox jumped over the lazy dog
0.4615180790 a quick brown fox jumped over the lazy dog
0.3798972070 a quick brown fox jumpgd over the lazy dog
0.3224956095 a quick brown fox jumped over the lazy dog
0.2799601257 a quick brown fox jumped ovec the lazy dog
0.2471984029 a quick brown fox jumped over the lazy dog
0.2212039977 a quick brown fox jumped over ehe lazy dog
0.2000847459 a quick brown fox jumped over the lazy dog
0.1825926602 a quick brown fox jumped over the lazy dog
0.1678716689 a quick brown fox jumped over the lazy dog
0.1553139985 a quick brown fox jumped over the lazy dog
0.1444785446 a quick brown fox jumped over the lazy dog
0.1350349337 a quick brewn fox jumped over the lazy dog
0.1267340034 a quick brown fox jumped over the lazy dog
0.1193781197 a quick brown fox jumped over the lazy dog
0.1128196493 a quick brown fox jumped over the lazy dog
0.1069323123 a quick brown fox jumped over the lazy dog
0.1016217321 a quick brown fox jumped over the lazy dog
0.0968040079 a quick brown fox jumped over the lazy dog
0.0924172848 a quick brown fox jumped over the lazy dog
0.0884065777 a quick brown fox jumped over the lazy dog
0.0847217217 a quick brown fox jumped over the lazy dog
0.0813286975 a quick brown fox jumped over the lazy dog
0.0781916752 a quick brown fox vumped over the lazy dog
0.0752858222 a quick brown fox jumped over the lazy dog
0.0725848153 a quick brown fox jumped over the lazy dog
0.0700684935 a quick brown fox jumped over the lazy dog
0.0677180886 a quick brown fox jumped over the lazy dog
0.0655169189 a quick brown fox jumped over the lazy dog
0.0634531230 a quick brown fox jumped over the lazy dog
0.0615142360 a quick brown fox jumped over the lazy dog
In [20]:
sentence = "a quick brown fox jumped over the lazy dog"
train(lstm, params_lstm, sentence)
143.8308105469 rsfmqxt ozqsiiaqa
128.6366729736 v ypvoobknwepfeply
121.5253829956 ymquqr wfmocoe ovwuwfmhdm ueod yewe
115.6775283813 q thluk nwzwz eoodzod
109.5644912720 kxni bpuch xj enu mr ung omj dp eevem r vyyd t p lt oyqbr
102.8272857666 bqkbhfmnb o mppoeoegoyt ddl rusay l da
98.4713058472 qosqhn poafr of uhexedoo pe h etavopyd pyiy d o yee al slghh
90.9259567261 a a qakmbn bm qcoayper efoyeroldpddm
85.2885818481 jku u upoowwbj jvdemmdfdeduree ood oogdogpq dlto y agzog i g gdlzac fokn ux po opu uvlrr e eer rae ed ogy oel olzz
76.7979202271
71.5208969116 q bauqn kkowcon ffojjpfp ox ouvtt e lzuy dv hoty dgggo oqjgkgo oonx oxm om vee eeo o ad
62.6548461914 dc zbrb oqn xomper joehpee eztlazd lqau
56.1585731506 howunqm oofw ojpder re rezt ogavyy dogdcwo
49.3944778442 t bc qouwr rw o bo xm ojumer r ree ele azlad do
41.5289344788 h uuikr ob wox jumepd rr loy ulz do
36.3642997742 dw ucown oox mx op jadmee ther loh
30.3189773560 a qqucakb onw fn jumjee oe tee taza dzo
24.9423580170 a quackk bborn fox juumpedde ove verr azy dogg
19.8645935059 aauk qbrrr oorf fmomed oee the layy oys
15.7765054703 aukc irr brow fox jumedd oveer aauzy dog
12.4693098068 a quikc brown foo jope dve ovr lay dogy dbog
9.8480806351 ucc brown fox juxmpe ooer the llazy dog
7.8634152412 a quiic brronn fox jumed over the lazy dog
5.9515495300 a quick brown x jxx jjumed over the lzyy log
4.5509667397 a quic bronn fox jumpdd over the lazy dog
3.4462499619 a quic brown fox jumped oper the lazy dog
2.2497565746 a qgiick brown fox jumped over the lazy dog
1.7854881287 a quicr brown fox jumped over the lazy dog
1.4716305733 a quick brown n fumpped over the lazy dog
1.2449830770 a quick brown fox jumped over th lazy dog
1.0744248629 a quick brown fox jumped over the lazy dog
0.9419770241 a qick brown fox jumped over the lazy dog
0.8365104198 a quik brown fox jumped over the lazy dog
0.7507840395 a quick brown fox jumped over the lazy dog
0.6798918843 a quick brown fox jumped over the lazy dog
0.6204041839 a quick brown fox jumped over the lazy dog
0.5698559284 a quick brown fox jumped over the lazy dog
0.5264287591 a quick brown fox jumped over the lazy ddog
0.4887620807 a quick brown fox jumped over the lazy dog
0.4558122456 a quick brown fomx jumped over the lazy dog
The model seem to learn the sentence quite well.
Somewhat surprisingly, the Simple-RNN model learn quicker than the LSTM! (If you increase the number of layers, this difference will be even more pronounced)
How can that be?
The answer is that we are cheating a bit. The sentence we are trying to learn has each letter-bigram exactly once. This means a simple trigram model can memorize it very well.
Try it out with more complex sequences.
In [21]:
train(srnn, params_srnn, "these pretzels are making me thirsty")
239.8676452637 a quick brown fox jumped over the lazy dog
160.0521697998 a quick lrojny la tumped over the lazy driwn lrlazthe lazmzover the lazy dvgrn zrldnthd qheed gex thed bmne the lazy dbqre lazd yrd nrkwn fox jumped over the lazy dhe lazy dog
95.4504623413 a quick brzy e lazy e greaze erg
63.7105903625 ahqne lretze n mve lbmed tizd mare a
40.0441474915 aaquicpr the laz
21.2409267426 aze ecpretztli arirltzixtoped bt
7.8475003242 these pritzels pre maling me thirsty
2.9435257912 these pretz ls whe taking me thirsty
1.0504565239 these pretzels are making me thirsty
0.4977459908 these pretzels are making me the mty
0.3605296910 these pretz
0.2839964628 these pretzels are making me ghirsty
0.2345837206 these pretzels are making me thirsty
0.1999202222 these pretzels are making me thirsty
0.1742218733 these pretzels are making me thirsty
0.1543948352 these pretzels are making me thirsty
0.1386296600 these pretzels are making me thirsty
0.1257905215 these pretzels are making me thirsty
0.1151323095 these pretzels are making me thirsty
0.1061392874 tmese pretzels are making me thirsty
0.0984523371 these pretzels are making me thirsty
0.0918025821 these pretzels are making me thirsty
0.0859967992 these pretzels are making me thirsty
0.0808787867 these pretzels are making me thirsty
0.0763375387 these pretzels are making me thirsty
0.0722793266 these pretzels are making me thirsty
0.0686300844 these pretzels are making me thirsty
0.0653314814 these pretzels are making me thirsty
0.0623353273 these pretzels are making me thirsty
0.0596007779 these pretzels are making me thirsty
0.0570969619 these pretzels are making me thirsty
0.0547946990 these pretzels are making me thirsty
0.0526688434 these pretzels are making me thirsty
0.0507033207 these pretzels are making me thirsty
0.0488782115 these pretzels are making me thirsty
0.0471798219 these pretzels are making me thirsty
0.0455954187 these pretzels are making me thirsty
0.0441129319 these pretzels are making me thirsty
0.0427242145 these pretzels are making me thirsty
0.0414196625 these pretzels are making me thirsty