LSTM Adventures

Following this as a guide: http://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/
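For context: all of the models below assume the character-level data prep from that guide, where the text gets chopped into 100-character windows, X is the normalized window and y is the one-hot next character. Roughly (the file name and variable names here are my assumptions), something like:

import numpy as np
from keras.utils import np_utils

# load and lowercase the corpus (file name is an assumption)
raw_text = open("wonderland.txt").read().lower()
chars = sorted(set(raw_text))
char_to_int = {c: i for i, c in enumerate(chars)}

seq_length = 100
dataX, dataY = [], []
for i in range(len(raw_text) - seq_length):
    dataX.append([char_to_int[c] for c in raw_text[i:i + seq_length]])  # 100-char window
    dataY.append(char_to_int[raw_text[i + seq_length]])                 # the character that follows it

# reshape to [samples, time steps, features], normalize, and one-hot the targets
X = np.reshape(dataX, (len(dataX), seq_length, 1)) / float(len(chars))
y = np_utils.to_categorical(dataY)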

1: Started by training a single-layer LSTM

import keras
import keras.layers as kl   # kl is shorthand for keras.layers

model = keras.models.Sequential()
model.add(kl.LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(kl.Dropout(0.2))
model.add(kl.Dense(y.shape[1], activation="softmax"))
model.compile(loss="categorical_crossentropy", optimizer="adam")
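Training was then just a fit call for the 20 epochs mentioned below; something like this (the batch size and the checkpoint callback are assumptions borrowed from the guide):

from keras.callbacks import ModelCheckpoint

# save weights whenever the training loss improves, so the best run can be reloaded for sampling
checkpoint = ModelCheckpoint("weights-{epoch:02d}-{loss:.4f}.hdf5", monitor="loss", save_best_only=True)
model.fit(X, y, epochs=20, batch_size=128, callbacks=[checkpoint])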

results:

  • final loss after 20 epochs: 1.9054
  • time to train (approx): 1hr
  • mostly gibberish words, but word lengths and spacing look right!
  • can handle opening and closing quotation marks
  • sometimes gets ” ‘?’, said the . “

the mort of the sorts!’ she katter wept on, ‘and toene io the doer wo thin iire.’

‘io she mo tee toete of ther ’ou ’ould ’ou ’ould toe tealet ’our majesty,’ the match hare seid to tee jury, the was aoling to an the sooeo.

‘he d crust bi iele at all,’ said the mock turtle.

‘ie doersse toer miter ’hur paae,’ she mick turtle replied, ‘in was a little soiee an in whnl she firl th the kook an in oare

the rieet hor ane the was so tea korte of the sable, bnt the hodrt was no kently and shint oo the gan on the goor, and whnn hes lene the rueen sas so aeain, and whnn she gad been to fen ana tuiee oo thin shaee th the crrr asd then the was no toeeen at the winte tabbit wat she wiite rabbit wat she mitt of the gareen, and she woole tee whst her al iere at she could,
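For reference, these samples (and the ones below) were generated the way the guide does it: seed the network with a random 100-character window, predict the most likely next character, slide the window forward, and repeat. A rough sketch, assuming the char_to_int / dataX prep from above:

import numpy as np

int_to_char = {i: c for c, i in char_to_int.items()}

# pick a random seed sequence from the training data
pattern = list(dataX[np.random.randint(len(dataX))])
generated = []
for _ in range(1000):
    x = np.reshape(pattern, (1, len(pattern), 1)) / float(len(chars))
    index = np.argmax(model.predict(x, verbose=0))   # most likely next character
    generated.append(int_to_char[index])
    pattern = pattern[1:] + [index]                  # slide the window forward
print("".join(generated))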

*note: I tried this again with a 512-unit single-layer LSTM; the results were along these lines:

‘i con’t sein mo,’ said alice, ‘ho would ba au all aare an ierse then and what io the bane af in eine th the bare aadi in the cinrsn, and she was soit ion hor horo aerir the was oo the cane and the was oo the tanl oa teing the was soie in her hand

and saed to herself, ‘oh tou dane the boomer ’fth the semeen at the soiee tf the bareee and see seat in was note go the pine afd roe banl th the grore of the grureon,

Training time was slightly longer, and the results are not meaningfully different. One observation here is that training for 20 epochs might be too short…

2: Added another LSTM layer

model = keras.models.Sequential()
# return_sequences=True so the second LSTM layer receives the full output sequence
model.add(kl.LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(kl.Dropout(0.2))
model.add(kl.LSTM(256))
model.add(kl.Dropout(0.2))
model.add(kl.Dense(y.shape[1], activation="softmax"))
model.compile(loss="categorical_crossentropy", optimizer="adam")

results:

  • final loss after 20 epochs: 1.5144
  • time to train (approx): 2hrs
  • seems to fall into loops
  • closer to ‘English’ words

and mowte it as all, and i dan see the that was a little boom as the soog of the sable it as all. and the pueen was a little boowle and the thing was she was a little boowle and the that was a little boowle and the that was a little boowse and the thing was the white rabbit say off and the thing was she was a little boowle and the that was a little boowle and the that was a little boowse and the thing was the white rabbit say off and the thing was she was a little boowle and the that was a little boowle and the that was a little boowse and the thing was the white rabbit say off and the thing was she was a little boowle and the that was a little boowle

3: Trying again with 3 LSTM layers

model = keras.models.Sequential()
model.add(kl.LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(kl.Dropout(0.2))
model.add(kl.LSTM(256, return_sequences=True))
model.add(kl.Dropout(0.2))
model.add(kl.LSTM(256))
model.add(kl.Dropout(0.2))
model.add(kl.Dense(y.shape[1], activation="softmax"))
model.compile(loss="categorical_crossentropy", optimizer="adam")

results:

  • final loss after 20 epochs (took much longer!):
  • time to train (approx): 2.5 hrs
  • real words, better sentence structure… but it’s also falling into pretty small loops.
  • Different seed text makes the first line quite different, but it quickly devolves into the same loop of the hatter and the mock turtle going back and forth.

at the way that the was to tee the court.

‘i should like the dormouse say “ said the king as he spoke. 
‘i don’t know what it moog and she beginning of the sea,’ the hatter went on, ‘i’ve sat a little sable. 
‘i don’t know what you con’t think it,’ said the mock turtle.

‘i don’t know what it moog and she beginning of the sea,’ the hatter went on, ‘i’ve sat a little sable. 
‘i don’t know what you con’t think it,’ said the mock turtle.

‘i don’t know what it moog and she beginning of the sea,’ the hatter went on, ‘i’ve sat a little sable. 
‘i don’t know what you con’t think it,’ said the mock turtle.

4: Single-layer GRU (for comparison with the single-layer LSTM)

model = keras.models.Sequential()
model.add(kl.GRU(256, input_shape=(X.shape[1], X.shape[2])))
model.add(kl.Dropout(0.2))
model.add(kl.Dense(y.shape[1], activation="softmax"))
model.compile(loss="categorical_crossentropy", optimizer="adam")

results:

  • final loss (20 epochs): 1.87897 (approx. the same as the 1-layer LSTM)
  • time to train: 40 minutes (faster!)

tone of the hoore of the house, and the whrt huon a little crrree the sabbit say an in oish of the was soiereing to be a gooa tu and gerd an all, and she whrt huow foonnge to teyerke to herself, ‘it was the tait wuinten thi woils oottle goos,’

‘hht ser what i cen’t tamling,’ said alice, ‘ih’ le yhit herseree to teye then.’

‘i shanl sht me wiitg,’ said alice,

‘what ie y sheu wasy loteln, and teed the porers oaser droneuse toe thet, and the tored harden wo cnd too fuor of the word the was soteig of the sintle the white rabbit, and the tored hard a little shrenge the fab foott of the was sotengi th theee th the was soe tabdit and ger fee toe pame oo her loot ou teie the was soee i rher to thi kittle gorr, and the tar so tae it the lad soe th the waited to seterker

Still gibberish, though with more English words. Training for more than 20 epochs might be the key!

Here’s a graph of the loss dropoff over time

alice_loss.png

 


Cube, the chat bot

My most recent project (besides surviving the 2013 Summer of Math) has been writing a chat bot for the room my friends and I hang out in. It’s called Cube, and it’s written in Python (using sleekxmpp).

Here’s a GitHub link

Cube is a Markov-chain, pseudo-random text-generating bot that can say some pretty ridiculous things.

A Markov chain takes input text – in this case, directly from our human speech in the chat room – and ‘slices’ it up into key-value pairs, where the key is a 2-word tuple that points to a single-word value. Cube appends END tokens so we know where to start and stop generating, and stores the resulting Python dict.

For example, if someone were to say “The quick brown fox”, it would become:

END the -> quick
the quick -> brown
quick brown -> fox
brown fox -> END

Over time, there’s overlap in our speech, and the overlapping slices accumulate in the dict. For instance, if someone were to then say “The quick brown cow”, our updated dict would have

END the -> quick
the quick -> brown
quick brown -> {fox, cow}
brown fox -> END
brown cow -> END
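Here’s a minimal sketch of that slicing/accumulating step. The function name and the END sentinel are my shorthand, not Cube’s actual code:

END = "\x00END"   # sentinel marker; the real bot's token may differ

def add_to_chain(chain, sentence):
    # slice a sentence into (2-word key) -> next-word entries, accumulating values
    words = [END] + sentence.lower().split() + [END]
    for i in range(len(words) - 2):
        chain.setdefault((words[i], words[i + 1]), []).append(words[i + 2])

chain = {}
add_to_chain(chain, "The quick brown fox")
add_to_chain(chain, "The quick brown cow")
# chain[("quick", "brown")] is now ["fox", "cow"]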

When prompted to generate a sentence, we start a Markov chain with a random key whose first token is END and follow it, two words at a time, until we encounter another END token. If there are multiple values for a given key tuple, we choose one at random. Each new word is appended to the list of words that will make up the bot-generated text.

It’s worth mentioning at this point that the relative frequency of a given next word is effectively stored in the dict as well, as duplicate values. For instance, if we had “I am fat”, “I am fat”, and “I am hungry”, the value stored would be:

I am -> {fat, fat, hungry}

So there would be about a 67% chance of selecting “fat” and a 33% chance of selecting “hungry”.
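Generation is then just a walk over that dict, and random.choice over the (possibly duplicated) value list gives exactly that weighting for free. A sketch, assuming the add_to_chain/END pieces above:

import random

def generate(chain):
    # start from a random key whose first token is END, i.e. a sentence opener
    key = random.choice([k for k in chain if k[0] == END])
    words = [key[1]]
    while True:
        nxt = random.choice(chain[key])   # duplicates make common continuations more likely
        if nxt == END:
            return " ".join(words)
        words.append(nxt)
        key = (key[1], nxt)

print(generate(chain))   # "the quick brown fox" or "the quick brown cow"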

That’s the basic idea behind Markov chains. Cube, however, does this slightly differently.

When a new sentence comes in, Cube actually saves two Markov chains: one forwards (like we’ve covered) and one in reverse. So the sentence “the quick brown fox” becomes the forward dict from above, as well as:

END fox -> brown
fox brown -> quick
brown quick -> the
quick the -> END

These two Markov chains, forward and reverse, are stored in two separate dicts, which makes the whole model a little trickier to visualize, but bear with me. Now, when we go to generate a sentence, we can start from any word in the corpus and run a Markov chain in both directions until each side reaches an END token. This allows for much greater variance in generated text, because there are obviously many more words in the corpus than there are words-preceded-by-END-tokens.
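Put together, the bidirectional version might look something like this. Again just a sketch under the same assumptions (add_to_chain and END from the earlier snippet), and the way Cube actually picks its starting key may differ:

import random

forward, reverse = {}, {}

def add_sentence(sentence):
    add_to_chain(forward, sentence)
    add_to_chain(reverse, " ".join(reversed(sentence.split())))

def walk(chain, key):
    # follow a chain from key until END, returning the words visited
    words = []
    while True:
        nxt = random.choice(chain[key])
        if nxt == END:
            return words
        words.append(nxt)
        key = (key[1], nxt)

def generate_around(word):
    # pick a forward key that ends with the seed word...
    a, w = random.choice([k for k in forward if k[1] == word])
    right = walk(forward, (a, w))          # ...walk forwards to an END token...
    if a == END:
        return " ".join([w] + right)       # the seed happened to start a sentence
    left = walk(reverse, (w, a))           # ...and walk the reverse chain back to an END token
    return " ".join(list(reversed(left)) + [a, w] + right)

add_sentence("the quick brown fox")
print(generate_around("brown"))   # "the quick brown fox"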

Over time, Cube starts to say some very interesting things indeed.

viraj: cube
cube: the agent started taking out his badge and had anyone used the criticisms box to agree with your fingers