Lonely Markov Chains


I came out to a feeling of ease and a hat and we caught eyes and gentle smile. I told my buddy that you want a child for your number.


This tutorial walks through creating a bot that generates Missed Connections Craigslist postings.

The end result will be a program containing three main components:

  1. a web scraper to gather the corpus required to feed the Markov chain
  2. the Markov chain generator itself
  3. integration of the bot with Twitter and other platforms

Quick Intro on Markov Chains

Markov chains are a special kind of Markov process that generate pseudo-random sequences of items. The next item of the generated sequence is determined by the current state (item), and the process is said to be "memoryless". This means the selection of the next item is repeated at every step, and the set of possible outcomes depends only on the current state, never on earlier ones.

While the Markov Process can be applied to many types of sequences, our bot is going to be dealing with sentences (sequences of words).

Simple example

To give a very brief example, suppose we have the following "data set" of two simple sentences:

My dog is happy and he likes to run. My cat is silly and he likes to eat. He is also very cute.

If we were to generate a Markov chain by hand, we would first build a list of all the unique items in this example sequence.

[my, dog, is, happy, and, he, likes, to, run, cat, silly, eat, also, very, cute].

To create new sequences from these items, and have the sequences be grammatically correct, we need to map each state to its possible subsequent items.

Using the word "my", we see that "dog" and "cat" can both come next.

my: dog, cat
dog: is
is: happy, silly, also
happy: and
and: he, he
he: likes, likes, is
likes: to, to
to: run, eat
run: my
cat: is
silly: and
eat: he
also: very
very: cute

With this state mapping, we see that it's allowed to say "is happy" or "to eat". However, according to our data set, it doesn't make sense to say "to is" or "silly he".

Additionally, we see that, given the word "he", "likes" is twice as likely to occur as the word "is".
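The hand-built mapping above can be sketched in a few lines of Python. The function name `build_mapping` and the period-stripping approach are my own choices for illustration, not something the tutorial prescribes yet:

```python
from collections import defaultdict

def build_mapping(text):
    """Map each word to the list of words that follow it in the text.

    Duplicates are kept on purpose: a word that follows "he" twice
    becomes twice as likely to be chosen when we generate sequences.
    """
    words = text.lower().replace(".", "").split()
    mapping = defaultdict(list)
    for current, following in zip(words, words[1:]):
        mapping[current].append(following)
    return mapping

corpus = ("My dog is happy and he likes to run. "
          "My cat is silly and he likes to eat. "
          "He is also very cute.")

mapping = build_mapping(corpus)
print(mapping["he"])  # "likes" recorded twice, "is" once
```

Pairing each word with its successor via `zip(words, words[1:])` is what makes the chain memoryless: each entry records only what immediately follows a word, with no memory of anything earlier.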

To construct a new sequence from our state mapping, we first select an item as the seed. The seed is simply the starting state from which our sequence will grow.

Let's choose "happy" as our seed:

happy
Now "randomly" select the next item from the states that follow "happy":

happy and

Repeat the selection for "and":

happy and he

And so on:

happy and he likes to eat he likes to run my dog is also very cute

Wow! A pseudo-randomly generated sentence!
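The walk we just did by hand can be sketched in code. Here I've hard-coded the state mapping from above; the `generate` function and its `length` cutoff are illustrative choices, not the tutorial's final design:

```python
import random

# The hand-built state mapping from the example above.
mapping = {
    "my": ["dog", "cat"], "dog": ["is"], "is": ["happy", "silly", "also"],
    "happy": ["and"], "and": ["he", "he"], "he": ["likes", "likes", "is"],
    "likes": ["to", "to"], "to": ["run", "eat"], "run": ["my"],
    "cat": ["is"], "silly": ["and"], "eat": ["he"], "also": ["very"],
    "very": ["cute"],
}

def generate(mapping, seed, length=15):
    """Grow a sequence from the seed by repeatedly picking a random follower."""
    sequence = [seed]
    current = seed
    for _ in range(length - 1):
        followers = mapping.get(current)
        if not followers:  # "cute" has no followers, so the walk ends there
            break
        current = random.choice(followers)
        sequence.append(current)
    return " ".join(sequence)

print(generate(mapping, "happy"))
```

Because `random.choice` draws uniformly from the list, the duplicates in entries like `"he": ["likes", "likes", "is"]` give "likes" its two-in-three odds automatically.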

You may have noticed some problems with our little Markov process:

  1. The words "run" and "eat" are the ends of sentences within our data set, so the words that follow them may not actually be related to their occurrence.
  2. Our generated sequence will often run in a sort of loop and spit out similar phrases. This is seen with the multiple "he likes".
  3. It sounds stupid and obviously derivative of our sample sentences!

We'll address how to solve these problems as we build our Markov bot, but I assure you the end result will generate more entertaining sentences because we'll be using a much larger and more diverse data set.
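To preview one common fix for the looping problem (not necessarily the exact approach our bot will take): use a pair of words as the state instead of a single word. A two-word state carries just enough memory to cut down on repeated phrases. A minimal sketch, with the function name `build_bigram_mapping` being my own:

```python
from collections import defaultdict

def build_bigram_mapping(text):
    """Map each pair of consecutive words to the words that follow that pair."""
    words = text.lower().replace(".", "").split()
    mapping = defaultdict(list)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        mapping[(w1, w2)].append(w3)
    return mapping

corpus = ("My dog is happy and he likes to run. "
          "My cat is silly and he likes to eat. "
          "He is also very cute.")

bigram_mapping = build_bigram_mapping(corpus)
# The state ("he", "likes") now only ever leads to "to":
print(bigram_mapping[("he", "likes")])
```

The trade-off: larger states make output more coherent but also closer to verbatim copies of the source, which is why a big, diverse corpus matters.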

In part 2 of this tutorial we'll build a web scraper to create a large data set of Missed Connections posts. This body of text will give our bot a very interesting personality.