- 14 Aug 2017
- python, markov
- #Tutorials
Scraping Craigslist to build a corpus
This is part two of the Lonely Markov series. In part one, I gave a brief example of Markov chains and how they work. In this part we'll set up a web scraper using the requests and beautifulsoup libraries to gather Missed Connections posts and build up a corpus of text to feed our bot.
Before we build our bot, it is going to need a selection of text that will be used to "train" it. We'll refer to this selection of text as the corpus. The corpus essentially provides the bot with a set of possible words and word combinations found in real-world, human-produced text. The larger the corpus, the more realistic and human our bot will appear.
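To make that more concrete, here's a rough sketch of how a tiny, made-up corpus turns into the word combinations a Markov bot can draw from. This isn't the bot we'll build later in the series, just an illustration:

```python
from collections import defaultdict

# A made-up two-sentence "corpus", purely for illustration.
corpus = "you smiled at me on the train. i smiled back at you."

# Map each word to every word that follows it somewhere in the corpus.
transitions = defaultdict(list)
words = corpus.split()
for current_word, next_word in zip(words, words[1:]):
    transitions[current_word].append(next_word)

print(transitions['smiled'])  # ['at', 'back'] -- two possible continuations
```

With only a dozen words, most words have a single possible continuation, so "generated" text would just parrot the source. With thousands of posts, every word has many continuations and the output starts to feel human.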
Luckily for us, there are tons of Missed Connection posts, and tons of different geographical regions to choose from. So let's get to it!
Although we'll be using this scraper to download Missed Connection posts, it could also be used to scrape any other type of post from craigslist.
Setting up Project Workspace
Use a virtual environment
It's always a good idea to use a virtualenv so that any libraries we install don't interfere with our system python installation.
We'll use virtualenv here, but any of the available tools will do.
If you haven't installed virtualenv yet:
pip install virtualenv
Create the virtualenv:
virtualenv lonely-markov
Activate the virtualenv:
source lonely-markov/bin/activate
Dependencies
We'll need requests to scrape and download the craigslist html. And we'll use beautifulsoup to parse the content from the html to suit our needs.
pip install requests beautifulsoup4
Project Contents
Let's set up our directory and file structure so that we're all on the same page.
imposter
├── __init__.py
├── outputs/ --> Bot generated content
├── resources/ --> Corpus files
├── config.py
├── scraper.py --> Scrapes CL for corpus content
├── main.py
├── markov.py --> generates content using corpus
└── secrets.py --> API keys, etc...
In config.py we'll set up some variables to keep track of our key directory locations:
```python
import os

PROJECT_ROOT = os.path.dirname(os.path.abspath(__file__))
RESOURCES = os.path.join(PROJECT_ROOT, 'resources')      # corpus files live here
CORPUS_FILES = os.path.join(RESOURCES, 'corpus_files')
OUTPUTS = os.path.join(PROJECT_ROOT, 'outputs')          # bot generated content
BOTS_DIR = os.path.join(PROJECT_ROOT, 'bots')
LYRICS_DIR = os.path.join(CORPUS_FILES, 'lyrics')
SCRAPED_URLS_DIR = os.path.join(RESOURCES, 'scraped_urls')
```
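None of these directories are created automatically, so you may want a few lines at the bottom of config.py to create them on first run. This is optional and just one way to do it:

```python
# Create the project directories up front so later writes don't fail
# with "No such file or directory".
for _path in (RESOURCES, CORPUS_FILES, OUTPUTS, BOTS_DIR, LYRICS_DIR, SCRAPED_URLS_DIR):
    os.makedirs(_path, exist_ok=True)
```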
Creating the Web Scraper
While we could implement the scraper as a simple script, let's create a class instead. This will allow us to provide some options that can easily be customized to control how our scraper functions. We'll start things off really simple, with something small enough to be a plain function, but a class will prove more useful as the functionality grows.
If you aren't familiar with the requests library (or any of Kenneth Reitz's other projects) you should introduce yourself. It offers a simple, clean API that makes HTTP requests as simple as calling get(), post(), etc.
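If you've never used it, fetching a page really is a one-liner (the URL below is just a placeholder):

```python
import requests

res = requests.get('https://example.com')
res.raise_for_status()    # raises an exception for 4xx/5xx responses
print(res.status_code)    # e.g. 200
print(res.text[:100])     # first 100 characters of the response body
```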
BeautifulSoup is the library we'll be using to parse out our desired content from the html that requests will fetch for us.
Let's get it started!
```python
# scraper.py
import requests
from bs4 import BeautifulSoup as bs


class CraigScraper():

    def __init__(self):
        pass
```
Fetching the content
Craigslist does not offer a public API, so we won't be able to just ask for every page and all of its data. However, the simplicity of the site makes it quite easy to fetch the desired html content and parse out the information.
We'll start by fetching the search/index page for a given category and location. To do so we'll need the search url for the query. Let's use the Missed Connections in Los Angeles.
If you select Los Angeles as the city and navigate to the Missed Connections category, you'll see the URL is https://losangeles.craigslist.org/search/mis.
Perfect, we'll just send a get request:
```python
class CraigScraper():

    def __init__(self):
        pass

    def get_mis_index(self):
        res = requests.get('https://losangeles.craigslist.org/search/mis')
        res.raise_for_status()
        return res.text
```
But wait, what if we want to change the city we scrape from? Well, if you navigate to the same category for New York, you'll see the url is https://newyork.craigslist.org/search/mis.
Aha! It's no REST API but the URL pattern is good enough for our simple task. Instead of hardcoding the url, let's have our CraigScraper fetch the index for whatever city and category we want.
```python
class CraigScraper():

    def __init__(self, city: str, category: str):
        self.city = city
        self.category = category
        self.city_url = 'https://{}.craigslist.org'.format(self.city)
        self.index_url = '{}/search/{}'.format(self.city_url, self.category)

    def get_index_page(self) -> str:
        res = requests.get(self.index_url)
        res.raise_for_status()
        return res.text
```
And to quickly test it out, add to the end of your file:
```python
if __name__ == '__main__':
    scraper = CraigScraper('losangeles', 'mis')
    text = scraper.get_index_page()
    print(text)
```
Running the program should result in a huge wall of raw html.
Parsing the HTML
So now we have the raw html for the Missed Connections search page. This page shows a list of links to individual posts. In order to actually access those links, or any other information we might need, we'll have to parse the html. Luckily for us, beautifulsoup makes this very simple (almost as simple as requests).
BeautifulSoup has some great documentation, but the general approach to parsing html goes like this:
```python
from bs4 import BeautifulSoup as bs

soup = bs(html_text, 'html.parser')
```
We can then select tags and their attributes:
```python
# find all links
links = soup.find_all('a')

# get each link's text and url
for link in links:
    text = link.get_text()
    url = link['href']

# select all links in the body using a CSS selector
body_links = soup.select('body > a')
```
Currently, the get_index_page method returns raw html. We won't really have a need to deal with raw html directly, since we'll be parsing out information with beautifulsoup4. Let's instead create a method that returns a soup object of a url's html content.
```python
def get_soup(self, url: str) -> bs:
    resp = requests.get(url)
    resp.encoding = 'UTF-8'
    resp.raise_for_status()
    return bs(resp.text, 'html.parser')
```
We'll use this method whenever we need a page's content. And since it's now a simple function call, let's turn our get_index_page into a property.
```python
@property
def index_page(self):
    return self.get_soup(self.index_url)
```
Your code should now look like this:
```python
import requests
from bs4 import BeautifulSoup as bs


class CraigScraper():

    def __init__(self, city, category):
        self.city = city
        self.category = category
        self.city_url = 'https://{}.craigslist.org'.format(self.city)
        self.index_url = '{}/search/{}'.format(self.city_url, self.category)

    def get_soup(self, url: str) -> bs:
        resp = requests.get(url)
        resp.encoding = 'UTF-8'
        resp.raise_for_status()
        return bs(resp.text, 'html.parser')

    @property
    def index_page(self):
        return self.get_soup(self.index_url)
```
And update the main block to use the new property:
```python
if __name__ == '__main__':
    scraper = CraigScraper('losangeles', 'mis')
    soup = scraper.index_page
    print(soup)
```
Running the current code with python scraper.py, you'll see output that appears to be just html, but it's actually a BeautifulSoup object that will let us parse out the information we want.
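If you want to convince yourself it's more than a string of html, swap the print(soup) in the main block for a couple of quick calls on the soup object (the exact title and link count will of course depend on what Craigslist serves you):

```python
if __name__ == '__main__':
    scraper = CraigScraper('losangeles', 'mis')
    soup = scraper.index_page
    print(soup.title.get_text())     # the page's <title> text
    print(len(soup.find_all('a')))   # number of links on the page
```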
Available on the search results page (which is the index of a category) is a list of links to individual posts.
If you open the chrome dev tools and inspect the first post, you'll see that the title is a link inside a `<p>` tag, which is itself inside a list item. So let's select all the post links using our soup object.
```python
def index_urls(self):
    for post in self.index_page.select('li > p > a'):