Scraping the Source for Information
From the previous page we have a variable source that contains the source code of the website we are trying to scrape. Let's now import Beautiful Soup, which will help us traverse the source:
from bs4 import BeautifulSoup
Great, now we can use the module! Let's create a parser object that we will use to look through the source:
parser = BeautifulSoup(source, 'html.parser')
Great, now we can use the module! Let's create a parser object that we will use to look through the source. parser is pretty self-explanatory--it is an object that contains methods you can call to look through the webpage.
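To see what the parser object gives us, here is a minimal sketch using a small made-up HTML string in place of the real page source (the tags and text below are hypothetical):

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet standing in for the real page source.
sample = "<html><head><title>15-112</title></head><body><a href='#'>Home</a></body></html>"

parser = BeautifulSoup(sample, "html.parser")

# The parser object lets us look elements up by tag name:
print(parser.title.text)      # the text inside the <title> tag
print(parser.find("a").text)  # the text of the first link on the page
```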
The quickest way to find something in our webpage is by using the tag name for the element. If we're looking for a table we use "table", if we are looking for a link we use "a", etc. Find all tag names here. The method find_all returns all elements that have a certain tag name. Once we get the result of find_all, we can check the class or text of our object, or look at its children. Let's find all of the table cells on our page and print the text inside them:
for tableRow in parser.find_all("td"):
    print(tableRow.text)
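The loop above works because find_all returns a list of matching elements, each with a .text attribute. Here is a self-contained sketch using a made-up table (the cell contents are hypothetical, not from the real course page):

```python
from bs4 import BeautifulSoup

# A made-up table standing in for the real page.
sample = "<table><tr><td>Spring 2017</td><td>Fall 2016</td></tr></table>"

parser = BeautifulSoup(sample, "html.parser")
cells = parser.find_all("td")   # a list of every <td> element

print(len(cells))               # how many cells we matched
for cell in cells:
    print(cell.text)            # each element exposes its inner text via .text
```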
Great! Let's now find out how many previous versions of 15-112 there were before S17. Let's only look at cells that have the words "Previous versions" in their text:
for tableRow in parser.find_all("td"):
    if ("Previous versions" in tableRow.text):
        print(tableRow.text)
Let's now do some string formatting to find out how many courses there were:
for tableRow in parser.find_all("td"):
    if ("Previous versions" in tableRow.text):
        afterColon = tableRow.text[tableRow.text.find(":") + 1:]
        afterColon = afterColon.strip()
        count = len(afterColon.split(","))
        print(count)
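To see what each string step does, here is the same formatting applied to a short hypothetical cell text (the real cell lists many more semesters):

```python
# Hypothetical cell text; the real cell on the page lists every prior semester.
text = "Previous versions: F16, S16, F15"

afterColon = text[text.find(":") + 1:]   # everything after the colon: " F16, S16, F15"
afterColon = afterColon.strip()          # drop surrounding whitespace: "F16, S16, F15"
count = len(afterColon.split(","))       # split on commas and count the pieces
print(count)
```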
If you run this you should get 22! At this point we've learned how to download the source of a website and then programmatically filter out information.
To learn more about Beautiful Soup's functionality for your term project, please explore the documentation found here.