Downloading the Source of the Website
We learned how to view the source code of a website using our browsers. How do we do this in Python?
Create a new file, name it webscraping.py, and add the following code to it:
The requests module lets us download the source code of a website. The first thing we need to do is import it:
import requests
Now let's choose a website to scrape from:
url = "http://www.kosbie.net/cmu/spring-17/15-112/syllabus.html"
Great! Now let's ask the requests module to download the website:
website = requests.get(url)
And finally, let's save the source into a variable. The website object we got above includes lots of information about the website we requested, but we only need the HTML code:
source = website.text
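If you're curious what else the website object holds, here is a small sketch that collects a few of its fields. The helper name describe_response is made up for this example and is not part of the requests module; status_code, headers, encoding and text are attributes of the response object that requests.get returns.

```python
def describe_response(response):
    """Collect a few commonly used fields from a requests response object.

    describe_response is a hypothetical helper for this tutorial,
    not a function provided by the requests module.
    """
    return {
        # 200 means the request succeeded, 404 means the page was not found
        "status_code": response.status_code,
        # the kind of content the server says it sent, e.g. "text/html"
        "content_type": response.headers.get("Content-Type"),
        # the text encoding requests uses to build response.text
        "encoding": response.encoding,
        # number of characters in the downloaded source
        "length": len(response.text),
    }
```

With the code above in place, print(describe_response(website)) shows these values for the page we requested.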
Let's print the source to see what we got:
print(source)
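The steps above can also be collected into one function. This is only a sketch: the name download_source and the 10-second timeout are choices made for this example. It adds two safety checks the code above does not have: a timeout, so the program does not wait forever on a slow server, and raise_for_status(), which raises an error if the server answered with a failure code such as 404.

```python
import requests


def download_source(url, timeout=10):
    """Return the HTML source of url, or raise an error if the request failed.

    download_source is a hypothetical helper; the default timeout of
    10 seconds is an arbitrary choice for this sketch.
    """
    # Give up if the server takes longer than `timeout` seconds to respond
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    # Only the HTML text is needed, not the rest of the response object
    return response.text
```

You could then write source = download_source(url) in place of the two separate steps.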
If everything went well, you should see a large blob of HTML code! This is all the content that is on the real website! If you're curious, you can copy this into a new file, save it as foo.html, and open it in your browser. You will see the real website, usually without its styling (CSS) or functionality (JS), since those often live in separate files.
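Instead of copying and pasting by hand, we can let Python write the file for us. Here is a minimal sketch, assuming source holds the HTML text we downloaded; save_source is a name made up for this example, and UTF-8 is an assumed encoding for the output file.

```python
def save_source(source, filename="foo.html"):
    """Write the downloaded HTML to a file and return the filename.

    save_source is a hypothetical helper for this tutorial.
    """
    # "w" creates the file, or overwrites it if it already exists;
    # UTF-8 is assumed here so non-English characters are kept intact
    with open(filename, "w", encoding="utf-8") as f:
        f.write(source)
    return filename
```

Calling save_source(source) at the end of webscraping.py creates foo.html next to the script, ready to open in your browser.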