Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.
Guide to Web Scraping
Let's get you started with web scraping and Python. Before we begin, here are some important rules to follow and understand:
Always be respectful and try to get permission to scrape. Do not bombard a website with scraping requests, or your IP address may be blocked!
Be aware that websites change often, meaning your code could go from working to totally broken from one day to the next.
Pretty much every web scraping project of interest is a unique and custom job, so try your best to generalize the skills learned here.
OK, let's get started with the basics!
Basic Components of a Website
HTML
HTML stands for Hypertext Markup Language, and every website on the internet uses it to display information. Even the Jupyter Notebook system uses it to display this information in your browser. If you right-click on a website and select "View Page Source", you can see the raw HTML of the web page. This raw HTML is what Python will read to grab information. Let's take a look at a simple webpage's HTML:
Let's break down these components.
Every tag, an element name written in angle brackets such as <title>, indicates a specific block type on the webpage.
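As a rough illustration of what tags look like, the sketch below stores a small, made-up page as a Python string and lists the tags it contains using the standard library's html.parser module. The page content and the TagCollector class name are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# A made-up page for illustration.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
  <head>
    <title>Title on Browser Tab</title>
  </head>
  <body>
    <h1>Website Header</h1>
    <p>Some Paragraph</p>
  </body>
</html>"""


class TagCollector(HTMLParser):
    """Record the name of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(SAMPLE_HTML)
print(collector.tags)
```

Each name in the printed list corresponds to one block of the page: the title shown on the browser tab, a header, a paragraph, and so on.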
CSS
CSS stands for Cascading Style Sheets. This is what gives "style" to a website, including colors, fonts, and even some animations! CSS uses attributes such as id or class to connect an HTML element to a CSS feature, such as a particular color. An id identifies a single HTML tag and must be unique within the HTML document, basically a single-use connection. A class defines a general style that can then be linked to multiple HTML tags. Basically, if you only want a single HTML tag to be red, you would use an id; if you wanted several HTML tags/blocks to be red, you would create a class in your CSS doc and then link it to each of these blocks.
Scraping Guidelines
Keep in mind that you should always have permission from the website you are scraping! Check a website's terms and conditions for more info. Also keep in mind that a computer can send requests to a website very fast, so a website may block your computer's IP address if you send too many requests too quickly. Lastly, websites change all the time! You will most likely need to update your code often for long-term web scraping jobs.
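One simple way to avoid sending requests too quickly is to pause between them. The helper below is a minimal sketch of that idea; the function name fetch_politely, its parameters, and the one-second default are assumptions chosen for illustration, and the fetch argument stands in for whatever function actually downloads a page.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        results.append(fetch(url))
    return results


# Demonstration with a stand-in fetch function, so nothing is downloaded.
print(fetch_politely(["page-a", "page-b"], str.upper, delay_seconds=0))
```

With a real scraper, fetch could be something like lambda url: requests.get(url).text.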
Web Scraping with Python
There are a few libraries you will need. You can go to your command line and install them with conda install (if you are using the Anaconda distribution) or pip install (for other Python distributions).
If you are not using the Anaconda installation, use pip install instead of conda install, for example pip install requests.
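After installing, a quick check like the one below can confirm that the libraries are importable. This is only a sketch: it assumes the two libraries used later in this guide, requests and Beautiful Soup, and note that Beautiful Soup is imported under the name bs4.

```python
import importlib.util

# Import names, not install names: Beautiful Soup is imported as "bs4".
REQUIRED = ["requests", "bs4"]

# find_spec returns None when a top-level module cannot be found.
missing = [name for name in REQUIRED if importlib.util.find_spec(name) is None]

if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```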
Now let's see what we can do with these libraries.
Example Task 0 - Grabbing the title of a page
Let's start very simple: we will grab the title of a page. Remember that this is the HTML block with the title tag. For this task we will use www.example.com, which is a website specifically made to serve as an example domain. Let's go through the main steps:
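The first step, requesting the page, might look like the sketch below. The fetch_page name, the timeout value, and the raise_for_status() check are additions for illustration; the call is left commented out because it needs network access.

```python
import requests


def fetch_page(url, timeout=10):
    """Request a page and return the requests.models.Response object."""
    res = requests.get(url, timeout=timeout)
    res.raise_for_status()  # raise an error for 4xx/5xx responses
    return res


# Example call (needs network access):
# res = fetch_page("http://www.example.com")
# print(type(res))
```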
The object returned by requests.get() is a requests.models.Response object, and it actually contains the information from the website. For example, its .text attribute holds the page's HTML as a string.
Now we use BeautifulSoup to analyze the extracted page. Technically we could use our own custom script to look for items in the string of res.text, but the BeautifulSoup library already has lots of built-in tools and methods to grab information from a string of this nature (basically an HTML file). Using BeautifulSoup we can create a "soup" object that contains all the "ingredients" of the webpage. Don't ask me about the weird library names, I didn't choose them! 😃
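Creating the soup object could look like this. To keep the sketch runnable offline, a short made-up HTML string stands in for res.text, and Python's built-in "html.parser" is used as the parser; both choices are assumptions for illustration.

```python
import bs4

# Stand-in for res.text so the example runs offline.
html = (
    "<html><head><title>Example Domain</title></head>"
    "<body><h1>Example Domain</h1><p>Some text.</p></body></html>"
)

# "html.parser" ships with Python; other parsers such as "lxml" need a separate install.
soup = bs4.BeautifulSoup(html, "html.parser")

print(type(soup))
print(soup.prettify())  # the parsed page, indented for easier reading
```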
Now let's use the .select() method to grab elements. We are looking for the 'title' tag, so we will pass in 'title'.
Notice what is returned here: it's actually a list containing all the title elements (along with their tags). You can use indexing or even looping to grab the elements from the list. Since each element is still a specialized Tag object, we can use method calls to grab just the text.
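The selection and text-extraction steps described above can be sketched as follows, again with a made-up HTML string standing in for the downloaded page.

```python
import bs4

# Stand-in for the downloaded page.
html = "<html><head><title>Example Domain</title></head><body></body></html>"
soup = bs4.BeautifulSoup(html, "html.parser")

titles = soup.select("title")        # a list of matching Tag objects
first_title = titles[0]              # index into the list to get one Tag
title_text = first_title.get_text()  # just the text, without the tags

print(titles)
print(title_text)
```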
Example Task 1 - Grabbing all elements of a class
Let's try to grab all the section headings of the Wikipedia Article on Grace Hopper from this URL: https://en.wikipedia.org/wiki/Grace_Hopper
Now it's time to figure out what we are actually looking for. Inspect the element on the page to see that the section headers have the class "mw-headline". Because this is a class and not a straight tag, we need to adhere to some CSS syntax. The common selector patterns are:
| Syntax to pass to the .select() method | Match Results |
|---|---|
| `soup.select('div')` | All elements with the `<div>` tag |
| `soup.select('#some_id')` | The HTML element containing the id attribute of some_id |
| `soup.select('.notice')` | All the HTML elements with the CSS class named notice |
| `soup.select('div span')` | Any elements named `<span>` that are within an element named `<div>` |
| `soup.select('div > span')` | Any elements named `<span>` that are directly within an element named `<div>`, with no other element in between |
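To see these selector patterns in action, the sketch below runs each of them against a small, made-up HTML fragment. The heading texts and the id are placeholders, and the live Wikipedia page's markup may differ from this.

```python
import bs4

# Made-up fragment; the id, heading texts, and layout are placeholders.
html = """
<div id="content">
  <h2><span class="mw-headline">Early life</span></h2>
  <p class="notice">A note with a <span>nested span</span>.</p>
  <span>direct child span</span>
  <h2><span class="mw-headline">Career</span></h2>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

print(len(soup.select("div")))          # elements with the <div> tag
print(soup.select("#content")[0].name)  # the element with id="content"
print(len(soup.select(".notice")))      # elements with the class "notice"
print(len(soup.select("div span")))     # every <span> anywhere inside a <div>
print(len(soup.select("div > span")))   # only <span> tags directly inside a <div>

# Grab the text of every element with the class "mw-headline".
headings = [tag.get_text() for tag in soup.select(".mw-headline")]
print(headings)
```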
Example Task 3 - Getting an Image from a Website
Let's attempt to grab the image of the Deep Blue computer from this Wikipedia article: https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)
You can make dictionary-like calls for parts of the Tag. In this case, we are interested in the src, or "source", of the image, which should be its own .jpg or .png link.
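Dictionary-style access on a Tag might look like this. The markup, class name, and file path below are made up for illustration and are not taken from the actual article.

```python
import bs4

# Made-up markup; the class name and file path are placeholders.
html = (
    '<div><img class="thumbimage" '
    'src="//upload.example.org/images/deep_blue.jpg" alt="Deep Blue"></div>'
)
soup = bs4.BeautifulSoup(html, "html.parser")

image_tag = soup.select("img")[0]

print(image_tag["src"])        # dictionary-style access to an attribute
print(image_tag.get("width"))  # .get() returns None for a missing attribute
```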
We can actually display it with a markdown cell with the following:
Now that you have the actual src link, you can grab the image with requests.get and the .content attribute. Note how we had to add https:// before the link; if you don't do this, requests will complain (but it gives you a pretty descriptive error).
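As an alternative to adding the prefix by hand, the standard library's urljoin can resolve a src value against the URL of the page it came from. This is a swapped-in technique rather than the approach described above, and the src values shown are placeholders.

```python
from urllib.parse import urljoin

PAGE_URL = "https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)"


def absolute_image_url(page_url, src):
    """Resolve an image src against the URL of the page it came from."""
    return urljoin(page_url, src)


# Placeholder src values, not real file paths.
print(absolute_image_url(PAGE_URL, "//upload.example.org/images/deep_blue.jpg"))
print(absolute_image_url(PAGE_URL, "/static/logo.png"))
```

The resulting URL can then be passed to requests.get, and the image bytes read from the response's .content attribute.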
Let's write this to a file. Note the 'wb' mode in the open() call, which denotes binary writing of the file.
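Writing the bytes to disk can be sketched as below. A few stand-in bytes take the place of a real image's .content, and a temporary directory is used so the example leaves no files behind; the save_bytes name and the file name are assumptions.

```python
import os
import tempfile


def save_bytes(data, path):
    """Write raw bytes to a file, using 'wb' (write binary) mode."""
    with open(path, "wb") as f:
        f.write(data)


# Stand-in bytes; in practice this would be the .content of an image response.
fake_image = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "downloaded_image.png")
    save_bytes(fake_image, path)
    saved_size = os.path.getsize(path)

print(saved_size)  # number of bytes written
```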
Now we can display this file right here in the notebook as markdown using:
Just write the above line in a new markdown cell and it will display the image we just downloaded!
Example Project - Working with Multiple Pages and Items
Let's show a more realistic example of scraping a full site. The website: http://books.toscrape.com/index.html is specifically designed for people to scrape it. Let's try to get the title of every book that has a 2 star rating and at the end just have a Python list with all their titles.
We will do the following:
Figure out the URL structure to go through every page
Scrape every page in the catalogue
Figure out what tag/class represents the Star rating
Filter by that star rating using an if statement
Store the results to a list
We can see that the URL structure is the following:
We can then fill in the page number with .format()
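Filling in the page number could look like this. The URL pattern below is an assumption based on browsing the site's catalogue pages, so verify it in your browser before relying on it.

```python
# Pattern assumed from browsing the site; verify it before relying on it.
base_url = "http://books.toscrape.com/catalogue/page-{}.html"

# .format() drops the page number into the {} placeholder.
for page_number in (1, 2, 12):
    print(base_url.format(page_number))
```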
Now let's grab the products (books) from the get request result:
Now we can see that each book has the product_pod class. We can select any tag with this class, and then further reduce it by its rating.
Now by inspecting the site we can see that the class we want is class='star-rating Two'. If you click on this in your browser, you'll notice it displays the space as a ".", which means we want to search for ".star-rating.Two".
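The two selection steps can be tried on a simplified stand-in for the page. The fragment below only mimics the class names mentioned above; the real page contains more markup, and the book titles are made up.

```python
import bs4

# Simplified stand-in for two book entries; the real page has more markup.
html = """
<article class="product_pod">
  <p class="star-rating Two"></p>
  <h3><a href="#" title="Book A">Book A</a></h3>
</article>
<article class="product_pod">
  <p class="star-rating Four"></p>
  <h3><a href="#" title="Book B">Book B</a></h3>
</article>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

products = soup.select(".product_pod")
print(len(products))

# ".star-rating.Two" matches an element that has BOTH classes.
print(products[0].select(".star-rating.Two"))  # one match
print(products[1].select(".star-rating.Two"))  # no match, so an empty list
```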
But we are looking for 2 stars, so we can just check whether anything was returned: .select() gives back an empty list when a book does not have that class.
Alternatively, we can just quickly check the text string to see if "star-rating Two" is in it. Either approach is fine (there are also many other alternative approaches!)
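Both checks can be written as small helper functions. The function names and the sample fragment are assumptions for illustration; the string check relies on BeautifulSoup writing the class attribute back out as "star-rating Two".

```python
import bs4


def is_two_star_by_select(product):
    """True if .select() finds an element with both classes."""
    return len(product.select(".star-rating.Two")) > 0


def is_two_star_by_text(product):
    """True if the tag's HTML text contains the class string."""
    return "star-rating Two" in str(product)


# Simplified stand-in for two book entries.
html = """
<article class="product_pod"><p class="star-rating Two"></p></article>
<article class="product_pod"><p class="star-rating Four"></p></article>
"""
products = bs4.BeautifulSoup(html, "html.parser").select(".product_pod")

for product in products:
    print(is_two_star_by_select(product), is_two_star_by_text(product))
```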
Now let's see how we can get the title if we have a 2-star match:
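One way to pull out the title is to read the title attribute of the link inside the book's heading. The fragment below is a simplified guess at the page's structure (an image link followed by an h3 link carrying the full title), so check the real markup with your browser's inspector.

```python
import bs4

# Simplified guess at one book entry; titles and file names are made up.
html = """
<article class="product_pod">
  <div class="image_container">
    <a href="book-a.html"><img src="a.jpg" alt="Book A"></a>
  </div>
  <p class="star-rating Two"></p>
  <h3><a href="book-a.html" title="A Very Long Book Title">A Very Long ...</a></h3>
</article>
"""
product = bs4.BeautifulSoup(html, "html.parser").select(".product_pod")[0]

links = product.select("a")
print(len(links))  # the image link and the heading link

# The heading link carries the full title in its "title" attribute.
title = product.select("h3 a")[0]["title"]
print(title)
```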
Okay, let's give it a shot by combining all the ideas we've talked about! This should take about 20-60 seconds to run. Be aware that a firewall may prevent this script from running. Also, if you are getting a no-response error, try adding a sleep step with time.sleep(1).
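Putting the pieces together might look like the sketch below. It is one possible arrangement, not a definitive implementation: the URL pattern, the function names, the "h3 a" selector, and the fetch_html parameter are all assumptions. Passing the download function in as an argument lets the logic be tried offline with a stand-in page.

```python
import time

import bs4

# Assumed URL pattern; verify it in your browser first.
BASE_URL = "http://books.toscrape.com/catalogue/page-{}.html"


def two_star_titles_on_page(html):
    """Return the titles of all 2-star books found in one page of HTML."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    titles = []
    for product in soup.select(".product_pod"):
        # An empty result list is falsy, so this keeps only 2-star books.
        if product.select(".star-rating.Two"):
            titles.append(product.select("h3 a")[0]["title"])
    return titles


def scrape_two_star_titles(fetch_html, n_pages, delay_seconds=1.0):
    """Collect 2-star titles from pages 1..n_pages, pausing after each request."""
    all_titles = []
    for page_number in range(1, n_pages + 1):
        html = fetch_html(BASE_URL.format(page_number))
        all_titles.extend(two_star_titles_on_page(html))
        time.sleep(delay_seconds)
    return all_titles


# Offline demonstration with a made-up page in place of a real download.
sample_page = """
<article class="product_pod">
  <p class="star-rating Two"></p><h3><a title="Book A">Book A</a></h3>
</article>
<article class="product_pod">
  <p class="star-rating Five"></p><h3><a title="Book B">Book B</a></h3>
</article>
"""
print(scrape_two_star_titles(lambda url: sample_page, n_pages=2, delay_seconds=0))
```

For a real run, fetch_html could be lambda url: requests.get(url).text, with n_pages set to the number of pages in the catalogue.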
Excellent! You should now have the tools necessary to scrape any websites that interest you! Keep in mind, the more complex the website, the harder it will be to scrape. Always ask for permission!