Python Content Scraping With BeautifulSoup And Requests
One question that comes to mind is: why would you want to scrape content from another website?

* Populating an application with content (e.g., for a startup).
* Big data and data mining purposes.
* Search engines.
* Some web content providers simply don't offer APIs for their content, so the only way to get it is through scraping.
* For the fun of it as a programmer!
How does scraping work in Python?
1. Make a request using requests:
We start by sending a request to the target website's server. Something to note: the server may reject the request; Facebook, for example, is notable for that. There are many ways to bypass that, but they're not the topic of this tutorial, so let's keep it simple!
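Here is a minimal sketch of that first step, using otcollect.com (the site from the example below) as the target URL. The browser-like User-Agent header is an assumption on my part; some servers reject requests with the default one:

```python
import requests

url = "https://otcollect.com"
headers = {"User-Agent": "Mozilla/5.0"}  # browser-like UA makes rejection less likely

try:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # raises for 4xx/5xx, i.e. a rejected request
    html = response.text         # the raw HTML content as a string
    print(html[:200])            # preview the first 200 characters
except requests.RequestException as exc:
    print(f"Request failed: {exc}")
```

`raise_for_status()` turns HTTP error codes into Python exceptions, which keeps the failure handling in one place.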
2. Read and filter through the content using BeautifulSoup / an HTML parser
Now that we have a successful response from the server, we have all the HTML content we need. But as you have noticed, the content is cluttered with HTML elements; this is where BeautifulSoup comes in. Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping (crummy.com). Before we get started with BeautifulSoup, we need to view the page source (shown in the video) to figure out which elements and attributes hold the content we are interested in.
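As a small, self-contained sketch of the parsing step, here is how BeautifulSoup turns raw HTML into a searchable tree. The inline snippet stands in for a fetched page, and the class name mirrors the example used later in this tutorial:

```python
from bs4 import BeautifulSoup

# A tiny HTML snippet standing in for a fetched page
html = """
<html><body>
  <div class="content-timeline">
    <a href="/python-scraping">Python Content Scraping</a>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")  # build the parse tree

# The tree can now be searched by tag, class, attribute, etc.
div = soup.find("div", class_="content-timeline")
link = div.find("a")
print(link.text)     # → Python Content Scraping
print(link["href"])  # → /python-scraping
```

Note `class_` with a trailing underscore: `class` is a reserved word in Python, so BeautifulSoup uses this spelling for the HTML class attribute.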
For our example (otcollect.com), we want to get all the tutorial titles and links displayed on the homepage.
Inspecting the webpage source, we find that the content is housed in a div element with the class attribute content-timeline.
Now, let's parse through the HTML and display the tutorial articles and links.
All the code for the tutorial:
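The original listing did not survive extraction, so the sketch below reconstructs the full flow under the assumptions stated above: each tutorial on the otcollect.com homepage is an anchor tag inside the div with class content-timeline.

```python
import requests
from bs4 import BeautifulSoup

def scrape_tutorials(html):
    """Extract (title, link) pairs from the homepage HTML."""
    soup = BeautifulSoup(html, "html.parser")
    timeline = soup.find("div", class_="content-timeline")
    results = []
    # Each tutorial is assumed to be an <a> tag inside the container
    for link in timeline.find_all("a"):
        results.append((link.get_text(strip=True), link.get("href")))
    return results

if __name__ == "__main__":
    headers = {"User-Agent": "Mozilla/5.0"}  # browser-like UA to avoid rejection
    try:
        response = requests.get("https://otcollect.com", headers=headers, timeout=10)
        response.raise_for_status()
        for title, href in scrape_tutorials(response.text):
            print(f"{title} -> {href}")
    except requests.RequestException as exc:
        print(f"Request failed: {exc}")
```

Keeping the parsing logic in its own function makes it easy to test against a saved HTML snippet, without hitting the live site on every run.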