Python Content Scraping With BeautifulSoup And Requests

One question that comes to mind is: why would you want to scrape content from another website?

* Populating an application with content, e.g. for startups.
* Big data and data mining purposes.
* Search engines.
* Some web content providers simply don't offer APIs for their content; scraping is the only way to get to it.
* For the fun of it as a programmer!

How does scraping work in Python?

1. Make a request using requests:

We start by sending a request to the given website's server. Note that the server may reject the request; Facebook, for example, is notable for that. There are ways to get around it, but that's beyond the scope of this tutorial. Let's keep it simple!
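A minimal sketch of this first step looks like the following. The timeout and the `raise_for_status()` call are my additions for robustness, not part of the original tutorial code:

```python
import requests

URL = "https://otcollect.com/"

def fetch(url):
    # A timeout prevents the script from hanging forever if the server
    # never responds; raise_for_status() turns an HTTP error response
    # (403, 404, 500, ...) into a Python exception we can handle.
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.content  # raw HTML bytes

if __name__ == "__main__":
    html = fetch(URL)
    print(html[:200])
```

If the server rejects the request, `raise_for_status()` raises `requests.exceptions.HTTPError`, which is usually easier to debug than silently parsing an error page.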

2. Read and filter through the given content using Beautifulsoup / HTML parser

Now that we have a successful request from the server, we have all the HTML content we need. But as you have noticed, the content is cluttered with HTML elements, and this is where Beautiful Soup comes in. Beautiful Soup is a Python package for parsing HTML and XML documents; it creates a parse tree for parsed pages that can be used to extract data, which is what makes it useful for web scraping (crummy.com). Before we get started with Beautiful Soup, we need to view the page source (shown in the video) to figure out which elements and attributes hold the content we are interested in.
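To see what the parser gives us before touching a live site, here is a self-contained sketch using a small inline HTML snippet. The snippet itself is invented for illustration, but it is shaped like the markup we will target below:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a real page: a div with class "content-timeline"
# containing linked article headings.
html = """
<html><body>
  <div class="content-timeline">
    <a href="/posts/scraping-basics"><h3>Scraping Basics</h3></a>
    <a href="/posts/requests-intro"><h3>Requests Intro</h3></a>
  </div>
</body></html>
"""

# Build the parse tree, then search it by tag name and attributes.
soup = BeautifulSoup(html, "html.parser")
timeline = soup.find("div", attrs={"class": "content-timeline"})
for heading in timeline.find_all("h3"):
    print(heading.text, "->", heading.parent["href"])
```

`find()` returns the first matching element and `find_all()` returns every match, so the same two calls let us narrow down to a container and then iterate over its items.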

For our example (otcollect.com), we want to get all the tutorial titles and links displayed on the homepage. Inspecting the page source, we find that the content is housed in a div element with the class attribute "content-timeline". Now, let's parse through the HTML and display the tutorial articles and links.

All the code for the tutorial:

# import requests
import requests
# import BeautifulSoup
from bs4 import BeautifulSoup

# Make a request to the given URL
# (note: don't name this variable "re" -- that shadows Python's re module)
response = requests.get("https://otcollect.com/")
# Parse the response content with BeautifulSoup and the html5lib parser
soup = BeautifulSoup(response.content, "html5lib")
timeline = soup.find("div", attrs={"class": "content-timeline"})
# Loop over the timeline items and display the title and link of each
for row in timeline.find_all("div", attrs={"class": "timeline-item"}):
    title = row.h3.text
    link = f"https://otcollect.com{row.h3.parent['href']}"
    print(title)
    print(link)
