
Web Scraping with Beautiful Soup

Beautiful Soup is a Python library, used for pulling HTML & XML data out of websites straight into your code.

Mariya Sha

03 Jun 2020 — 2 min read


What can we do with Beautiful Soup?

Beautiful Soup allows us to hand-pick specific elements on a web page, such as <div>, <ul>, <table>, <a>, and other tags.
It allows us to target elements with specific attributes, for example all the <div class="main"> elements, or the <img width="300"> elements.
It also provides handy functions for extracting the text or the hyperlinks within a given element.
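
As a minimal sketch of the three capabilities above, the snippet below parses a small HTML string instead of a live page. The markup, class names and link are invented for illustration; find_all, get_text and get are standard Beautiful Soup calls.

```python
from bs4 import BeautifulSoup

# hypothetical HTML, invented for this example
html = """
<div class="main"><p>Hello <b>world</b></p></div>
<div class="sidebar"><a href="https://example.com/about">About</a></div>
<img width="300" src="cat.png">
"""

soup = BeautifulSoup(html, 'html.parser')

# hand-pick elements by tag name
all_divs = soup.find_all('div')

# target elements with specific attributes
main_divs = soup.find_all('div', attrs={'class': 'main'})
wide_imgs = soup.find_all('img', attrs={'width': '300'})

# pull the text and the hyperlink out of chosen elements
main_text = main_divs[0].get_text()
link_url = soup.find('a').get('href')

print(len(all_divs), main_text, link_url)
```

Working on an inline string like this is a convenient way to try selectors before pointing them at a real page.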

Code example:

We can easily extract the entire source code of a web page with the following commands:

import urllib.request
from bs4 import BeautifulSoup as bs

# load the web page (replace my-url.com with your own URL)
my_url = urllib.request.urlopen('http://my-url.com')

# pull the HTML out of the web page
# (naming the parser explicitly avoids a warning from Beautiful Soup)
soup = bs(my_url, 'html.parser')

print('extracted HTML code:\n', soup)

What can we do with web scraping?

Web scraping comes in handy when copying and pasting by hand is not practical, for example when building a database from many pages.
It is also a good fit for information that updates frequently, such as stock prices or news headlines.

Building a database with Beautiful Soup

Datasets are at the heart of many Python projects, yet we rarely get to collect and arrange them on our own.
The video below shows how to build a CSV database from scratch with the help of Beautiful Soup and Regex:

Step 1: pulling HTML out of a web page (please see the code example above).
Step 2: targeting elements of interest inside the HTML.

# targeting all the <ol> elements in a web page
my_element = soup.body.find_all('ol')

# targeting all <img class='cats'> elements
my_element = soup.body.find_all('img', attrs={'class': 'cats'})

Step 3: fine-tuning targeted elements with Regex (Regular Expressions), string concatenation or slicing.

import re

# targeting strings that begin with 'class="cat_'
# and end with different words (defined with \w+ in Regex)
my_list = re.findall(r'class="cat_\w+', str(my_element))

# slicing 'class="' (7 characters) off the start of every string on the list
my_list = [item[7:] for item in my_list]

>> cat_wellness, cat_nutrition, cat_health
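
To see Step 3 in isolation, here is a small sketch that runs the same pattern over a made-up HTML string standing in for str(my_element). The sample markup and variable names are assumptions for illustration; only the regex pattern comes from the step above.

```python
import re

# hypothetical markup standing in for str(my_element)
sample_html = (
    '<li class="cat_wellness">Wellness</li>'
    '<li class="cat_nutrition">Nutrition</li>'
    '<li class="cat_health">Health</li>'
)

# every string that starts with class="cat_ followed by word characters
matches = re.findall(r'class="cat_\w+', sample_html)

# drop the leading 'class="' prefix and keep everything after it
prefix = 'class="'
categories = [m[len(prefix):] for m in matches]

print(categories)
```

Using len(prefix) rather than a hard-coded number keeps the slice correct if the prefix ever changes.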

Step 4: storing the data inside a Data Frame.
Step 5: exporting Data Frame into a CSV file.

df.to_csv('file_name.csv')
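
Steps 4 and 5 might look like the sketch below, assuming pandas is installed. The column name and the list of values are placeholders and not taken from the video; pd.DataFrame and to_csv are standard pandas calls.

```python
import io
import pandas as pd

# placeholder data, as it might come out of Step 3
my_list = ['cat_wellness', 'cat_nutrition', 'cat_health']

# Step 4: store the data inside a Data Frame (column name is hypothetical)
df = pd.DataFrame({'category': my_list})

# Step 5: export the Data Frame as CSV; a text buffer is used here
# so the example leaves no file behind
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

Passing a path such as 'file_name.csv' instead of the buffer writes the same content to disk.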

For detailed examples and a step-by-step walk-through, please view the following video:

Thank you so much for reading this post. I hope it helped you with your project and showed you how convenient this amazing Beautiful Soup library is to use.

URL used in the video:
https://docs.python.org/3/library/random.html

Jupyter Notebook Code:
https://github.com/MariyaSha/WebScraping

Beautiful Soup Documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Subscribe to My YouTube Channel:
https://www.youtube.com/channel/UCKQdc0-Targ4nDIAUrlfKiA?

Add Me on LinkedIn:
https://www.linkedin.com/in/mariyasha888/