February 12, 2019
Web Scraping with Python and BeautifulSoup

In the last post we went through web scraping techniques in detail. Now we'll implement HTML parsing techniques using the Python programming language. Before that, we'll have a glance at the structural properties of HTML (HyperText Markup Language).

 

HTML is a child of Standard Generalized Markup Language (SGML), and has a sibling called Extensible Markup Language (XML). SGML provides a way to define markup languages: it specifies what the structure of a language should be and what elements, such as tags, it should contain. As a parent, SGML passes structure and formatting rules to HTML and XML. HTML has a set of predefined rules and tags, whereas XML is flexible enough to carry any kind of data by defining its own namespace.

 

HTML defines the form and appearance of a webpage. We can embed media like images and videos, add styles to fonts, create sections like headers and footers etc...

<!DOCTYPE html>
<html lang='en-IN'>
    <head>
        <!-- Your page's title goes here -->
        <title>First HTML Page</title>
    </head>
    <body>
        <h1 class='page-header' id='header1'>HTML Page header</h1>
        <p class='first-paragraph' id='paragraph1'><b>HTML</b> heading tags (&lt;h1&gt;, &lt;h2&gt;, etc..) are used to mark up the headings for paragraphs, page sections etc...</p>
    </body>
</html>

 

Copy the above HTML code to a text editor and save it as 'some-file-name.html'. Open this file with any web browser, like Google Chrome or Firefox. It should look like the output below

 

HTML Page header

   HTML heading tags (<h1>, <h2>, etc..) are used to mark up the headings for paragraphs, page sections etc...

 

Now let's go through the HTML code.

The first line of an HTML page should be <!DOCTYPE html>, which declares that this is an HTML document. Web browsers can render the document even if this line is missing, since HTML is the standard language for web pages. A piece of text enclosed between the less-than and greater-than symbols is called a tag; <tag> is a start tag and </tag> is an end tag. Each tag in the above code has a specific definition so that browsers can interpret and present it.

  • <html> - Marks the start and end of the HTML content and represents the root of the HTML tree.
  • <head> - Defines a section called the page head, which includes the meta-properties of the document like the title, scripts needed by the web page etc...
  • <body> - Defines the visible part of the page. Whatever content goes here will be visible to the user when the page is opened in a web browser.
  • <h1>, <h2>, <h3>, ... - Tags that hold headings and their weights (font-size and boldness).
  • <p> - Defines paragraphs and their properties.
  • <b> (bold) - Increases the weight of the enclosed text.

 

This is a very basic example of an HTML page. The pages that we see (browse) on the Internet have far more complex definitions written into them. Even the blog page that you're reading right now has hundreds of tags with thousands of properties defined on them.
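
As a small preview of where we're headed, here's a sketch of how the sample page above looks from Python once it's parsed with BeautifulSoup (installation is covered below; this sketch uses Python's built-in html.parser, so nothing extra is needed yet):

# A quick preview: parsing the sample HTML page above as a tree
from bs4 import BeautifulSoup

html_doc = """
<html lang='en-IN'>
    <head>
        <title>First HTML Page</title>
    </head>
    <body>
        <h1 class='page-header' id='header1'>HTML Page header</h1>
        <p class='first-paragraph' id='paragraph1'><b>HTML</b> heading tags</p>
    </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title.text)    # First HTML Page
print(soup.h1['class'])   # ['page-header']
print(soup.p.b.text)      # HTML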

 

This much knowledge of HTML is enough to get started with web scraping.

In any web scraping technique there are two main steps:

  1. Requesting the Server for HTML document and downloading it
  2. Parsing the HTML tree, so that we can access the data that we need by traversing the tree

 

Scraping with HTML parsing

Python Version: Python 3.x

Libraries Required: requests, BeautifulSoup (bs4), html5lib (or lxml), unicodecsv

Installation

To install the required packages, copy and paste the following pip commands into your shell and execute them

# Installing requests lib using pip
pip install requests

# Installing BeautifulSoup lib (bs4) with pip
pip install beautifulsoup4

# Installing html5lib/lxml with pip
pip install html5lib

     (or)

pip install lxml

# Installing unicodecsv
pip install unicodecsv
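
To confirm everything installed correctly, you can run a quick sanity check from Python; this just imports the packages and prints a couple of version numbers:

# Quick check that the installed packages import cleanly
import requests
import bs4
import html5lib  # or: import lxml, depending on which parser you installed
import unicodecsv

print(requests.__version__)
print(bs4.__version__)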

 

Goal

Suppose we want to get a list of the new smartphones and their prices available on amazon.in. Go to the amazon.in website and find the URL for the smartphone list page. We'll scrape the product list up to 3 pages and store the data in a CSV file.


# Import the necessary modules
import requests
from bs4 import BeautifulSoup
import unicodecsv as csv
import time # To keep time gap between requests
import random # To select a random time value

 

Next we'll write a simple scraper that gets and prints the product names from the first page.


url = "https://www.amazon.in/s/ref=sr_nr_n_1?fst=as%3Aoff&rh=n%3A976419031%2Cn%3A%21976420031%2Cn%3A1389401031%2Cn%3A1389432031%2Cn%3A1805560031&bbn=1389432031&sort=date-desc-rank&ie=UTF8&qid=1549909857&rnid=1389432031"

# Request the server for the page content
response = requests.get(url, headers={'User-agent': 'Sample bot Demo'})

# Get the HTML text from the response object
page_content = response.text

# Parse the data with html5lib using BeautifulSoup
soup = BeautifulSoup(page_content, 'html5lib')

# Select all the product names from the HTML tree and print
for p_name in soup.find_all('h2', attrs={'class':'s-access-title'}):
    print(p_name.text.strip())

 

Copy the above two code blocks to a file, save it as 'python_scraper.py' and run it in a shell as - python python_scraper.py. The output list of product names should match the product names present on the amazon page. Here you might wonder: where did the attrs values come from?

Open the link in Google Chrome / Firefox, right-click on any one of the product names and click on Inspect. Pick a suitable class value of the element that you've inspected, as shown in the pictures below.
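
If you'd rather explore the markup from Python than from the browser's inspector, one option is to pretty-print a product's subtree and read the class attributes off it directly. This is just a sketch, reusing the soup object from the scraper above:

# Pretty-print the first product's subtree to see its tags and classes
first_product = soup.find('li', attrs={'class': 's-result-item'})
if first_product is not None:
    print(first_product.prettify()[:2000])  # trimmed for readability

    # List the class values of every tag inside the product block
    for tag in first_product.find_all(True):
        if tag.get('class'):
            print(tag.name, tag.get('class'))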

 

I hope the process is clear; now I'll proceed with completing the task. The same thing applies to the other parameters that you want to extract from the page, like price, url etc...

 


import requests
from bs4 import BeautifulSoup
import unicodecsv as csv
import time
import random

# Scrapes the sample product list from amazon.in
def amazon_list_scraper(url, num_pages):
    # Open a file to store the scraped content
    products_file = open('amazon_product_list.csv', 'wb')

    # Create a csv writer on the file
    writer = csv.writer(products_file, delimiter=",", lineterminator="\n",
                        quoting=csv.QUOTE_ALL)

    # Write the header for the file
    writer.writerow(['product_name', 'price', 'url'])

    # Iterate over the first num_pages pages
    for page_num in range(1, num_pages+1):
        # Add the page variable to the url
        url_page = url+'&page={}'.format(page_num)

        # Request the url data; don't forget to add a User-agent header.
        # Without a user-agent, amazon doesn't allow access to the data
        response = requests.get(url_page, headers={'User-agent': 'Demo bot'})

        # A status code of 2xx means the request was accepted
        if 200 <= response.status_code <= 299:
            page_content = response.text # Get the text of the page (HTML content)

            # Parse the html with html5lib
            soup = BeautifulSoup(page_content, 'html5lib')

            # Iterate over all the products found on the page
            for product in soup.find_all('li', attrs={'class':'s-result-item'}):
                try:
                    # find() returns the first element matching the criteria
                    product_name = product.find('h2', attrs={'class':'s-access-title'}).text
                    product_price = product.find('span', attrs={'class':'a-color-price'}).text
                    product_url = product.find('a',
                                               attrs={'class':'s-access-detail-page'})['href']

                    # Write data to the file
                    writer.writerow([product_name, product_price, product_url])
                except (AttributeError, TypeError, UnicodeDecodeError):
                    # Skip products that are missing any of the fields
                    continue

        # Status code 403 means the URL is forbidden
        elif response.status_code == 403:
            print("Forbidden URL")

        # Page not found status code
        elif response.status_code == 404:
            print("URL doesn't exist.")

        # Service unavailable status code
        elif response.status_code == 503:
            print("Service Unavailable.")

        else:
            print(response.status_code)

        # Introduce a reasonable random time gap between page requests (2-5 seconds)
        time.sleep(random.randint(2, 5))
    
    # Close the file at last
    products_file.close()


if __name__ == "__main__":
    # URL of the amazon product list (split across lines for readability)
    url = ("https://www.amazon.in/s/ref=sr_nr_n_1?fst=as%3Aoff&rh=n%3A976419031%2Cn%"
           "3A%21976420031%2Cn%3A1389401031%2Cn%3A1389432031%2Cn%3A1805560031&bbn=13894"
           "32031&sort=date-desc-rank&ie=UTF8&qid=1549909857&rnid=1389432031")
    amazon_list_scraper(url, 3)
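
Once the script finishes, you can sanity-check the output by reading the CSV back. Here's a small sketch; it assumes the run above produced 'amazon_product_list.csv' in the current directory:

import unicodecsv as csv

# Read back the scraped data and print the header plus the first few rows
with open('amazon_product_list.csv', 'rb') as f:
    reader = csv.reader(f)
    for i, row in enumerate(reader):
        print(row)
        if i >= 5:
            break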

 

Hope this helps. We'll meet again in the next post. Until then, 'Eat, Code, Sleep, Repeat'. Thank you.

Note: This is simple, sample scraping code. I've scraped only the first 3 pages to avoid imposing load on the amazon server. If you want to scrape more data from amazon, go through its access rules in robots.txt, and if you want to know more about the ethics of web scraping, go through my previous post.

 

 

