February 16, 2019
Web scraping with Python and Selenium

 

Last time we discussed web scraping with Python's BeautifulSoup. In this post I'll explain how to scrape data using Selenium and Python. This method of data scraping is called DOM parsing.

 

Selenium is a portable framework for automating tests of web applications. It provides browser commands for all the kinds of actions a user can perform in a browser window, such as:

  1. Open a URL (website)
  2. Click buttons and links
  3. Select elements on the web page and get their values
  4. Copy text from a specific section
  5. Get the status of HTML elements
  6. Type text into input/text boxes

and what not! It gives us full control of the web browser; what we achieve with it is up to us.
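
Once the environment described below is set up, each of those actions maps to a one-line call. Here is a minimal sketch of what they look like in code (the URL and element names are purely illustrative):


from selenium import webdriver

browser = webdriver.Chrome(executable_path='chromedriver')
browser.get('https://www.example.com/')                # 1. open a URL
browser.find_element_by_link_text('More information...').click()  # 2. click a link
heading = browser.find_element_by_tag_name('h1')       # 3. select an element
print(heading.text)                                    # 4. copy its text
print(heading.is_displayed())                          # 5. status of the element
# 6. typing into an input/text box:
# browser.find_element_by_name('q').send_keys('hello')
browser.quit()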

 

Importance of Selenium for data scraping

There are other libraries available for data scraping in Python that are fast and efficient (requests, urllib, etc.). So why do we need Selenium, which requires a more complex environment setup, consumes more resources and is slower than the other Python scraping alternatives?

 

If we look at the architecture of a typical web page, it is built from HTML, CSS and JavaScript. HTML gives the page its skeletal structure, CSS adds styling and alignment, and JavaScript adds the dynamic functionality. As web technologies evolved, JavaScript took more and more weight on its shoulders: websites are now capable of loading and rendering HTML or data dynamically based on the user's actions.

Python's other alternatives are limited to HTML. They can fetch the HTML returned by a URL, but they cannot execute JavaScript to load the dynamic data. This shows the clear importance of a library that can access a website and is capable of executing its JavaScript. Selenium is one of a kind in achieving this goal, as it drives a real web browser while keeping its controls in our hands.
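
As a quick illustration of the difference, the sketch below (the URL is just an example) drives a real browser, lets the page's own scripts run, and then reads the rendered DOM and even executes JavaScript of our own:


from selenium import webdriver
import time

browser = webdriver.Chrome(executable_path='chromedriver')
browser.get('https://www.example.com/')
time.sleep(2)                                  # give the page's scripts a moment to run
rendered_html = browser.page_source            # the DOM after JavaScript has executed
title = browser.execute_script("return document.title;")  # run our own JS inside the page
print(title, len(rendered_html))
browser.quit()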

 

Setting Up the environment

Now let's see how to set up the environment.

Whichever platform (operating system) you are on, I suggest using either Anaconda or virtualenv to create an isolated Python environment instead of installing packages system-wide. Create a Python 3.x environment and install the following packages.
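
For example, on Linux/macOS you could create and activate such an environment as follows (the environment name and Python version here are just examples; adjust them to your setup):


# Using virtualenv / venv
python3 -m venv scraper-env
source scraper-env/bin/activate      # on Windows: scraper-env\Scripts\activate

# Or using Anaconda
# conda create -n scraper-env python=3.7
# conda activate scraper-env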

Packages required: Selenium, unicodecsv, the ChromeDriver executable and the Google Chrome web browser (if not already installed on your machine)

Installation

To install the required packages, activate the virtual environment you created earlier, then copy and paste the following pip commands into your shell and execute them:


# To install the Selenium framework for python
pip install selenium

# To install unicodecsv
pip install unicodecsv
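
As a quick sanity check that the installation succeeded, the following one-liner (run in the same shell) should print the installed Selenium version without errors:


# Confirm the packages are importable from the active environment
python -c "import selenium, unicodecsv; print(selenium.__version__)"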

 

For ChromeDriver, you can visit this link and download the driver suitable for your platform (operating system). Extract the zip file into your current working directory. To install the Google Chrome browser, visit this link and follow the instructions.
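
Optionally, you can confirm that the downloaded driver runs on your machine; the path below assumes the Linux/macOS binary extracted into the current directory (on Windows it would be chromedriver.exe), and the driver version should be compatible with your installed Chrome:


# Check that the driver binary runs and note its version
./chromedriver --version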

 

WebDriver (ChromeDriver, FirefoxDriver, etc.) is an open source tool for automated testing of web apps across many browsers. It provides capabilities for navigating to web pages, clicking buttons, filling in user input forms, executing JavaScript, and more. ChromeDriver simply helps Selenium do this job on Chrome. In more technical terms, ChromeDriver is a standalone server that implements WebDriver's wire protocol for Chromium.
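
To make the "standalone server" point concrete, here is a minimal sketch (not needed for the rest of this post) that attaches to a ChromeDriver instance started by hand, e.g. with ./chromedriver --port=9515; the port number is just an example:


# Attach to an already-running ChromeDriver over HTTP instead of letting Selenium start it
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

browser = webdriver.Remote(command_executor='http://127.0.0.1:9515',
                           desired_capabilities=DesiredCapabilities.CHROME)
browser.get('https://www.example.com/')
print(browser.title)
browser.quit()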

 

First we check whether the Chrome browser bindings are working fine, and then we proceed with our scraping job. Copy the following code into a text editor and save it as 'selenium-python-scraper.py' in the same directory where you extracted the ChromeDriver package.

 


# import the required packages
from selenium import webdriver
import time


# function to create the browser instance.
# Looks simple but we add more stuff in the next section
def get_browser():
    # executable_path = 'path to the chromedriver file that you have downloaded'
    browser = webdriver.Chrome(executable_path='chromedriver')
    return browser


if __name__ == '__main__':
    # Initiate the browser
    browser = get_browser()
    time.sleep(2)

    # Access Google India website
    browser.get('https://www.google.co.in/')

    # Wait for 2 sec to load
    time.sleep(2)

    # Find the Input text box with attribute name value 'q'
    input_box = browser.find_element_by_name('q')

    # Send the search keyword to the input box
    input_box.send_keys("Now we know that the chrome bindings are working properly!")

    time.sleep(5)

    # Clear the input box
    input_box.clear()
    time.sleep(2)

    input_box.send_keys("Now we're quitting the window!")
    time.sleep(5)

    # Quit the browser instance
    browser.quit()

 

When you run this file, you should see a Google Chrome window launch, open the Google India home page, type the messages mentioned above and finally quit. If everything went well, congratulations, you've successfully configured your Python Selenium environment!

 

Data Scraping Session

In the previous post we built a scraper to get a product list from amazon.in using Python's requests and BeautifulSoup. Now we'll implement the same functionality using Selenium WebDriver.

 


from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time
import unicodecsv as csv


def get_browser():
    options = webdriver.ChromeOptions()
    # Set a custom user agent string for the browser
    options.add_argument('--user-agent=Demo bot')
    # options.add_argument('--headless') # disables the UI (User Interface) and runs the browser in the background

    # Disables Chrome's sandbox (recommended not to enable if you access untrusted websites)
    # options.add_argument('--no-sandbox')
    browser = webdriver.Chrome(executable_path='chromedriver',
                               chrome_options=options)
    return browser


def get_product_list(browser, url, num_pages):
    # Open a file to store the scraped content
    products_file = open('amazon_product_list.csv', 'wb')

    # Create a csv writer method on the file
    writer = csv.writer(products_file, delimiter=",", lineterminator="\n",
                        quoting=csv.QUOTE_ALL)

    # Write the header for the file
    writer.writerow(['product_name', 'price', 'url'])

    # Access the URL
    browser.get(url)
    time.sleep(1)

    # Iterate for the first #n pages
    for page_num in range(1, num_pages + 1):
        print ("Extracting page {} data.".format(page_num))
        # Add the page variable to the url

        # Iterate on all the products found on the page
        for product in browser.find_elements_by_class_name('s-result-item'):
            try:
                # find_element_by_class_name returns the first element matching the class
                product_name = product.find_element_by_class_name('s-access-title').text
                product_price = product.find_element_by_class_name('a-color-price').text
                product_url = product.find_element_by_class_name('s-access-detail-page'). \
                    get_attribute('href')

                # Write data to the file
                writer.writerow([product_name, product_price, product_url])

                # Introduce a reasonable random delay between requests (requires: import random)
                # time.sleep(random.randint(2, 5))
            except NoSuchElementException:
                pass
            except UnicodeDecodeError:
                pass
            except Exception:
                # Skip any other product block that fails to parse
                pass

        # JavaScript code to click on the next page link
        javascript_code = "document.getElementById('pagnNextLink').click();"
        # execute the JavaScript in the browser window
        browser.execute_script(javascript_code)
        time.sleep(5)

    # Close the file at last
    products_file.close()


if __name__ == '__main__':
    url = "https://www.amazon.in/s/ref=sr_nr_n_1?fst=as%3Aoff&rh=n%3A976419031%2Cn%3A" \
          "%21976420031%2Cn%3A1389401031%2Cn%3A1389432031%2Cn%3A1805560031" \
          "&bbn=1389432031&sort=date-desc-rank&ie=UTF8&qid=1549909857&rnid=1389432031"
    browser = get_browser()
    get_product_list(browser=browser, url=url, num_pages=3)
    browser.quit()
    print("Successfully scraped and saved data.")

 

After successful execution, you'll find a file named 'amazon_product_list.csv' in your project's working directory.
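
To quickly inspect the result, you could load the CSV back, for example with pandas (an extra dependency, installed with pip install pandas):


# Peek at the scraped data (assumes pandas is installed)
import pandas as pd

df = pd.read_csv('amazon_product_list.csv')
print(df.head())
print(len(df), "products scraped")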

 

Conclusion

Machine Learning / Artificial Intelligence needs a lot of data to build intelligent systems. Having these web scraping techniques in your skill set will be handy when you need data that is not readily available for download.

 

This is a basic example, and I hope it helps you get started with scraping. We'll meet again in the next post. Until then, 'Eat, Code, Sleep, Repeat'. Thank you.

 

