Selenium is a portable framework for automating tests of web applications. It provides browser commands for every kind of action a user can perform in a browser window, such as:
- Open a URL (website)
- Click buttons and links
- Select elements on the web page and read their values
- Copy text from a specific section
- Get the status of HTML elements
- Type text into input/text boxes

In short, it gives us full control of the web browser; what we achieve with that control is up to us.
Importance of Selenium for data scraping
Python already has fast, efficient libraries for data scraping (requests, urllib, etc.). So why do we need Selenium, which requires a more complex environment setup, consumes more resources, and runs slower than the other Python scraping alternatives? Because Selenium drives a real browser: it can render JavaScript-heavy pages and interact with dynamic content, things a plain HTTP library cannot do.
Setting Up the environment
Let's see how to set up the environment.
Whichever platform (operating system) you are on, I suggest using either Anaconda or virtualenv to create an isolated Python environment instead of installing packages system-wide. Create a Python 3.x environment and install the following packages.
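For example, a virtualenv-based setup might look like this on Linux/macOS (the environment name `scraper-env` is just an example):

```shell
# Create an isolated Python 3 environment named scraper-env
python3 -m venv scraper-env

# Activate it (on Windows use: scraper-env\Scripts\activate)
. scraper-env/bin/activate

# Confirm the interpreter inside the environment is Python 3.x
python --version
```

Packages installed with pip while this environment is active stay inside `scraper-env` and won't touch your system Python.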
Packages required: Selenium, unicodecsv, the Google Chrome driver executable, and the Chrome web browser (if not already installed on your machine).
To install the required packages, activate the virtual environment you created earlier, then copy the following pip commands into your shell and run them:
```shell
# To install the Selenium framework for Python
pip install selenium

# To install unicodecsv
pip install unicodecsv
```
For the Chrome driver, you can visit this link and download the driver suitable for your platform (operating system). Extract the zip file into your current working directory. To install the Google Chrome browser, visit this link and follow the instructions.
First we check that the Chrome browser bindings are working correctly, then we proceed with our scraping job. Copy the following code into a text editor and save it as 'selenium-python-scraper.py' in the same directory where you extracted the Chrome driver package.
```python
# import the required packages
from selenium import webdriver
import time


# function to create the browser instance.
# Looks simple, but we will add more to it in the next section
def get_browser():
    # executable_path = 'path to the chromedriver file that you have downloaded'
    browser = webdriver.Chrome(executable_path='chromedriver')
    return browser


if __name__ == '__main__':
    # Initiate the browser
    browser = get_browser()
    time.sleep(2)

    # Access the Google India website
    browser.get('https://www.google.co.in/')
    # Wait for 2 sec to load
    time.sleep(2)

    # Find the input text box with attribute name='q'
    input_box = browser.find_element_by_name('q')
    # Send the search keyword to the input box
    input_box.send_keys("Now we know that the chrome bindings are working properly!")
    time.sleep(5)

    # Clear the input box
    input_box.clear()
    time.sleep(2)
    input_box.send_keys("Now we're quitting the window!")
    time.sleep(5)

    # Quit the browser instance
    browser.quit()
```
When you run this file, you should see a Google Chrome window launch, open the Google India home page, type the messages above, and finally quit. If everything went well, congratulations: you've successfully configured your Python Selenium environment!
Data Scraping Session
In the previous post we built a scraper to get a product list from amazon.in using Python's requests and BeautifulSoup. Now we'll implement the same functionality using Selenium WebDriver.
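A minimal sketch of such a scraper is below. The search URL, CSS selectors, and helper names are my assumptions for illustration only; Amazon's markup changes frequently, so expect to adjust the selectors. Note that on Python 3 the standard-library csv module handles Unicode natively, so it is used here in place of unicodecsv; the selenium import is kept inside the scraping function so the CSV helper works even without a browser installed.

```python
import csv   # Python 3's csv module writes Unicode natively
import time


def write_product_rows(path, rows):
    """Write (title, price) rows to a CSV file with a header line."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['title', 'price'])
        writer.writerows(rows)


def scrape_amazon(keyword, out_path='amazon_product_list.csv'):
    # Imported here so write_product_rows can be used without Selenium
    from selenium import webdriver

    browser = webdriver.Chrome(executable_path='chromedriver')
    try:
        # The URL pattern below is an assumption about Amazon's search page
        browser.get('https://www.amazon.in/s?k=' + keyword)
        time.sleep(3)  # crude wait for the results to render

        # CSS selectors here are assumptions; inspect the live page to confirm
        items = browser.find_elements_by_css_selector('div.s-result-item')
        rows = []
        for item in items:
            try:
                title = item.find_element_by_css_selector('h2').text
                price = item.find_element_by_css_selector('span.a-price-whole').text
            except Exception:
                continue  # skip ads / items missing a title or price
            rows.append((title, price))
        write_product_rows(out_path, rows)
    finally:
        browser.quit()


if __name__ == '__main__':
    scrape_amazon('laptop')
```

The CSV-writing step is kept separate from the browser logic so you can swap in different selectors or a different site without touching the output code.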
After a successful run, you'll find a file named 'amazon_product_list.csv' in your project working directory.
Machine learning / artificial intelligence needs a lot of data to build intelligent systems. Having these web scraping techniques in your skill set will come in handy when the data you need is not readily available for download.
This is a basic example, and I hope it helps you get started with scraping. We'll meet again in the next post. Until then, 'Eat, Code, Sleep, Repeat'. Thank you.