February 11, 2019
Data Scraping Techniques

The world is moving fast, and every day we see new technologies arriving. From the live traffic and weather updates on our smartphones to the AI-based personal voice assistants we use today, everything is driven by data. Much of this data comes from the World Wide Web. In this post I'll explain the different methods of scraping data from the Internet.

 

Data extraction (or gathering) from the Internet can be accomplished in two ways.

 

  1. Through an official API provided by the site. For example, most social media sites (Twitter, Facebook, etc.) provide official APIs to access their data, and they may charge based on the bandwidth we use.
  2. Since not all websites provide APIs, we have to employ humans or build an automated program/tool to do the job for us. We call this way of extracting data Data Scraping!

 

Web Scraping (also termed Data Scraping, Web Data Extraction, Web Harvesting, etc.) is a technique for extracting large amounts of data from websites and saving it to a local file or a database in a more structured format. This is an actively developing field, and a lot of research is still going on to improve semantic visual parsing, i.e. making the computer visually parse a web page and extract information based on the true meaning of its content, like humans do!

 

Now let's look at the techniques we can use for web scraping.

 

Human Copy and Paste

This is the simplest method. People manually browse the Internet and copy the required information into a file/database.

 

  • Advantages: 
    • Very simple method.
    • No programming experience or tooling needed.
    • Useful when the target websites explicitly set up barriers to prevent machine automation.

 

  • Disadvantages:
    • A lot of manual effort is needed.
    • Takes more time than automated scraping.
    • Extremely painful if we require data from thousands of websites.

 

Text pattern matching

A simple yet powerful approach to extracting information from web pages is to use the UNIX curl and grep commands or the regular expression matching facilities of programming languages.

 

  • Advantages
    • Easy to write the logic if we're familiar with regular expressions.
    • Supported by most programming languages, and the syntax stays largely the same across them.
    • Really good at extracting information from unstructured data, as it is based on pattern matching.

 

  • Disadvantages
    • Writing regular expressions is difficult and time-consuming for someone who is not familiar with them.
    • New regular expressions have to be written whenever the target website changes its structure, which takes time.
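
To make the pattern matching idea concrete, here is a minimal sketch in Python using the standard `re` module. The HTML snippet, the `ITEM_PATTERN` expression and the `extract_items` helper are all made up for illustration; a real page would need its own pattern.

```python
import re

# Hypothetical HTML snippet standing in for a downloaded page.
SAMPLE_HTML = """
<ul>
  <li class="item">Laptop - $899.99</li>
  <li class="item">Mouse - $19.50</li>
  <li class="note">Prices exclude tax</li>
</ul>
"""

# Capture the product name and the price from each "item" list entry.
ITEM_PATTERN = re.compile(
    r'<li class="item">(?P<name>[^<]+?) - \$(?P<price>\d+(?:\.\d+)?)</li>'
)


def extract_items(html):
    """Return a list of (name, price) tuples, one per pattern match."""
    return [
        (match.group("name"), float(match.group("price")))
        for match in ITEM_PATTERN.finditer(html)
    ]


print(extract_items(SAMPLE_HTML))
```

Notice that the pattern is tied to the exact markup of the list items, so even a small change such as a renamed class would stop it from matching. That is the maintenance cost mentioned in the disadvantages.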

 

HTML parsing

This is the most common method used by developers. The Internet has a large number of HTML web pages, both static pages and dynamic pages generated automatically by web servers. We use HTML parsers to parse these pages and extract the information we need.

 

  • Advantages
    • Easy to write and maintain, as HTML parsers are readily available in most programming languages.
    • The output is clean and structured, as we can select specific pieces of information from the page using the HTML parser.

 

  • Disadvantages
    • This works based on the structure of the page, so we may need to change the selectors if the website's structure changes.
    • The chances of getting blocked by the server are higher, as we're not using a regular web browser.
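
As a rough illustration of HTML parsing, the sketch below uses Python's built-in `html.parser` module, so nothing extra needs to be installed. The sample page, the `TitleParser` class and the `extract_titles` helper are hypothetical names chosen for this example; third-party parsers offer more convenient selectors, but the idea is the same.

```python
from html.parser import HTMLParser

# Hypothetical page content standing in for a downloaded HTML document.
SAMPLE_HTML = """
<html><body>
  <h2 class="title">First post</h2>
  <p>Intro text</p>
  <h2 class="title">Second post</h2>
  <h2>Sidebar</h2>
</body></html>
"""


class TitleParser(HTMLParser):
    """Collect the text of every <h2 class="title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True
            self.titles.append("")

    def handle_data(self, data):
        # Only keep text that appears inside a matching <h2>.
        if self._in_title:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False


def extract_titles(html):
    """Parse the HTML and return the collected titles, whitespace-trimmed."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


print(extract_titles(SAMPLE_HTML))
```

Here the selection depends on the tag name and class rather than on the raw text, which is why the output stays clean, and also why the selectors need updating when the site's structure changes.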

 

DOM parsing

Using a full-fledged web browser, programs can retrieve dynamic content generated by client-side scripts. Software like Selenium can drive a real web browser such as Google Chrome or Firefox, either in headless mode (running the browser in the background without a user interface) or in a virtual display, to access websites and then parse and extract the information.

  • Advantages
    • We can get dynamic content, as the pages are rendered on the client side just like in a normal browser.
    • The chances of getting blocked are lower, as we're using a real browser.
    • Writing selectors is easier, as we're processing the same data that we see in the browser.

 

  • Disadvantages
    • This method is resource intensive, as we're running a complete web browser in the background.
    • A little slower than the HTML parsing method, as the page is built on the client side.
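
A browser-driven scraper might look roughly like the sketch below. It assumes the Selenium 4 Python package and a Chrome installation are available on the machine; the function name `fetch_rendered_headings`, the default selector and the example URL are placeholders for illustration, not part of any real site.

```python
def fetch_rendered_headings(url, css_selector="h2"):
    """Open the page in headless Chrome and return the text of matching elements.

    Assumption: Selenium 4 and Chrome are installed. The imports are kept
    inside the function so this file still loads where Selenium is missing.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless")  # run Chrome without a visible window

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)  # the browser loads the page and runs its scripts
        elements = driver.find_elements(By.CSS_SELECTOR, css_selector)
        return [element.text for element in elements]
    finally:
        driver.quit()  # always shut the browser down to free resources


# Example call (placeholder URL, not executed here):
# headings = fetch_rendered_headings("https://example.com")
```

Pages that load content well after the initial page load may also need an explicit wait before the elements are read, which this sketch leaves out for brevity.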

 

As discussed earlier, a lot of research is going on in this field to improve visual semantic parsing of web pages. This would eliminate the dependency on parsers, as the system parses the web page the way the human eye does. Currently this method is mostly used for research purposes, as it is complex to build and set up, and its accuracy is far lower than that of the conventional methods.

 

Scraping Ethics

As a scraper, it's important to have some ethics. These are not strict laws or rules that we have to follow, but guidelines that help keep a good balance in the way the Internet works.

  • Don't tap into protected data or sensitive user information.
  • Always read robots.txt and try to obey the site's terms of use.
  • Do not overload the server. Request data only at reasonable rates.
  • Make sure your traffic cannot be mistaken for a DDoS (Distributed Denial of Service) attack.
  • Do not pass off data as your own when it's not. Don't violate copyright laws.
  • Try to provide value back to the site you are scraping, such as driving traffic by crediting the site in your article/post.
  • Only save the data that you need.
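
The robots.txt and request-rate guidelines above can be sketched with Python's standard `urllib.robotparser` module. The robots.txt content, the `example-scraper` user agent and the helper names `build_rules` and `polite_urls` are assumptions made up for this example; a real scraper would download the site's actual robots.txt first.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; normally fetched from the target site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    rules = RobotFileParser()
    rules.parse(robots_text.splitlines())
    return rules


def polite_urls(urls, rules, user_agent="example-scraper", default_delay=1.0):
    """Yield only the URLs robots.txt allows, pausing between them."""
    # Use the site's Crawl-delay if it declares one, else a default pause.
    delay = rules.crawl_delay(user_agent) or default_delay
    first = True
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            continue  # skip anything the site asked crawlers to avoid
        if not first:
            time.sleep(delay)  # keep the request rate reasonable
        first = False
        yield url


# Example usage (the URLs are placeholders):
# rules = build_rules(SAMPLE_ROBOTS)
# for url in polite_urls(["https://example.com/blog"], rules):
#     ...  # download and process the page here
```

This only covers two of the guidelines. Checking the terms of use and deciding which data is appropriate to store still has to be done by a human.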

Many other people and companies like us scrape according to their own goals and preferences. Please respect the content providers (websites), follow the guidelines above, and be an ethical scraper.

 

Hope this helps! We'll discuss the implementation of these methods using Python in the next post. Until then, "Eat, Code, Sleep, Repeat". Thank you.

 

