Python Package For Web Scraping



Web scraping is one of the easiest and fastest ways to extract data from the internet. With so many web scraping tutorials and guides available for so many frameworks, programming languages, and libraries, it can be quite confusing to pick one for your web scraping needs. In this article, we will take a look at some of the top Python web scraping frameworks and libraries.

Popular Python Web Scraping Libraries and Frameworks

Scrapy, the most complete of these, is a powerful Python web scraping and web crawling framework: it downloads web pages asynchronously, processes them and saves them, and it handles multithreading, crawling (the process of going from link to link to find every URL on a website), sitemap crawling, and more. The packages below cover the rest of a typical scraping workflow, from making HTTP requests to parsing the downloaded HTML.

  1. Scrapy – The complete framework
  2. Urllib
  3. Python Requests
  4. Selenium
  5. Beautifulsoup
  6. LXML

Scrapy – The complete web scraping framework

Scrapy is an open-source web scraping framework written in Python that takes care of everything from downloading the HTML of web pages to storing them in the form you want. For those of you who are familiar with Django, Scrapy is quite similar to it. The requests you make with Scrapy are scheduled and processed asynchronously, because it is built on top of Twisted, an asynchronous networking framework.

What is asynchronous?

For those of you who aren't familiar, let's take an example. Assume you have to make 100 phone calls. The conventional approach is to sit next to the phone, dial the first number, wait for the response, process the call and only then move on to the next one. This is how conventional web scraping works. With Scrapy you can dial, say, 40 numbers at once and process each call as and when it receives a response, so no time is wasted waiting. This becomes extremely important when your scraping needs are large.

Pros

  • Much lower CPU usage
  • Much lower memory consumption
  • Extremely efficient compared to sequential scraping
  • The well-designed architecture offers you both robustness and flexibility.
  • You can easily develop custom middleware or pipelines to add custom functionality

Cons

  • Overkill for simple jobs
  • Might be difficult to install.
  • The learning curve is quite steep.
  • Not very beginner-friendly, since it is a full-fledged framework

Installation:

To install Scrapy using conda, run:

conda install -c conda-forge scrapy

Alternatively, if you are more familiar with installation from PyPI, you can install it using pip:

pip install scrapy

Note that this may sometimes require solving compilation issues for some Scrapy dependencies depending on your operating system, so be sure to check the platform-specific installation notes.

Best Use Case

Scrapy is best if you need to build a real spider or web crawler for large-scale scraping, instead of just scraping a few pages here and there. It offers extensibility and flexibility for your project.
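To get a feel for the framework, here is a minimal sketch of a Scrapy spider. It targets quotes.toscrape.com, a sandbox site built for scraping practice, so the selectors are specific to that site and purely illustrative:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract each quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link, if there is one
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Saved as quotes_spider.py, it can be run without creating a full project using scrapy runspider quotes_spider.py -o quotes.json.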


Urllib

As the official docs say, urllib is a Python standard-library package that collects several modules for working with URLs (Uniform Resource Locators). It also offers a slightly more complex interface for handling common situations – like basic authentication, encoding, cookies, proxies, and so on – through objects called handlers and openers. It contains the following modules (a short usage sketch follows the list):

  • urllib.request for opening and reading URLs
  • urllib.error containing the exceptions raised by urllib.request
  • urllib.parse for parsing URLs
  • urllib.robotparser for parsing robots.txt files
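As a rough sketch of how these modules fit together (httpbin.org is a public testing endpoint, used here purely for illustration):

from urllib.parse import urlencode, urlparse
from urllib.request import urlopen

# With urllib you encode query parameters yourself
params = urlencode({"q": "web scraping", "page": 1})
url = "https://httpbin.org/get?" + params

# Download the page and read the raw response body
with urlopen(url) as response:
    body = response.read().decode("utf-8")

print(urlparse(url).netloc)  # httpbin.org
print(body[:200])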

Pros

  • Included in the Python standard library
  • It defines functions and classes to help with URL actions (basic and digest authentication, redirections, cookies, etc)

Cons

  • Unlike Requests, with urllib you need to encode query parameters yourself using urllib.parse.urlencode() before passing them
  • Complicated when compared to Python Requests

Installation:

urllib is already included with your Python installation, so you don't need to install anything.

Best Use Case

  • If you need advanced control over the requests you make

Requests – HTTP for humans

Requests is a Python HTTP library and a perfect example of how beautiful an API can be with the right level of abstraction. In the words of its own documentation, it lets you send "organic, grass-fed" HTTP requests, without the need for manual labor.
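A quick sketch of what that looks like in practice (httpbin.org is a public testing endpoint, used here only for illustration):

import requests

# Query parameters are passed as a dict - no manual URL encoding needed
response = requests.get(
    "https://httpbin.org/get",
    params={"q": "web scraping"},
    timeout=10,
)

print(response.status_code)              # e.g. 200
print(response.headers["Content-Type"])  # application/json
print(response.json()["args"])           # {'q': 'web scraping'}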

Pros

  • Easier and shorter code than urllib.
  • Thread-safe.
  • Multipart File Uploads & Connection Timeouts
  • Elegant Key/Value Cookies & Sessions with Cookie Persistence
  • Automatic Decompression
  • Basic/Digest Authentication
  • Browser-style SSL Verification
  • Keep-Alive & Connection Pooling
  • Good Documentation
  • No need to manually add query strings to your URLs
  • Supports all the standard HTTP methods used by RESTful APIs – GET, POST, PUT, DELETE and more.

Cons

  • If the web page loads or hides content with JavaScript, then Requests alone might not be the way to go.

Installation

You can install Requests using conda:

conda install -c anaconda requests

or using pip:

pip install requests

Best Use Case

  • If you are a beginner and your scraping task is simple, with no JavaScript-rendered elements

Learn More: How To Scrape Amazon Product Details and Pricing using Python

Selenium – The Automator

Selenium is a browser automation tool, originally written in Java, that you can drive from Python via the selenium package. Though primarily used for writing automated tests for web applications, it has come into heavy use for scraping pages that rely on JavaScript.
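As an illustration, a minimal sketch of driving a headless Chrome browser with the Python bindings could look like this; it assumes Chrome and a matching ChromeDriver are available, and it points at quotes.toscrape.com/js/, a practice page that renders its content with JavaScript:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Run the browser without a visible window
options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    # This page injects its content with JavaScript, so plain HTTP clients would miss it
    driver.get("https://quotes.toscrape.com/js/")
    for quote in driver.find_elements(By.CSS_SELECTOR, "div.quote span.text"):
        print(quote.text)
finally:
    driver.quit()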

Pros

  • Beginner-friendly
  • You get a real browser, so you can see exactly what is going on (unless you run it in headless mode)
  • Mimics human behavior while browsing, including clicks, selections, filling text boxes and scrolling
  • Renders the full web page, including HTML generated via XHR or JavaScript

Cons

  • Very slow
  • Heavy memory use
  • High CPU usage

Installation

To install this package with conda run:

conda install -c conda-forge selenium

Using pip, you can install it by running the following in your terminal:

pip install selenium

You will also need to install a browser driver, such as geckodriver for the Firefox browser interface; without it, Selenium will raise errors. See more of the installation instructions here.

Best Use Case

  • When you need to scrape sites with data tucked away by JavaScript.

Learn More: Web Scraping Hotel Prices using Selenium and Python

The Parsers

  1. Beautiful Soup
  2. LXML

Now that we have the required HTML content, the job is to go through it and extract the data. As is well explained here, regular expressions can be used to extract data from an HTML document, but that is almost never the best way to write maintainable code. With a regex you are parsing text as if it had no structure, so you are far more likely to run into errors. Why bother, when HTML already presents the text in a well-defined structure? Instead of text matching with regexes, we can simply parse the document. For maintainable code, it's best to use parsers.

What are Parsers?

A parser is simply a program that can extract data from HTML and XML documents. It parses the structure into memory and facilitates the use of selectors (either CSS or XPath) to easily extract the data. The advantage is that parsers can automatically correct “bad” HTML (unclosed tags, badly quoted attributes, invalid markup, etc.) and still let us get the data we need. The disadvantage is that parsing requires more processing work in most cases, but as ever it's a trade-off, and it tends to be a worthwhile one.

BS4

Beautiful Soup (BS4) is a Python library for parsing HTML and XML documents. Since the content we parse can come in various encodings, BS4 detects the encoding automatically. It builds a parse tree that helps you navigate the parsed document easily and find what you need.
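A small sketch of parsing HTML with Beautiful Soup, using an inline HTML snippet so it runs on its own:

from bs4 import BeautifulSoup

html = """
<html><body>
  <ul>
    <li class="product"><a href="/p/1">Widget</a> <span class="price">$10</span></li>
    <li class="product"><a href="/p/2">Gadget</a> <span class="price">$25</span>
  </ul>
</body></html>
"""

# BS4 builds a parse tree even from slightly malformed markup (note the unclosed <li>)
soup = BeautifulSoup(html, "html.parser")

for item in soup.select("li.product"):
    name = item.a.get_text()
    price = item.select_one("span.price").get_text()
    print(name, price, item.a["href"])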

Pros

  • Easier to write a BS4 snippet than LXML.
  • Small learning curve, easy to learn.
  • Quite robust.
  • Handles malformed markup well.
  • Excellent support for encoding detection

Cons


  • If the wrong default parser is chosen for you, it may parse the document incorrectly without any warning, which can lead to disastrous results.
  • Projects built using BS4 might not be flexible in terms of extensibility.
  • It is relatively slow; you may need to bring in multiprocessing to speed it up.

Installation:

To install this package with conda run:
conda install -c anaconda beautifulsoup4

Using pip, you can install it with:

pip install beautifulsoup4

Best Use Case

  • When you are a beginner to web scraping.
  • If you need to handle messy documents, choose Beautiful Soup.

LXML

lxml is a high-performance, production-quality HTML and XML parsing library.
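A short sketch of parsing the same kind of markup with lxml and XPath selectors:

from lxml import html

page = html.fromstring("""
<html><body>
  <ul>
    <li class="product"><a href="/p/1">Widget</a> <span class="price">$10</span></li>
    <li class="product"><a href="/p/2">Gadget</a> <span class="price">$25</span></li>
  </ul>
</body></html>
""")

# XPath pulls out names, prices and links in one pass
names = page.xpath('//li[@class="product"]/a/text()')
prices = page.xpath('//li[@class="product"]/span[@class="price"]/text()')
links = page.xpath('//li[@class="product"]/a/@href')

for name, price, link in zip(names, prices, links):
    print(name, price, link)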

Pros

  • It is the best-performing parser, as benchmarks show.
  • The most feature-rich Python library for processing HTML and XML.

Cons

  • The official documentation isn’t that friendly, so beginners are better off starting somewhere else.

Installation:

To install this package with conda run:
conda install -c anaconda lxml

You can install lxml directly using pip:

pip install lxml

Best Use Case

  • If you need speed, go for lxml.


Our recommendations

If your web scraping needs are simple, then any of the above tools might be easy to pick up and implement. For smaller data requirements there are free web scraping tools you could try out that do not need much coding skills and are cost-effective. We have a post on free web scraping tools that can help you decide.

But when you have a large amount of data that needs to be scraped consistently, especially from pages that might change their structure and links, doing it on your own might be too much of an effort.


You might want to look at Scalable do-it-yourself scraping & How to build and run scrapers on a large scale

If you are looking for professional help with scraping complex websites, get in touch with us.




Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.

Web crawling is a powerful technique to collect data from the web by finding all the URLs for one or multiple domains. Python has several popular web crawling libraries and frameworks.

In this article, we will first introduce different crawling strategies and use cases. Then we will build a simple web crawler from scratch in Python using two libraries: requests and Beautiful Soup. Next, we will see why it’s better to use a web crawling framework like Scrapy. Finally, we will build an example crawler with Scrapy to collect film metadata from IMDb and see how Scrapy scales to websites with several million pages.

What is a web crawler?



Web crawling and web scraping are two different but related concepts. Web crawling is a component of web scraping: the crawler logic finds URLs to be processed by the scraper code.

A web crawler starts with a list of URLs to visit, called the seed. For each URL, the crawler finds links in the HTML, filters those links based on some criteria and adds the new links to a queue. All the HTML or some specific information is extracted to be processed by a different pipeline.


Web crawler architecture diagram

Web crawling strategies

In practice, web crawlers only visit a subset of pages depending on the crawler budget, which can be a maximum number of pages per domain, depth or execution time.

Most popular websites provide a robots.txt file to indicate which areas of the website each user agent is disallowed from crawling. The counterpart of the robots file is the sitemap.xml file, which lists the pages that can be crawled.

Popular web crawler use cases include:

  • Search engines (Googlebot, Bingbot, Yandex Bot…) collect all the HTML for a significant part of the Web. This data is indexed to make it searchable.
  • SEO analytics tools, on top of collecting the HTML, also collect metadata such as the response time and response status to detect broken pages, and the links between different domains to collect backlinks.
  • Price monitoring tools crawl e-commerce websites to find product pages and extract metadata, notably the price. Product pages are then periodically revisited.
  • Common Crawl maintains an open repository of web crawl data. For example, the archive from October 2020 contains 2.71 billion web pages.

Next, we will compare three different strategies for building a web crawler in Python. First, using only standard libraries, then third party libraries for making HTTP requests and parsing HTML and finally, a web crawling framework.

Building a simple web crawler in Python from scratch

To build a simple web crawler in Python we need at least one library to download the HTML from a URL and an HTML parsing library to extract links. Python provides standard libraries urllib for making HTTP requests and html.parser for parsing HTML. An example Python crawler built only with standard libraries can be found on Github.

The standard Python libraries for HTTP requests and HTML parsing are not very developer-friendly. Other popular libraries, like requests, branded as HTTP for humans, and Beautiful Soup, provide a better developer experience. You can install the two libraries locally:

pip install requests beautifulsoup4


A basic crawler can be built following the previous architecture diagram.
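The original code listing is not reproduced here; the sketch below is reconstructed to match the description in the next paragraph, so treat details such as the logging format as illustrative:

import logging
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

logging.basicConfig(format="%(asctime)s %(levelname)s:%(message)s", level=logging.INFO)


class Crawler:
    def __init__(self, urls=None):
        self.visited_urls = []
        self.urls_to_visit = list(urls or [])

    def download_url(self, url):
        # Fetch the raw HTML with the requests library
        return requests.get(url).text

    def get_linked_urls(self, url, html):
        # Parse the HTML with Beautiful Soup and yield absolute links
        soup = BeautifulSoup(html, "html.parser")
        for link in soup.find_all("a"):
            path = link.get("href")
            if path and path.startswith("/"):
                path = urljoin(url, path)
            yield path

    def add_url_to_visit(self, url):
        # Filter out URLs that were already visited or queued
        if url and url not in self.visited_urls and url not in self.urls_to_visit:
            self.urls_to_visit.append(url)

    def crawl(self, url):
        html = self.download_url(url)
        for linked_url in self.get_linked_urls(url, html):
            self.add_url_to_visit(linked_url)

    def run(self):
        while self.urls_to_visit:
            url = self.urls_to_visit.pop(0)
            logging.info(f"Crawling: {url}")
            try:
                self.crawl(url)
            except Exception:
                logging.exception(f"Failed to crawl: {url}")
            finally:
                self.visited_urls.append(url)


if __name__ == "__main__":
    Crawler(urls=["https://www.imdb.com/"]).run()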

The code above defines a Crawler class with helper methods to download_url using the requests library, get_linked_urls using the Beautiful Soup library and add_url_to_visit to filter URLs. The URLs to visit and the visited URLs are stored in two separate lists. You can run the crawler on your terminal.

The crawler logs one line for each visited URL.

The code is very simple but there are many performance and usability issues to solve before successfully crawling a complete website.

  • The crawler is slow and supports no parallelism. As can be seen from the timestamps, it takes about one second to crawl each URL. Each time the crawler makes a request it waits for the request to be resolved and no work is done in between.
  • The download URL logic has no retry mechanism, and the URL queue is not a real queue, which makes it inefficient with a high number of URLs.
  • The link extraction logic doesn’t support standardizing URLs by removing URL query string parameters, doesn’t handle URLs starting with #, doesn’t support filtering URLs by domain or filtering out requests to static files.
  • The crawler doesn’t identify itself and ignores the robots.txt file.

Next, we will see how Scrapy provides all these functionalities and makes it easy to extend for your custom crawls.

Web crawling with Scrapy

Scrapy is the most popular web scraping and crawling Python framework with 40k stars on Github. One of the advantages of Scrapy is that requests are scheduled and handled asynchronously. This means that Scrapy can send another request before the previous one is completed or do some other work in between. Scrapy can handle many concurrent requests but can also be configured to respect the websites with custom settings, as we’ll see later.

Scrapy has a multi-component architecture. Normally, you will implement at least two different classes: Spider and Pipeline. Web scraping can be thought of as an ETL where you extract data from the web and load it to your own storage. Spiders extract the data and pipelines load it into the storage. Transformation can happen both in spiders and pipelines, but I recommend that you set a custom Scrapy pipeline to transform each item independently of each other. This way, failing to process an item has no effect on other items.
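As a rough sketch of that split, a pipeline that transforms each item independently could look like this (the price field and the cleaning rule are hypothetical):

class CleanPricePipeline:
    """Sketch of a Scrapy item pipeline that normalizes a hypothetical 'price' field."""

    def process_item(self, item, spider):
        # Each item is transformed on its own, so a failure here only affects this item
        raw_price = item.get("price")
        if raw_price:
            item["price"] = float(str(raw_price).replace("$", "").replace(",", ""))
        return item

The pipeline is then enabled through the project's ITEM_PIPELINES setting.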

On top of all that, you can add spider and downloader middlewares between the components, as can be seen in the diagram below.


Scrapy Architecture Overview [source]

If you have used Scrapy before, you know that a web scraper is defined as a class that inherits from the base Spider class and implements a parse method to handle each response. If you are new to Scrapy, you can read this article for easy scraping with Scrapy.

Scrapy also provides several generic spider classes: CrawlSpider, XMLFeedSpider, CSVFeedSpider and SitemapSpider. The CrawlSpider class inherits from the base Spider class and provides an extra rules attribute to define how to crawl a website. Each rule uses a LinkExtractor to specify which links are extracted from each page. Next, we will see how to use each one of them by building a crawler for IMDb, the Internet Movie Database.

Building an example Scrapy crawler for IMDb

Before trying to crawl IMDb, I checked IMDb robots.txt file to see which URL paths are allowed. The robots file only disallows 26 paths for all user-agents. Scrapy reads the robots.txt file beforehand and respects it when the ROBOTSTXT_OBEY setting is set to true. This is the case for all projects generated with the Scrapy command startproject.

scrapy startproject scrapy_crawler

This command creates a new project with the default Scrapy project folder structure.

Then you can create a spider in scrapy_crawler/spiders/imdb.py with a rule to extract all links.
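The original listing is not shown here; a minimal sketch of such a spider could look like this:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ImdbCrawler(CrawlSpider):
    name = "imdb"
    allowed_domains = ["www.imdb.com"]
    start_urls = ["https://www.imdb.com/"]
    # A single rule with a default LinkExtractor follows every link it finds
    rules = (Rule(LinkExtractor()),)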

You can launch the crawler in the terminal with scrapy crawl imdb (using the spider name defined above).

You will get lots of logs, including one log for each request. Exploring the logs I noticed that even if we set allowed_domains to only crawl web pages under https://www.imdb.com, there were requests to external domains, such as amazon.com.

IMDb redirects from URL paths under whitelist-offsite and whitelist to external domains. There is an open Scrapy GitHub issue showing that external URLs don't get filtered out when the OffsiteMiddleware is applied before the RedirectMiddleware. To fix this issue, we can configure the link extractor to deny URLs starting with two regular expressions, as sketched below.
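A sketch of the adjusted link extractor, based on the two paths mentioned above (the exact patterns are assumptions):

import re
from scrapy.linkextractors import LinkExtractor

link_extractor = LinkExtractor(
    # Deny the redirect paths so links to external domains are filtered out
    deny=[
        re.escape("https://www.imdb.com/whitelist-offsite"),
        re.escape("https://www.imdb.com/whitelist"),
    ],
)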

The Rule and LinkExtractor classes support several arguments to filter out URLs. For example, you can ignore specific URL extensions and reduce the number of duplicate URLs by sorting query strings. If you don't find a specific argument for your use case, you can pass a custom function to process_links in Rule or process_value in LinkExtractor.


For example, IMDb has two different URLs with the same content.

To limit the number of crawled URLs, we can remove all query strings from URLs with the url_query_cleaner function from the w3lib library and use it in process_links.
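A sketch of such a process_links function; with url_query_cleaner's defaults, all query parameters are removed:

from w3lib.url import url_query_cleaner


def process_links(links):
    # Strip query strings so URLs differing only in parameters collapse into one
    for link in links:
        link.url = url_query_cleaner(link.url)
        yield link

The function is then wired into the rule with Rule(LinkExtractor(...), process_links=process_links).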

Now that we have limited the number of requests to process, we can add a parse_item method to extract data from each page and pass it to a pipeline to store it. For example, we can either extract the whole response.text to process it in a different pipeline or select the HTML metadata. To select the HTML metadata in the header tag we could write our own XPaths, but I find it better to use a library, extruct, that extracts all metadata from an HTML page. You can install it with pip install extruct.
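Putting the pieces together, a sketch of the spider with a parse_item callback might look like this (the yielded fields are illustrative, and the deny patterns are the assumptions from above):

import re

import extruct
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from w3lib.url import url_query_cleaner


def process_links(links):
    for link in links:
        link.url = url_query_cleaner(link.url)
        yield link


class ImdbCrawler(CrawlSpider):
    name = "imdb"
    allowed_domains = ["www.imdb.com"]
    start_urls = ["https://www.imdb.com/"]
    rules = (
        Rule(
            LinkExtractor(
                deny=[
                    re.escape("https://www.imdb.com/whitelist-offsite"),
                    re.escape("https://www.imdb.com/whitelist"),
                ],
            ),
            process_links=process_links,
            callback="parse_item",
            follow=True,
        ),
    )

    def parse_item(self, response):
        # Extract only Open Graph and JSON-LD metadata from the page
        metadata = extruct.extract(
            response.text,
            response.url,
            syntaxes=["opengraph", "json-ld"],
        )
        yield {"url": response.url, "metadata": metadata}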

I set the follow attribute to True so that Scrapy still follows all links from each response even if we provided a custom parse method. I also configured extruct to extract only Open Graph metadata and JSON-LD, a popular method for encoding linked data using JSON in the Web, used by IMDb. You can run the crawler and store items in JSON lines format to a file.
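Assuming the spider name imdb from the sketches above:

scrapy crawl imdb -o imdb.jl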

The output file imdb.jl contains one line for each crawled item. For example, the extracted Open Graph metadata for a movie taken from the <meta> tags in the HTML looks like this.

The JSON-LD for a single item is too long to be included in the article, here is a sample of what Scrapy extracts from the <script type='application/ld+json'> tag.

Exploring the logs, I noticed another common issue with crawlers. By sequentially clicking on filters, the crawler generates URLs with the same content, only that the filters were applied in a different order.

Long filter and search URLs are a difficult problem that can be partially solved by limiting the length of URLs with the URLLENGTH_LIMIT Scrapy setting.


I used IMDb as an example to show the basics of building a web crawler in Python. I didn’t let the crawler run for long as I didn’t have a specific use case for the data. In case you need specific data from IMDb, you can check the IMDb Datasets project that provides a daily export of IMDb data and IMDbPY, a Python package for retrieving and managing the data.

Web crawling at scale

If you attempt to crawl a big website like IMDb, with over 45 million pages according to Google, it's important to crawl responsibly by configuring the following settings. You can identify your crawler and provide contact details in the BOT_NAME setting. To limit the pressure you put on the website's servers, you can increase the DOWNLOAD_DELAY, limit CONCURRENT_REQUESTS_PER_DOMAIN or set AUTOTHROTTLE_ENABLED, which adapts those settings dynamically based on the response times of the server.
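For example, in the project's settings.py (the values are only illustrative starting points, and the contact address is a placeholder):

# Identify the crawler; a contact address in the user agent is common courtesy
BOT_NAME = "scrapy_crawler"
USER_AGENT = "scrapy_crawler (+https://example.com/contact)"

# Respect robots.txt
ROBOTSTXT_OBEY = True

# Reduce the pressure on the website's servers
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Let Scrapy adapt the delay dynamically based on server response times
AUTOTHROTTLE_ENABLED = True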

Notice that Scrapy crawls are optimized for a single domain by default. If you are crawling multiple domains, check these settings to optimize for broad crawls, including changing the default crawl order from depth-first to breadth-first. To limit your crawl budget, you can limit the number of requests with the CLOSESPIDER_PAGECOUNT setting of the close spider extension.

With the default settings, Scrapy crawls about 600 pages per minute for a website like IMDb. To crawl 45M pages it will take more than 50 days for a single robot. If you need to crawl multiple websites it can be better to launch separate crawlers for each big website or group of websites. If you are interested in distributed web crawls, you can read how a developer crawled 250M pages with Python in 40 hours using 20 Amazon EC2 machine instances.

In some cases, you may run into websites that require executing JavaScript code to render all the HTML. Fail to do so, and you may not collect all the links on the website. Because nowadays it's very common for websites to render content dynamically in the browser, I wrote a Scrapy middleware for rendering JavaScript pages using ScrapingBee's API.

Conclusion

We compared the code of a Python crawler using third-party libraries for downloading URLs and parsing HTML with a crawler built using a popular web crawling framework. Scrapy is a very performant web crawling framework and it’s easy to extend with your custom code. But you need to know all the places where you can hook your own code and the settings for each component.


Configuring Scrapy properly becomes even more important when crawling websites with millions of pages. If you want to learn more about web crawling I suggest that you pick a popular website and try to crawl it. You will definitely run into new issues, which makes the topic fascinating!
