I decided to write a short post about how I use Python and XPath to extract web content. I do this often to build research data sets. This post was inspired by another blog post: Luciano Mammino – Extracting data from Wikipedia using curl, grep, cut and other shell commands.

Where Luciano uses a bunch of Linux command line tools to extract data from Wikipedia, I thought I’d demonstrate pulling the same data using Python and XPath. Once I discovered using XPath in Python, my online data collection for research became a whole lot easier!
XPath to query parts of an HTML structure
XPath is a way of identifying nodes and content in an XML document structure (including HTML). You can create an XPath query to find specific tables, reference specific rows, or even find cells of a table with certain attributes. It's a great way to slice up content on a website.
We’ll start with the target URL https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_judo. We extract the HTML document elements and identify our medalists using the table structure on the page.
Use an IDE!
I highly recommend you do this in a good Integrated Development Environment (IDE) – PyCharm is the best I’ve seen for Python development and there’s a free community edition! If you think you’re too hardcore, then go for it with a text editor, whatever floats your boat.
In PyCharm I set up the basic URL download, set a breakpoint, and then, in debug mode, evaluate expressions until I home in on my target content.
Python to grab HTML content
The first bit of Python code just pulls in the web page as a string, and creates an XML tree out of it, so we can use the data with XPath:
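Here's a minimal sketch of that step, assuming the requests and lxml packages are installed (variable names are placeholders, not necessarily the original code):

```python
import requests
from lxml import html

URL = "https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_judo"

page = requests.get(URL)              # pull in the web page as a string
tree = html.fromstring(page.content)  # build a tree we can query with XPath
```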
Using Chrome to identify elements and XPaths
Now we need to know what to extract. An easy way to work out the approximate XPath query is to use the Chrome web browser: right-click an element of interest and select "Inspect Element". You'll get a bunch of data on the side about the element content:
Copying an XPath using Chrome browser and “Inspect Element” gives us a good starting point.
On the HTML element of interest, right-click and select Copy -> Copy XPath. This will give you a reference to that very specific element.
Hot Tip! Code in debug break-mode
Insert a breakpoint and debug your code so we can test the copied XPath query in the 'Evaluate Expression' window. You'll need to play around with your query to make sure you're getting the results you want.
Using debug mode in PyCharm we can insert breakpoints and evaluate expressions. This is really handy when writing parsers and scrapers. Note the evaluated expression result includes an href to "Thierry Rey", a judo gold medalist, so we know we're on the right track!
Once we're happy that we have the correct data coming out of our XPath query, we can bang the rest out in Python. This example selects Gold, Silver and Bronze medalists, but to simulate Luciano's results, we'll combine them all into a single list:
The XPath looks a bit messy, but (work backwards with me) it's just saying: "Get me the text() node of the first [1] anchor <a> element in the second [2] <td> of every <tr> in every <table> within the element with attribute [@] id="mw-content-text"."
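In code, that query looks something like this (continuing from the snippet above; the td positions for Silver and Bronze are assumptions based on the table layout):

```python
# td[2] holds Gold per the query we copied; td[3] and td[4] for Silver
# and Bronze are assumptions about the table layout.
query = '//*[@id="mw-content-text"]//table//tr/td[{col}]/a[1]/text()'

gold = tree.xpath(query.format(col=2))
silver = tree.xpath(query.format(col=3))
bronze = tree.xpath(query.format(col=4))

medalists = gold + silver + bronze  # combine them all into a single list
```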
Post process extracted data
Finally we insert our tested XPath into our code, and the rest is straightforward Python. We can retrieve, manipulate and calculate on any of the list content. To simulate Luciano's output, we'll build a final list with total medal counts:
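A minimal sketch of that post-processing, using collections.Counter to tally medals per athlete (continuing from the snippets above):

```python
from collections import Counter

# Tally how many times each name appears across Gold, Silver and Bronze,
# then print the most decorated athletes first, like Luciano's output.
for name, total in Counter(medalists).most_common():
    print(f"{total}\t{name}")
```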
And we’re done, print the results!
Results
It worked! We get the same results as Luciano using just over a dozen lines of Python code!
There’s no right or wrong way to extract data. Luciano’s method used pure command line tools, and that’s pretty neat. The Python and XPath method is very portable. It helped me significantly in my data collection for research.
Python example code ‘judograbber.py’
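The listing below is a minimal sketch combining the snippets above into one runnable file (not the original; the td column positions remain assumptions about the table layout):

```python
# judograbber.py -- a minimal sketch, not the original file.
import requests
from collections import Counter
from lxml import html

URL = "https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_judo"


def main():
    tree = html.fromstring(requests.get(URL).content)
    query = '//*[@id="mw-content-text"]//table//tr/td[{col}]/a[1]/text()'

    # Gold, Silver and Bronze medalists, combined into a single list
    medalists = []
    for col in (2, 3, 4):
        medalists += tree.xpath(query.format(col=col))

    # Total medal counts per athlete, most decorated first
    for name, total in Counter(medalists).most_common():
        print(f"{total}\t{name}")


if __name__ == "__main__":
    main()
```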
Thursday, July 09, 2020

In 2020, the "digital universe" holds an estimated 40 trillion gigabytes (40 zettabytes) of information. With such a great amount of data available to analyze, it is essential to pair it with web scraping technology, which can effectively reduce manual work and operating costs in the first stage of a big data solution.
When talking about web scraping technology, the Google spider may be the first thing that comes to mind, but the technology is used in a wide variety of scenarios. Before discussing the use cases, this article will help you understand the working logic of web scraping and how to quickly master web scraping skills.
How does a web scraper work?
All web scrapers work in the same manner. They first send a “GET” request to the target website, and then parse the HTML accordingly. However, there are some differences between using computer languages and web scraping tools.
The following code snippet shows an example of a web scraper written in Python. You will find yourself inspecting web structures most of the time while scraping.
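A minimal sketch matching the three steps described below (the URL and the h1 tag are placeholders; imports are added so it runs):

```python
import requests
from bs4 import BeautifulSoup

text = requests.get("https://example.com").text  # "Line 1": the GET part
soup = BeautifulSoup(text, "html.parser")        # "Line 2": load the text as HTML
for heading in soup.find_all("h1"):              # "Line 3" onward: parse the structure
    print(heading.get_text(strip=True))
```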
- Line 1 is the GET part. It sends a “GET” query to the website and then returns all the text information.
- Line 2 is to LOAD the text we request and present it in HTML format.
- Starting from Line 3 till the end, we begin to PARSE the HTML structure and obtain the data/information we need.
If you think it looks like an alien's workbook that doesn't make any sense to a human being, I'm with you. The process of getting web data shouldn't be more complex than copy-and-paste. This is where the magic of a web scraping tool comes into play.
If you have downloaded Octoparse and used it for a while, you have probably tried the Octoparse Task Template Mode and Advanced Mode. When you enter a target URL in Octoparse, it reads the page for you, which is the equivalent of sending a "GET" request to the target website.
No matter which mode you use to build a web scraper, the essential action is to parse the target website. A task template is a ready-to-use parser pre-built by the Octoparse crawler team, while a customized task requires users to build a parser by pointing and clicking.
How to create a scraper from scratch?
We learned the basic working logic of a scraper in the previous part; now we can start practicing how to create one from scratch. In this part, you're going to learn two methods:
Method 1: Build a scraper with Octoparse 8.1
- Auto-generated scraper: enter the target URL to get data
Method 2: Build a scraper with Python
- Step 1: Inspect your data source
- Step 2: Code the GET Part in Pycharm
- Step 3: Code the PARSE Part in Pycharm
Task description: to make the practice more newbie-friendly, we'll work with a simple, well-structured page.

Target website: https://www.imdb.com/india/top-rated-indian-movies/

In this case, our task is to scrape the list of the top-rated Indian movies, including each movie's position, URL, name, year, and rating.
Build a scraper with Octoparse 8.1
- Auto-generated scraper: enter the target URL to get data
Build a scraper with Python
- Step 1: Inspect your data source

Simply press "F12" to open the Chrome developer tools and inspect the HTML. We can figure out the request URL that contains the data we need. Here, we can see that the URL we selected contains all the data we want.
- Step 2: Code the GET Part in Pycharm
Before coding the spider part, we need to import the Python libraries we'll use, as shown below, and define the target URL, https://www.imdb.com/india/top-rated-indian-movies/.
This function gets the data from IMDb's top Indian movies page and converts it into JSON format, giving us each movie's position, URL, name, year, and rating.
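A sketch of what that GET function could look like (the function name and the User-Agent header are assumptions, not the original code):

```python
import requests

URL = "https://www.imdb.com/india/top-rated-indian-movies/"


def get_page(url):
    # IMDb may reject requests without a browser-like User-Agent (an assumption)
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # fail loudly if the GET did not succeed
    return response.text         # the raw HTML as text


html_text = get_page(URL)
```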
- Step 3: Code the PARSE Part in Pycharm
There are two ways to achieve the PARSE part: using regex or a parsing tool like Beautiful Soup. In this case, to make the whole process easier, we use Beautiful Soup. After installing Beautiful Soup on your computer, you only need to add two lines to your Python file: the import and the call that parses the HTML.
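A sketch of the PARSE part (continuing from the GET snippet above; the tag and class selectors are assumptions about IMDb's chart markup at the time):

```python
import json

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, "html.parser")  # the two essential lines: import + parse

movies = []
for position, row in enumerate(soup.select("tbody.lister-list tr"), start=1):
    title_cell = row.select_one("td.titleColumn")
    rating = row.select_one("td.imdbRating strong")
    movies.append({
        "position": position,
        "url": "https://www.imdb.com" + title_cell.a["href"],
        "name": title_cell.a.get_text(strip=True),
        "year": title_cell.span.get_text(strip=True).strip("()"),
        "rating": rating.get_text(strip=True) if rating else None,
    })

print(json.dumps(movies[:5], indent=2))  # the JSON output described above
```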
With the above steps, the IMDb task is done! All you need to do is run the code and store the data on your computer.
Final thoughts
All in all, creating a crawler and scraping data is no longer the exclusive domain of programmers. More and more people with barely any coding background can scrape online data with the assistance of cutting-edge tools like Octoparse.
It's easier than ever to step into the big data arena with the assistance of these tools. Perhaps what we need to consider now is what value we can draw from the data and information we gather online.
Author: Erika
