Since their inception, websites have been used to share information. Whether it is a Wikipedia article, a YouTube channel, an Instagram account, or a Twitter handle, they are all packed with interesting data that is accessible to anybody with an internet connection and a web browser.

But what if we want to get specific data programmatically?

There are two ways to do that:

  1. Using an official API
  2. Web Scraping

The concept of an API (Application Programming Interface) was introduced to exchange data among different systems in a standard way. But, most of the time, website owners don’t provide any API. In that case, we are left with web scraping as the only way to extract the data.

Basically, every web page is returned from the server in HTML format, meaning that our actual data is nicely packed inside HTML elements. That makes the whole process of retrieving specific data easy and straightforward.

This tutorial will be the ultimate guide for you to learn web scraping using the Python programming language. At first, I’ll walk you through some basic examples to make you familiar with web scraping. Later on, we’ll use that knowledge to extract data about football matches from Livescore.cz.

Getting Started

To get us started, you will need to start a new Python 3 project and install Scrapy (a web scraping and web crawling library for Python). I’m using pipenv for this tutorial, but you can use pip and venv, or conda.
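For instance, the installation step might look like this (a sketch assuming you are starting from an empty project folder):

```
# Install Scrapy into a pipenv-managed virtual environment
pipenv install scrapy

# Alternatively, inside an activated venv:
# pip install scrapy
```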

At this point, you have Scrapy installed, but you still need to create a new web scraping project, and for that Scrapy provides us with a command-line tool that does the work for us.

Let’s now create a new project named web_scraper by using the Scrapy CLI.

If you are using pipenv like me, use:
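With pipenv, the command would look like this (using the project name web_scraper from above):

```
pipenv run scrapy startproject web_scraper .
```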

Otherwise, from your regular environment, use:
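With a plain virtual environment, the equivalent command would be:

```
scrapy startproject web_scraper .
```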

This will create a basic project in the current directory with the following structure:

[Screenshot: the generated project directory structure]
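The layout generated by scrapy startproject typically looks roughly like this (exact files may vary by Scrapy version):

```
web_scraper/
├── scrapy.cfg            # deploy configuration
└── web_scraper/
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py    # spider/downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/          # your spiders go here
        └── __init__.py
```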

Building our first Spider with XPath queries

We will start our web scraping tutorial with a very simple example. At first, we’ll locate the logo of the Live Code Stream website inside its HTML. And as we know, it is just text and not an image, so we’ll simply extract this text.

The code

To get started, we need to create a new spider for this project. We can do that by either creating a new file or using the CLI.

Since we already know the code we need, we will create a new Python file at this path: /web_scraper/spiders/live_code_stream.py

Here are the contents of this file.

[Screenshot: contents of live_code_stream.py]

Code explanation:

  • First of all, we imported the Scrapy library because we need its functionality to create a Python web spider. This spider will then be used to crawl the specified website and extract useful information from it.
  • We created a class and named it LiveCodeStreamSpider. Basically, it inherits from scrapy.Spider and that’s why we passed it as a parameter.
  • Now, an important step is to define a unique name for your spider using a variable called name. Remember that you are not allowed to use the name of an existing spider. Similarly, you cannot reuse this name to create new spiders. It must be unique throughout this project.
  • After that, we passed the website URL using the start_urls list.
  • Finally, we created a method called parse() that will locate the logo inside the HTML code and extract its text. In Scrapy, there are two methods to find HTML elements inside the source code. These are mentioned below.
  • CSS
  • XPath

You can even use some external libraries like BeautifulSoup and lxml. But, for this example, we’ve used XPath.
A quick way to determine the XPath of any HTML element is to open it inside Chrome DevTools. Now, simply right-click on the HTML code of that element, hover the mouse cursor over “Copy” inside the popup menu that just appeared. Finally, click the “Copy XPath” menu item.

Have a look at the screenshot below to understand it better.

[Screenshot: copying an element’s XPath from the Chrome DevTools context menu]
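To see what such an XPath query actually returns, here is a small standalone sketch using lxml (one of the external libraries mentioned above) against a hypothetical HTML snippet standing in for the page source:

```python
from lxml import html

# Hypothetical snippet standing in for the site's real markup.
snippet = '<header><a class="site-logo" href="/">Live Code Stream</a></header>'
doc = html.fromstring(snippet)

# XPath: select the text of the anchor whose class is "site-logo".
logo = doc.xpath('//a[@class="site-logo"]/text()')[0]
print(logo)  # Live Code Stream
```

Scrapy’s response.xpath() works the same way conceptually, except that it runs against the HTML downloaded from start_urls.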