Build your first web scraping spider in less than 30 minutes

Anuroobika K
4 min read · Jun 28, 2021

Web Scraping is an automatic method to obtain large amounts of data from websites. Most of the web data is unstructured in an HTML format which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications.

Every web page has its own structure and web elements, which is why you need to write your crawlers/spiders according to the page being extracted.

Scrapy uses spiders: self-contained crawlers that are given a set of instructions. Scrapy makes it easier to build and scale large crawling projects because developers can reuse their code.
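For a sense of what that looks like, here is a minimal spider sketch (the name and URL are just placeholders, not part of this tutorial's project):

import scrapy

class ExampleSpider(scrapy.Spider):
    # every spider needs a unique name; Scrapy uses it to run the crawl
    name = "example"
    # the spider starts by downloading these URLs
    start_urls = ["https://example.com"]

    def parse(self, response):
        # parse() receives each downloaded page and extracts data from it
        yield {"title": response.css("h1::text").get()}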

Open Anaconda Navigator and create a new environment (other than the base/root). You can give it any name. Let’s call it ‘virtual’.
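If you prefer the command line to the Navigator UI, the equivalent conda commands are below (the Python version here is just an example):

>conda create -n virtual python=3.7 -y

>conda activate virtual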

Next, open a terminal from that environment and install Scrapy by running the command below:

>conda install -c conda-forge scrapy==1.6 pylint autopep8 -y
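You can confirm the install worked by checking the version:

>scrapy version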

I got an SSL verification error at this step, and fixed it with this command:

>conda config --set ssl_verify false

(Note that this disables SSL certificate verification for conda; treat it as a temporary workaround.)

#Make a directory for our project. Let's call it 'projects',

>mkdir projects

#Go to that directory

>cd projects

#Start a new project named 'worldometers', after the site we will scrape data from

>scrapy startproject worldometers
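This generates the standard Scrapy project layout:

worldometers/
    scrapy.cfg
    worldometers/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py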

When you create a project, Scrapy prints instructions for starting your first spider.

We can follow Scrapy's instructions to create a spider called 'countries', which will scrape the list of country names from the website.

>cd worldometers

>scrapy genspider countries www.worldometers.info/world-population/population-by-country
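This creates worldometers/spiders/countries.py with a basic template, roughly like the following (I have trimmed allowed_domains to the bare domain, which is what you usually want):

import scrapy

class CountriesSpider(scrapy.Spider):
    name = 'countries'
    allowed_domains = ['www.worldometers.info']
    start_urls = ['https://www.worldometers.info/world-population/population-by-country/']

    def parse(self, response):
        # extraction logic will go here
        pass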

Now, install IPython from the Anaconda prompt; Scrapy's shell will automatically use IPython when it is available, which makes it much more pleasant to work in.
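One way to install it, assuming the same conda-forge channel as before:

>conda install -c conda-forge ipython -y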

Open the Scrapy shell with the command below:

>scrapy shell

Run the crawler on the web page using the fetch command in the Scrapy shell. A crawler or spider goes through a webpage, downloading its text and metadata.

>fetch("https://www.worldometers.info/world-population/population-by-country/")

You should see a log line reporting that the page was crawled with a 200 (success) status. The crawler returns a response object, which can be viewed using the view(response) command in the shell:

>view(response)

The web page will then open in your default browser.

Right-click on the element you want and click Inspect. Here, let's inspect the country names.

The browser's developer tools will help you a lot with web scraping. You can see that the title of the page is in an <h1> tag and the country names are in <a> tags inside <td> cells.
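The relevant markup looks roughly like this (simplified; the attributes on the real page will differ):

<h1>...page title...</h1>
<td><a href="...">Country name</a></td>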

Now, let's fetch the title of the page using XPath:

>title=response.xpath("//h1")

# To view the title

>title
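This prints a SelectorList of matching elements, not plain text. To extract just the heading text with XPath, select the text node and call get():

>response.xpath("//h1/text()").get()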

The same can be done with CSS selectors:

>title_css = response.css("h1::text")

# to view only the text of the title

>title_css.get()

To get all the country names (a list of strings, rather than the single string we got for the title), we need getall():

>countries=response.xpath("//td/a/text()").getall()

To do the same with CSS:

>countries_css=response.css("td a::text").getall()
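To turn these shell experiments into a working spider, edit worldometers/spiders/countries.py so that parse() yields the country names. A minimal sketch, reusing the XPath we tested above:

import scrapy

class CountriesSpider(scrapy.Spider):
    name = 'countries'
    allowed_domains = ['www.worldometers.info']
    start_urls = ['https://www.worldometers.info/world-population/population-by-country/']

    def parse(self, response):
        # each country name is the text of an <a> tag inside a <td> cell
        for country in response.xpath('//td/a/text()').getall():
            yield {'country': country}

Then run it from the project folder, saving the results to a JSON file:

>scrapy crawl countries -o countries.json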

Congratulations! You have built your first spider, learnt some Scrapy shell commands, seen how spiders crawl websites, and extracted information using XPath and CSS selectors.
