Build a Web Crawler in Python
Web Crawling 101
A web crawler, also known as a web spider, is a program that scans the World Wide Web and extracts information from websites. The crawl begins by giving the “spider” a list of URLs to visit. The spider then discovers new pages by following links: when it reaches a page, it analyzes it, identifies every new link the page contains, and adds those links to its original list of URLs to visit. This process continues recursively as long as new resources are found. Calling the program a “spider” that “crawls” gives us a mental image that eases our understanding of the program’s function: a web spider crawls from page to page, continuously discovering new links that define its route, and extracts whatever information you specify in the program.
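The crawl loop described above can be sketched in plain Python. This is only an illustration of the idea (the function and class names are mine, and the `fetch` callable is injected so you can plug in any HTTP client); the crawler we build below delegates all of this to Scrapy:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl: visit a URL, queue its new links, repeat.

    `fetch` is any callable mapping a URL to an HTML string
    (e.g. one built on urllib.request or the requests library).
    """
    queue, seen = deque([start_url]), {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(fetch(url))
        for link in parser.links:
            absolute = urljoin(url, link)   # resolve relative links
            if absolute not in seen:        # skip pages already queued
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

The `seen` set is what keeps the recursion from looping forever on pages that link back to each other.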
The best-known crawler is Googlebot, Google’s famous crawler that discovers new and updated webpages to be added to the Google index. Once added, these pages are stored in a database and organized so they can be retrieved quickly when a relevant search query is entered.
Web crawlers have many applications across a variety of fields. In network security, they’re used to assess and understand a target web application: the spider crawls the web app and returns relevant information to the pentester. In digital marketing, specifically technical SEO, crawlers are used to assess website structure, relevant HTML tags, and broken hyperlinks. For the data scientist or curious tinkerer, a crawler puts endless website data at your disposal for analysis or, well... whatever you want.
Before We Begin: Scrapy vs. BeautifulSoup
We’re going to write the crawler in Python. Here we have two main options:
- Scrapy is a web scraper framework. You give Scrapy a root URL to crawl, and then specify constraints about the crawl. It is a complete framework for web scraping and crawling.
- BeautifulSoup is a parsing library. Unlike Scrapy, it only parses the contents of a page you’ve already fetched and then stops. It doesn’t crawl on its own unless you wrap it in a loop with explicit criteria.
Scrapy is a complete framework built specifically to crawl websites, whereas BeautifulSoup is a parsing library that offers much of the same extraction functionality. For ease of use, let’s use Scrapy.
Step 1: Install Scrapy
Before installing Scrapy, make sure you have a recent version of Python installed. On macOS and most Linux distributions, Python ships with the operating system. If not, please visit www.python.org for installation instructions.
Now that Python’s installed and updated, open up your terminal and enter:
pip install scrapy
If you hit any permission errors, add sudo before the command, like so:
sudo pip install scrapy
Enter your password to continue as the root user, and the installation should proceed smoothly.
Step 2: Begin Project
Now change into the directory you’d like to start the project in and run the following command:
scrapy startproject your_project_name
This command will create the crawler’s basic files. You should see a layout like this:
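The generated layout looks roughly like the following (the exact set of files varies slightly between Scrapy versions):

```
your_project_name/
    scrapy.cfg            # deploy/configuration file
    your_project_name/    # the project's Python package
        __init__.py
        items.py          # item definitions (data containers)
        pipelines.py      # post-processing of scraped items
        settings.py       # project settings
        spiders/          # where our spider file will live
            __init__.py
```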
It looks a bit intimidating, but luckily we only need to configure two files in order to create an operational web spider: items.py and a file we’ll create in the “spiders” subdirectory.
Open items.py: Scrapy has generated an item class (here, PatbotItem) containing only a placeholder pass statement. PatbotItem is the class that will store the data we extract from the website. For simplicity’s sake, we will only retrieve the title of a post. Since we aren’t storing anything else in this example, delete the pass statement and add the following:
The spider file specifies the program’s exact crawling instructions. Open the “spiders” folder in the crawler’s root directory and create a file. Mine will be spiderman.py; you can name the file whatever you’d like. The imports consist of:
- Spider is a basic crawling class.
- PatbotItem is the class we just created in items.py that will store the data we extract.
- Request is the class that enables us to crawl pages recursively.
- Selector will help us extract data using XPath.
Next comes the heart of our crawler: the spider class. It’s a class derived from BaseSpider (called Spider in current Scrapy releases) which has three fields:
- name: the name of your spider, which is used to launch it
- allowed_domains: a list of domains the crawler is allowed to visit.
- start_urls: a list of URLs, which will be the roots of later crawls.
Now we need to drop the parser in:
- parse(self, response) is the main method, invoked by BaseSpider, and contains the main logic of our crawler.
And lastly, define the variable that will hold the results of the XPath query.
Congratulations, you’ve successfully built a web crawler. In order to run it, open up your terminal, change directories to the crawler’s folder, and drop in
scrapy crawl your_spider_name
Note that the argument is the name field you gave the spider, not the project name.
Eh, the data returned is a bit unorganized if you ask me. In order to export the data returned from the crawl into a CSV file, run the following command instead:
scrapy crawl your_spider_name -o newdata.csv -t csv
(In newer Scrapy versions the -t csv flag is redundant; the .csv extension sets the format.)
“Wait... But It Only Indexed the Root URL?”
Good eye. The crawler we just built is not recursive: it will only investigate the root page we gave it (https://www.packtpub.com/) without adding new links to its index.
Now let’s add the following to the original code in the spiderman.py file:
All Together Now: The Recursive Web Crawler
The items.py file, where we created the class that stores the extracted data:
And the file we created in the “spiders” subdirectory that determines the crawler’s rules: