I have to admit that I did things backwards: I built my first web scraping project without asking myself what I should consider before jumping into it.
This article covers what to know before working on a web scraping project, based on my own experience and on some additional research on the topic.
Is it legal to scrape websites?
The first question I asked myself was: is this thing called web scraping that I'm doing legal? After some research, I found that scraping websites where no account is required for access, or where the user has made the data public, is generally not problematic.
So, as a rule of thumb, you can scrape data from websites if the data is public.
Look at the robots.txt file
Websites use the robots.txt file to define whether and how a bot may crawl or scrape the site. The file is accessible by appending /robots.txt to the website's root URL, for example example.com/robots.txt.
When working on a web scraping project, check this file to find out whether the website allows scraping and which paths are off limits.
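As a rough illustration of that check, here is a minimal sketch using Python's standard urllib.robotparser module. The function name can_scrape, the user agent "my-scraper" and the sample rules are made up for this example. A real robots.txt will look different from site to site.

```python
from urllib.robotparser import RobotFileParser


def can_scrape(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, only for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    print(can_scrape(SAMPLE_ROBOTS, "my-scraper", "https://example.com/articles/1"))
    print(can_scrape(SAMPLE_ROBOTS, "my-scraper", "https://example.com/private/data"))
```

To check a live website instead of a string, the same parser can download the file itself: call parser.set_url("https://example.com/robots.txt") followed by parser.read() before calling can_fetch.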
Consider the quantity and the speed of extraction
Don't scrape excessively. It is essential to leave a time interval between two extractions. Firing requests as fast as a script can, without any pause, can cause the website to block you.
DISCLAIMER: You may still get blocked if the server detects that your script is trying to scrape large amounts of data.
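One simple way to leave an interval between extractions is to sleep for a short, slightly random time before each request after the first. The sketch below assumes a 1 to 3 second pause, which is an arbitrary choice and not a value any website guarantees to accept. The names scrape_politely and fetch are hypothetical, and fetch stands for whatever function you use to download a page.

```python
import random
import time
from typing import Callable, Iterable, List


def scrape_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    min_delay: float = 1.0,  # assumed lower bound, in seconds
    max_delay: float = 3.0,  # assumed upper bound, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each URL in turn, pausing a random interval between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results


if __name__ == "__main__":
    # Stand-in fetch function so the example runs without network access.
    pages = scrape_politely(
        ["https://example.com/a", "https://example.com/b"],
        fetch=lambda url: f"content of {url}",
        min_delay=0.1,
        max_delay=0.2,
    )
    print(pages)
```

In a real project, fetch could wrap urllib.request.urlopen or a similar HTTP client. The pause only lowers the load you put on the server. As noted above, it does not guarantee that you won't be blocked.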
These are some points to consider before working on a web scraping project.