Crawling and scraping a site without getting detected or blocked can be extremely challenging. If you are running a web scraping mission to gather e-commerce intelligence for your business, keep in mind a handful of useful tips on how to prevent getting blacklisted while scraping.
In this article we will reveal some details about performing web scraping without getting blocked or blacklisted, as well as some of the scraping best practices. It will also be particularly interesting for those who would like to know more about setting up proxies for scraping and crawling missions to avoid detection. We will start by defining the difference between web scraping and crawling and then expand into a list of tips to help you avoid getting blocked while scraping websites.
Web Crawling vs Web Scraping: Definition and the Difference
Web crawling and scraping are very closely related, so it is no wonder that people often confuse them. To avoid this confusion, let’s consider some use cases where web scraping and web crawling are involved, so that you have a better picture of these two web data mining techniques.
Web scraping and its use cases
When we are talking about web scraping, we normally refer to extracting data from the target websites. This process can be automated through the use of various bots and scripts that on the one hand simplify this process and, on the other hand, make scraping as efficient as possible.
Web scraping is usually used for missions that involve fetching data from e-commerce sites (such as Amazon) when you need to extract information on prices and features of various products. It is also helpful in data extraction related to business leads, stock market data or real estate listings.
Web crawling: definition and its use cases
Web crawling, on the other hand, is all about indexing information on the web or in some repositories. It is performed by web crawlers (sometimes referred to as “spiders”) that systematically browse the World Wide Web for indexing the information.
A good example of web crawling is what search engines like Google, Yahoo or Bing perform on the web: accessing websites one by one and indexing all the information contained on them.
From the business perspective, all web missions that involve indexing rather than data extraction can be qualified as web crawling.
Tips on how to crawl a website without getting blocked or blacklisted
Below are some of the most useful tips on how to crawl a website without getting blocked and how to avoid an IP ban while scraping.
1. Using a proxy server
Web crawling and scraping can be performed much more effectively if you use proxy servers that change your IP for accessing the targeted websites. In this case you should avoid using public proxies and rather choose a reliable proxy provider to supply you with datacenter or residential proxies specifically designed for scraping or crawling.
When using reliable proxies you are far less likely to get blocked while crawling websites, and you will also be able to bypass geo restrictions that may be imposed by the site. For instance, you can use UK proxies while being located in Canada or elsewhere.
Make sure you consult the account managers of your proxy provider so that they can give you the best selection of proxies for your particular web scraping mission.
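As a minimal sketch of the idea, the snippet below routes requests through a proxy using only Python's standard library. The proxy address and credentials are placeholders invented for illustration; a real address would come from your proxy provider.

```python
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Hypothetical proxy address (assumption); replace it with a real one.
opener = build_proxy_opener("http://user:password@proxy.example.com:8080")

# A call such as opener.open("https://example.com/", timeout=10)
# would now be sent through the proxy instead of your own IP.
```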
2. Rotating your IPs
Why is it so important to rotate your IPs when you access websites for scraping? Because a website might get suspicious when a lot of requests start coming from the same IP. In this case the site might start taking action, and your IP address may get blocked as a threat to the site’s business.
Rotating proxies help solve this problem by changing the IP for each session or request you send. This way the site will see you as a different user each time you access it and is much less likely to block you during scraping.
At Privateproxy we can offer you a pool of rotating proxies. Please consult with our account managers to provide you with just the right proxies for your scraping use case.
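IP rotation can be sketched as cycling through a pool of proxies and building a fresh opener for each request. The pool below is hypothetical, and many providers handle rotation on their side behind a single endpoint, in which case none of this is needed.

```python
import itertools
import urllib.request

# Hypothetical proxy pool (assumption); real addresses come from your provider.
PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]


def rotating_openers(pool):
    """Yield (proxy, opener) pairs forever, cycling through the pool in order."""
    for proxy in itertools.cycle(pool):
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        yield proxy, urllib.request.build_opener(handler)


openers = rotating_openers(PROXY_POOL)
# Each next(openers) hands back the next proxy in the pool, wrapping around.
first_four = [next(openers)[0] for _ in range(4)]
```

Cycling in a fixed order is the simplest choice; picking a random proxy per request is an equally valid variation.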
3. Checking the robots exclusion protocol
When crawling or scraping a website, there are two basic rules to follow with respect to the robots exclusion protocol file (robots.txt), which states which parts of a website automated clients may and may not access.
One, respect these rules while scraping so that you do not harm the site. Two, schedule your scraping or crawling sessions carefully and conduct them during off-peak hours for the site. Following both rules greatly reduces the chance of getting blocked while crawling.
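A robots.txt check can be sketched with Python's built-in `urllib.robotparser`. The sample rules and the `my-crawler` user agent name are made up for illustration; against a live site you would call `set_url()` and `read()` on the parser instead of feeding it a string.

```python
from urllib.robotparser import RobotFileParser


def make_robots_checker(robots_txt, user_agent):
    """Parse robots.txt text and return a function that tells if a URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


# Illustrative rules (assumption), not taken from any real site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

allowed = make_robots_checker(SAMPLE_ROBOTS, "my-crawler")
print(allowed("https://example.com/products"))      # permitted by the sample rules
print(allowed("https://example.com/private/data"))  # disallowed by the sample rules
```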
4. Using user agents
A user agent is essentially a string sent in an HTTP request header that enables the target site to identify your operating system, browser and the type of device that you are using. Some sites analyze this information to sort out malicious access requests.
In order to avoid blocking while scraping, you need to update your user agent so that it appears legitimate and up-to-date. If your bot uses a user agent that is several years old and no longer matches any supported browser, it will raise red flags and you may be banned from the site.
If you need to set up your user agent correctly, contact our support team for suggested options.
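One simple way to apply this is to keep a small list of current browser user agent strings and pick one per session. The strings below are illustrative examples of the usual format and will go stale, so treat the list and the `browser_headers` helper as assumptions to adapt.

```python
import random

# Example desktop browser strings (illustrative; refresh them periodically).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]


def browser_headers(rng=random):
    """Return request headers with a randomly chosen user agent."""
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }


headers = browser_headers()
# Pass these headers to whatever HTTP client your scraper uses.
```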
5. Setting your fingerprints correctly
Websites are getting seriously concerned about preventing scraping by bots. To take this protection to a new level, they have turned to fingerprinting techniques, including what is known as TCP/IP (Transmission Control Protocol / Internet Protocol) fingerprinting.
Fingerprinting lets a site collect information on the operating system, browser settings and other parameters of your device to create a fingerprint of your system and identify it later.
We have written an article on how to avoid blocking while scraping when a site starts tracking fingerprints. Please study it carefully to make sure that your scraping and parsing missions run smoothly.
6. Evading honeypot traps
Honeypots are links specifically placed on a website by webmasters to detect web scrapers. These honeypots are invisible to organic users, but bots and scrapers that read the page source can still see and follow them. Such honeypots are used to detect your crawler and block it from the site.
The practice of setting up honeypots is not so common, since it requires some intricate technical development on the side of webmasters. But if your crawler gets blocked right after it starts working on a site, you should know that a honeypot could be the reason.
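A crawler can lower its exposure to honeypots by skipping links that are hidden from human visitors. The sketch below, built on the standard `html.parser` module, only catches links hidden with inline styles or the `hidden` attribute; links hidden through external CSS classes would need a rendering browser to detect, so this is a partial safeguard at best.

```python
from html.parser import HTMLParser


class VisibleLinkCollector(HTMLParser):
    """Collect href values, skipping anchors hidden inline or via the hidden attribute."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # Normalize the inline style so "display: none" and "display:none" match.
        style = (attributes.get("style") or "").replace(" ", "").lower()
        hidden = (
            "hidden" in attributes
            or "display:none" in style
            or "visibility:hidden" in style
        )
        href = attributes.get("href")
        if href and not hidden:
            self.links.append(href)


def visible_links(html):
    """Return the hrefs of links that are not obviously hidden."""
    collector = VisibleLinkCollector()
    collector.feed(html)
    return collector.links


# Made-up page fragment with one normal link and two hidden ones.
SAMPLE = (
    '<a href="/shoes">Shoes</a>'
    '<a href="/trap" style="display: none">x</a>'
    '<a hidden href="/trap2">y</a>'
)
print(visible_links(SAMPLE))
```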
7. Overriding CAPTCHAs
A Captcha is yet another tool sites use to prevent scraping and parsing. In fact, it is one of the most challenging hurdles a crawler might encounter on its mission. A Captcha is often presented as a set of images an organic user has to select from in order to enter a website. Its main function is to fend off bots from the site.
So, how do you overcome a Captcha on a website when you need to scrape it using a bot? There are various tools that help you do just that. You can use integrated solutions such as Scraper API, or dedicated Captcha solving services like 2Captcha or AntiCAPTCHA.
However, some of the Captcha solving tools are rather slow and expensive, so you might need to consider the economic viability of using them for scraping operations.
8. Changing the crawling pattern
A crawler that uses the same crawling pattern all the time is likely to be blocked automatically by the site.
If your crawler follows a single pattern, it is only a matter of time before the site detects it based on the predictability of your bot’s activities.
We suggest making intermittent changes to the behaviour of your crawler by adding some random clicking or scrolling. A good practice here is to imitate an organic user’s behaviour on the site. Visiting the site yourself to study its structure may also be helpful before engaging in a lengthy scraping session.
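One small step towards a less predictable pattern is to visit pages in a random order instead of walking them sequentially. The `randomized_order` helper and the URLs below are hypothetical; random clicking and scrolling would additionally need a browser automation tool.

```python
import random


def randomized_order(urls, rng=None):
    """Return a shuffled copy of urls, leaving the original list untouched."""
    rng = rng or random.Random()
    shuffled = list(urls)
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical page list (assumption).
pages = [f"https://example.com/catalog?page={n}" for n in range(1, 6)]
visit_order = randomized_order(pages)
```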
9. Reducing the speed of scraping
If you want to decrease the risk of getting blocked, consider reducing the speed of scraping or building in some pauses between requests.
This is important because the site may limit the number of actions an IP can perform on a certain part of the site. Reducing the speed of scraping makes your traffic look more like that of a legitimate user with no intention of breaking the site’s rules.
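A slowed-down crawl loop might look like the sketch below, which pauses for a randomized interval after every request. The two-to-five second range is an arbitrary assumption, and `fetch` stands in for whatever download function your scraper uses.

```python
import random
import time


def polite_delay(base_seconds=2.0, jitter_seconds=3.0, rng=random):
    """Return a pause length between base and base + jitter seconds."""
    return base_seconds + rng.uniform(0, jitter_seconds)


def crawl(urls, fetch, sleep=time.sleep):
    """Fetch each URL in turn, pausing for a random interval after each one."""
    results = []
    for url in urls:
        results.append(fetch(url))
        sleep(polite_delay())
    return results
```

Passing `sleep` as a parameter keeps the loop easy to test without actually waiting.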
10. Crawling during off-peak hours
Remember that crawling is normally performed at a much higher speed than usual browsing. That may significantly affect the server’s load and make you visible to the site. To avoid such situations, consider crawling and scraping during off-peak hours for the site. Normally, this is the time around midnight in the time zone of the site’s server or of its main audience. Just make sure that you account for the target site’s time zone, not your own.
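The time zone arithmetic can be sketched as follows. The off-peak window of 00:00 to 06:00 local time and the fixed UTC offset are assumptions chosen for illustration; a site's real quiet hours depend on its audience.

```python
from datetime import datetime, timedelta, timezone


def is_off_peak(now_utc, site_utc_offset_hours, start_hour=0, end_hour=6):
    """Tell whether now_utc falls inside the off-peak window in the site's local time."""
    site_tz = timezone(timedelta(hours=site_utc_offset_hours))
    local_hour = now_utc.astimezone(site_tz).hour
    return start_hour <= local_hour < end_hour


# 23:30 UTC is 00:30 for a site at UTC+1, but only 18:30 for a site at UTC-5.
moment = datetime(2024, 1, 15, 23, 30, tzinfo=timezone.utc)
print(is_off_peak(moment, 1))
print(is_off_peak(moment, -5))
```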
11. Avoiding scraping images
Unless it is absolutely necessary, you should avoid scraping images. On the one hand, images are very “heavy” objects for scrapers that may reduce the overall speed of scraping, and on the other hand, images can be protected by copyright, so scraping them carries a higher risk of copyright infringement.
13. Using a headless browser
A headless browser is a browser that functions without a GUI (Graphical User Interface). It is yet another tool that can make scraping easy and efficient. Both Firefox and Google Chrome feature headless modes to perform such operations.
14. Setting a referrer
In order to avoid problems during scraping, you should consider setting a referrer header that shows the target site where you are coming from. It certainly helps to set the header to look like you are coming from Google (e.g. "Referer": "https://www.google.com/"; note that the HTTP header name is spelled Referer). Also, consider changing the referrer to a site that reflects the desired geographical location. For instance, if you scrape a site in the UK, change the referrer to: https://www.google.co.uk/.
You can also set the header to a popular social media website like Facebook or YouTube, so that your request looks more authentic to webmasters who expect traffic from these sites.
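Setting the header comes down to adding one entry to the request headers. In this sketch the country-to-domain mapping and the `with_referer` helper are assumptions made for illustration.

```python
# Hypothetical mapping from target country to a matching Google domain.
GOOGLE_REFERERS = {
    "us": "https://www.google.com/",
    "uk": "https://www.google.co.uk/",
}


def with_referer(headers, country="us"):
    """Return a copy of headers with a Referer that matches the target country."""
    updated = dict(headers)
    updated["Referer"] = GOOGLE_REFERERS.get(country, GOOGLE_REFERERS["us"])
    return updated


uk_headers = with_referer({"Accept-Language": "en-GB,en;q=0.9"}, country="uk")
```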
15. Detecting website changes
Another reason your scraper might crash is a change in the site’s layout. It could be an unexpected change or a part of the site’s migration to a different layout or design. You need to detect such changes and fine-tune the crawler to adapt to them. Sometimes, regular checks of the crawling results to verify the quality of scraping will suffice.
Another solution is to have a unit test for one URL of each page type that verifies the page still has the expected structure. This way you can find out that a site change is breaking your crawler with just a few requests, rather than after hours of fruitless crawling.
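Such a check can be as simple as confirming that the CSS classes your scraper relies on still appear in the page. The class names below are hypothetical; in practice you would list the markers your own extraction code depends on and run the check against one sample URL per page type.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Record every CSS class name that appears in the document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def missing_markers(html, required_classes):
    """Return the required class names that no longer appear in the page."""
    collector = ClassCollector()
    collector.feed(html)
    return sorted(set(required_classes) - collector.classes)


# Made-up product page fragment; "title" is missing, as after a redesign.
page = '<div class="product"><span class="price">9.99</span></div>'
print(missing_markers(page, ["product", "price", "title"]))
```

An empty result means the layout still matches what the scraper expects; anything else is a signal to stop and update the extraction rules.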
16. Scraping out of the Google cache
Sometimes, it may be a good idea to stop web scraping the site itself and instead engage in scraping out the site’s data right from the Google cache. For this, simply add “http://webcache.googleusercontent.com/search?q=cache:” to the beginning of the URL.
This can be a good solution for crawling non-time-sensitive information from a site that is hard to scrape directly. While it may seem like a reliable solution, since you avoid all of the site’s scraping protection tools, you may face another problem: some websites (like LinkedIn, for instance) tell Google not to keep cached copies of their pages and, therefore, such scraping may be useless.
Consider Privateproxy.me as your trusted proxy partner for scraping
We at Privateproxy.me put an emphasis on supplying you with just the right tools for the job. While proxies can be a super effective tool for scraping and crawling, you need to take into account the various factors described above as well. But whenever it comes to picking a good pool of proven proxies for your mission, you can always rely on us and our account managers for support. What is especially important in supplying you with the right proxies for the job is to avoid proxies that have already been blocked during earlier scraping.
We hope that by now you know more on how to prevent blacklisting when scraping and crawling a website without getting blocked. While sometimes it will be enough just to engage proxies and use proper HTTP request headers to avoid problems while scraping, it is good to know that there are a few more tips on how to increase the efficiency of scraping just in case.
Also, do not forget to be respectful to the site’s rules and the website’s layout. After all, your mission is to collect the information and not to hinder the site’s functionality.
So, if you need to engage in a scraping mission anytime soon, please contact our account managers to advise you on the use of proxies for scraping and crawling.
Frequently Asked Questions
Please read our Documentation if you have questions that are not listed below.
Can you crawl any website?
Although some website owners make scraping a real nightmare by placing technical shields on their sites, it is fair to say that just about any website can be scraped. If the number of scraper traps, captchas and other layers of defence is enormous, you need to realize that scraping is not welcome there. Technically it is possible to overcome any of these defences, but do you really want to? If a website is super proactive against scrapers, then it’s not a good idea to scrape it anyway.
What is a crawler in web scraping?
A crawler (crawler bot, or crawler spider) is a script that browses sites automatically and creates search engine indices. It is what web search engines like Google or Bing use to index the sites of the Web. Unlike scraping, where you extract data from a target site, crawling merely indexes this information.
Is crawling a website illegal?
Crawling or scraping a website is not illegal in itself, but it can become a problem if it violates the site’s Terms of Service (ToS). The web scraper that you use should not log in to the site, so as not to violate the ToS, since such terms may forbid activities like data collection from the site. There is also a misconception that scraping all types of public data is legal. Some images, video files and articles can be considered creative works and are therefore copyrightable material, so you cannot freely set up your web scraper API to scrape and use such information. Where the Terms of Service of a site explicitly prohibit scraping, doing so can be deemed illegal.
Can websites prevent scraping?