Web scraping tools are software: bots programmed to sift through a site's data and extract information. Many types of bot are in use, and most of them can be fully customized to:
- Recognize unique HTML site structures.
- Extract and transform content.
- Store data.
- Extract data from APIs.
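To make these steps concrete, here is a minimal sketch using only the Python standard library. It assumes, purely for illustration, that prices sit in elements with a `price` class; the sample markup and the `PriceParser` and `extract_prices` names are hypothetical and not taken from any real scraping tool.

```python
import json
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Recognize structure: collect text inside elements whose class list contains 'price'."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "price" in classes:
            self._in_price = True

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current price element.
        self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def extract_prices(html):
    """Extract the price strings and transform them into floats."""
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return [float(p.replace("$", "").replace(",", "")) for p in parser.prices]


# Hypothetical page fragment standing in for a downloaded product listing.
SAMPLE_HTML = (
    '<ul><li><span class="price">$1,299.00</span></li>'
    '<li><span class="price">$849.50</span></li></ul>'
)

prices = extract_prices(SAMPLE_HTML)
stored = json.dumps({"prices": prices})  # Store: serialize for later use
```

A legitimate price-comparison bot and a malicious scraper can both look like this, which is why the two are hard to tell apart from the traffic alone.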
Because all bots access site data in the same way, it can be difficult to tell legitimate bots from malicious ones. This guide covers examples of malicious web scraping and ways to protect against it.
Examples of web scraping
Web scraping is considered malicious when data is extracted without the permission of the website owners. The two most common use cases are price scraping and content theft.
1.- Price scraping
Price scraping is one form of web scraping. The attacker typically uses a botnet to launch scraping bots that inspect competitors' databases. The goal is to access pricing information, undercut rivals, and boost sales. For attackers, successful price scraping can make their offers stand out on comparison websites.
Attacks occur frequently in industries where product prices are easily comparable. Because price plays an important role in buying decisions, the victims of price scraping include travel agencies, online electronics sellers, and similar businesses.
For example, smartphone retailers, which sell similar products at relatively high prices, are frequent targets. To remain competitive, they have to sell their products at the best possible price, since customers tend to choose the cheapest option.
To gain an advantage, a seller can use a bot to continually scrape its competitors' websites and almost instantly update its own prices accordingly.
2.- Content scraping
Content scraping is another form of web scraping: the large-scale theft of content from a particular site. Typical targets include online product catalogs and websites that rely on digital content to drive their business. For these companies, a content scraping attack can be devastating.
Protection against web scraping
- Take legal action
The easiest way to discourage scraping is to take legal action: report the attack in court and prove that scraping your site is not permitted.
You can even sue scrapers if you have explicitly forbidden scraping in your terms of service. For example, LinkedIn sued a group of scrapers last year, saying that extracting user data through automated requests amounts to hacking.
- Block incoming attack requests
Even if you have published a legal notice that prohibits scraping of your services, a determined attacker may go ahead anyway. You can identify the IP addresses involved and stop their requests from reaching your service by filtering them at the firewall.
Although this is a manual process, modern cloud providers give you access to tools that block possible attacks. For example, if you host your services on Amazon Web Services, AWS Shield can help protect your server from attacks.
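One way to find candidate IP addresses is to look for clients that send far more requests than a human visitor would. The sketch below is only an illustration: it assumes you already have a list of `(ip, timestamp)` pairs from your access logs, and the function name and thresholds are hypothetical values you would tune for your own traffic.

```python
from collections import defaultdict, deque


def find_suspicious_ips(requests, max_requests=5, window_seconds=10.0):
    """Return the IPs that sent more than max_requests within any sliding window.

    requests: iterable of (ip, timestamp_in_seconds) pairs (assumed log format).
    """
    recent = defaultdict(deque)  # per-IP timestamps still inside the window
    flagged = set()
    for ip, ts in sorted(requests, key=lambda r: r[1]):
        window = recent[ip]
        window.append(ts)
        # Drop timestamps that have fallen out of the window.
        while ts - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_requests:
            flagged.add(ip)
    return flagged


# Hypothetical log: one client hammering the site, one browsing slowly.
sample_log = [("203.0.113.7", i * 0.5) for i in range(20)]
sample_log += [("198.51.100.4", i * 30.0) for i in range(5)]
suspicious = find_suspicious_ips(sample_log)
```

The flagged addresses are only candidates; it is worth reviewing them before adding firewall rules, since shared networks and legitimate crawlers can also produce bursts of traffic.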
- Use cross-site request forgery (CSRF) tokens
Using CSRF tokens in your application makes it harder for automated tools to send arbitrary requests to your application's URLs. The token can be included as a hidden form field.
To get around a token, a scraper has to load and parse the markup and find the correct token before bundling it with the request. This requires programming skills and access to professional tools.
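As a rough illustration of the idea, the sketch below derives a token from the visitor's session with an HMAC, embeds it in a hidden field, and checks it when the form comes back. The secret key handling, the `csrf_token` field name, and the function names are assumptions made for this example; in practice most web frameworks ship their own CSRF protection, which is preferable to a hand-rolled version.

```python
import hashlib
import hmac
import secrets

# Hypothetical server-side secret; a real application would load it from configuration.
SECRET_KEY = secrets.token_bytes(32)


def issue_csrf_token(session_id):
    """Derive a token bound to the given session."""
    return hmac.new(SECRET_KEY, session_id.encode("utf-8"), hashlib.sha256).hexdigest()


def hidden_field(session_id):
    """Render the token as a hidden form field."""
    return '<input type="hidden" name="csrf_token" value="{}">'.format(
        issue_csrf_token(session_id)
    )


def is_valid_csrf_token(session_id, submitted):
    """Check the submitted token with a constant-time comparison."""
    expected = issue_csrf_token(session_id).encode("utf-8")
    return hmac.compare_digest(expected, (submitted or "").encode("utf-8"))


token = issue_csrf_token("session-abc")
```

A request that arrives without the token, or with a token issued for a different session, is rejected, so a bot cannot simply post to the URL without first fetching and parsing the form.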
- Use the .htaccess file to prevent scraping
.htaccess is a configuration file for the Apache web server, and it can be modified to prevent scrapers from accessing your data. The first step is to identify the scrapers, which can be done through Google Webmasters.
Once you have identified them, you can use several techniques to stop the scraping by changing the configuration file. In general, .htaccess support is not enabled by default, so you must enable it; only then will the .htaccess files you place in your directories be interpreted.
- Prevent hotlinking
When your content is scraped, the inline links to images and other files are copied directly to the attacker’s site. When the same content is displayed on the attacker’s site, those resources are still loaded directly from your website.
This practice of displaying a resource hosted on your server on a different website is called hotlinking. Once you prevent hotlinking, an image embedded this way on a different site is no longer served from your server.
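Hotlink protection usually comes down to checking the `Referer` header of requests for images and other static files. The following sketch shows that check in isolation; the `ALLOWED_HOSTS` set and the `is_hotlink` function are hypothetical, and the choice to let requests without a referer through is an assumption, since direct visits and some privacy tools send none.

```python
from urllib.parse import urlparse

# Hypothetical list of hosts that are allowed to embed your resources.
ALLOWED_HOSTS = {"example.com", "www.example.com"}


def is_hotlink(referer):
    """Return True when a resource request comes from a page on another site."""
    if not referer:
        # Assumption: requests with no Referer header are allowed.
        return False
    host = urlparse(referer).hostname
    return host not in ALLOWED_HOSTS


own_page = is_hotlink("https://www.example.com/catalog")
foreign_page = is_hotlink("https://scraper.example.net/copied-page")
```

A server applying this rule would refuse, or swap out, any image for which `is_hotlink` returns True. The header can be forged, so this discourages casual copying rather than stopping a determined attacker.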
- Blacklist specific IP addresses
If you have identified the IP addresses or IP address patterns used for scraping, you can simply block them through your .htaccess file.
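The same check can be expressed at the application level. This is a sketch rather than a replacement for server configuration: it uses Python's standard `ipaddress` module, the blocked ranges are documentation addresses chosen for illustration, and treating malformed addresses as blocked is an assumption of the example.

```python
import ipaddress

# Hypothetical blacklist: one whole range and one single address.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.42/32"),
]


def is_blocked(client_ip):
    """Return True if the client address falls inside any blacklisted network."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        # Assumption: a malformed address is treated as blocked.
        return True
    return any(address in network for network in BLOCKED_NETWORKS)


blocked_example = is_blocked("203.0.113.9")
allowed_example = is_blocked("192.0.2.1")
```

Expressing patterns as networks in CIDR notation covers both single addresses and whole ranges with one check.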