How do I crawl a website without getting blocked or misled (cloaked)?

Why should I care?
When a target website detects crawlers coming from a proxy (datacenter) IP, it typically does one of the following:

  • Blocks the IP, or
  • Serves the IP deliberately misleading information (cloaking), or
  • Throttles the response rate

How does the target website identify my crawling activity?
Target websites log the IPs of whoever visits them and analyze the activity of those IPs. Assuming you are using a traditional datacenter proxy, the target website can:

  1. Identify that the activity from a single IP (the rate of requests) is much greater than what a human could generate in the same timeframe
  2. Identify that the IP address appears on a list of known proxy servers, which these target websites have access to
  3. Identify that the IPs belong to the same subnet block range
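
To make checks 1 and 3 concrete, here is a minimal sketch of how a site might run them over its access log. It is an illustration only, not how any particular website works: the function name flag_suspicious, the thresholds, and the choice of a /24 block for IPv4 are all assumptions. Check 2 would simply be a lookup against whatever proxy list the site has, so it is left out.

```python
from collections import Counter, defaultdict
import ipaddress


def flag_suspicious(log, window_seconds=60, max_requests=120, max_ips_per_subnet=3):
    """Return (too_fast_ips, crowded_subnets) for a log of (timestamp, ip) pairs.

    All thresholds are hypothetical; real sites tune their own.
    """
    log = list(log)  # the log is scanned twice below

    # Check 1: count requests per IP inside each fixed time window.
    per_window = Counter()
    for ts, ip in log:
        per_window[(ip, int(ts // window_seconds))] += 1
    too_fast = {ip for (ip, _), n in per_window.items() if n > max_requests}

    # Check 3: group distinct IPv4 addresses by their /24 block (assumed size).
    subnets = defaultdict(set)
    for _, ip in log:
        net = ipaddress.ip_network(f"{ip}/24", strict=False)
        subnets[net].add(ip)
    crowded = {str(net) for net, ips in subnets.items() if len(ips) > max_ips_per_subnet}

    return too_fast, crowded


# Example: one IP sending 200 requests in 20 seconds, plus five
# neighbouring IPs from the same block (documentation address ranges).
example_log = [(i * 0.1, "203.0.113.5") for i in range(200)]
example_log += [(1.0, f"198.51.100.{i}") for i in range(1, 6)]
fast, crowded = flag_suspicious(example_log)
```

In this example the single busy IP trips the rate check, and the five neighbouring addresses trip the subnet check even though each of them sent only one request.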

How can I avoid being detected?

  1. To avoid being detected by the number of requests per IP, you can reduce the number of requests per second. However, this will reduce your crawling speed
  2. To prevent the target from identifying your IP as coming from a proxy server, you must rotate your requests through residential IPs. You should rotate through enough IPs that the target website cannot single out your activity
  3. Residential IPs are spread across many networks, so they do not share a single subnet block range for the target to detect
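
The first two points can be combined: rotating through a pool of IPs keeps the per-IP request rate low without slowing the crawl as a whole. Below is a minimal sketch of that idea under stated assumptions. The class name RotatingThrottle, the round-robin order, and the per-IP interval are all hypothetical choices, and the proxy addresses are placeholders for whatever your provider gives you.

```python
import itertools
import time


class RotatingThrottle:
    """Hand out proxies round-robin, pausing so each one rests between uses."""

    def __init__(self, proxies, min_interval_per_ip, clock=time.monotonic, sleep=time.sleep):
        self._cycle = itertools.cycle(proxies)
        self._min_interval = min_interval_per_ip  # seconds between uses of one IP
        self._last_used = {}
        self._clock = clock    # injectable so the logic can be tested without waiting
        self._sleep = sleep

    def next_proxy(self):
        proxy = next(self._cycle)
        last = self._last_used.get(proxy)
        if last is not None:
            # Wait only if this proxy was used too recently.
            wait = self._min_interval - (self._clock() - last)
            if wait > 0:
                self._sleep(wait)
        self._last_used[proxy] = self._clock()
        return proxy


# Placeholder proxy addresses; substitute the ones from your provider.
pool = ["http://proxy-a.example:8000", "http://proxy-b.example:8000"]
throttle = RotatingThrottle(pool, min_interval_per_ip=0.0)
first_four = [throttle.next_proxy() for _ in range(4)]
```

Each call to next_proxy() returns the address to use for the next request. With a larger pool, every individual IP is used less often, so the same overall crawl speed produces a lower request rate per IP. If you use the requests library, the returned address can be passed as requests.get(url, proxies={"http": proxy, "https": proxy}).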

With a traditional proxy solution, it's only a matter of time before the target website identifies your crawling activity and either blocks you or serves you misleading information.
