Many people like to write crawlers with Selenium or Puppeteer (Pyppeteer) by driving a real browser, thinking that this way the website cannot detect them and they can crawl whatever data they want.

But in fact, a browser launched by Selenium has dozens of characteristics that a website can detect through JavaScript. A browser launched by Puppeteer has many detectable characteristics as well.

If you don't believe me, let's do an experiment. First, open the following URL in a normal browser: https://bot.sannysoft.com/. The page looks like this:

image.png

The page is very long, so you have to scroll down to read all of it. Most of the items are green.

Next, use Selenium to start Chrome in headed mode, then open the same page and see what happens:
image.png

Right at the top, the WebDriver item is marked red, which means the website has successfully detected that you are using an automated browser. As you scroll down, every item marked in red is a feature the site was able to detect.
image.png

The normal browser is on the left and the automated browser is on the right. Compare them item by item and you will find many differences.
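The item-by-item comparison can be pictured as a diff of two property snapshots. The following is only a minimal Python sketch: the function name `differing_properties`, the dictionary keys, and the sample values are assumptions made for illustration and were not captured from the test page. The one grounded detail is that `navigator.webdriver` reports true when a browser is under WebDriver control.

```python
def differing_properties(normal, automated):
    """Return {name: (normal_value, automated_value)} for each property that differs."""
    names = sorted(set(normal) | set(automated))
    return {
        name: (normal.get(name), automated.get(name))
        for name in names
        if normal.get(name) != automated.get(name)
    }


# Assumed snapshots for illustration only; real values depend on the browser.
normal_browser = {"webdriver": False, "languages": ["en-US", "en"]}
selenium_browser = {"webdriver": True, "languages": ["en-US", "en"]}

print(differing_properties(normal_browser, selenium_browser))
# {'webdriver': (False, True)}
```

Every key that shows up in the result is a property a website could use to tell the two browsers apart.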

And this is only headed mode. Now let's look at headless mode:

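from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument("--headless")  # run Chrome without a visible window

# Selenium 4 takes the driver path through a Service object;
# newer releases no longer accept it as a positional argument.
driver = Chrome(service=Service('./chromedriver'), options=chrome_options)
driver.get('https://bot.sannysoft.com/')
driver.save_screenshot('screenshot.png')  # capture the detection results
driver.quit()  # close the browser and stop chromedriver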

Open the screenshot and it looks like this. Don't be alarmed:
image.png

With this many features exposed, what is there left to hide? If a website wants to find you, it can do so very easily.
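To show how little effort such detection takes, here is a hedged Python sketch of a rule a server could apply after its page script reports a few browser properties. The function `headless_signals` and the keys of `props` are hypothetical names, and the sample user agent strings are shortened placeholders. The checks rest on two commonly cited signals that may not hold for every browser version: headless Chrome builds have traditionally included the token `HeadlessChrome` in the user agent, and have often reported an empty plugin list.

```python
def headless_signals(props):
    """Return the names of reported properties that suggest automation.

    `props` is a hypothetical dict of values collected by client-side
    JavaScript (for example navigator.webdriver and navigator.userAgent).
    """
    signals = []
    # navigator.webdriver is true when the browser is driven via WebDriver
    if props.get("webdriver") is True:
        signals.append("webdriver")
    # Assumption: the headless build announces itself in the user agent
    if "HeadlessChrome" in props.get("userAgent", ""):
        signals.append("userAgent")
    # Assumption: an empty plugin list is unusual for a desktop browser
    if props.get("pluginsLength") == 0:
        signals.append("pluginsLength")
    return signals


# Placeholder values for illustration, not captured output.
normal = {"webdriver": False, "userAgent": "Mozilla/5.0 Chrome/100.0", "pluginsLength": 3}
headless = {"webdriver": True, "userAgent": "Mozilla/5.0 HeadlessChrome/100.0", "pluginsLength": 0}

print(headless_signals(normal))    # []
print(headless_signals(headless))  # ['webdriver', 'userAgent', 'pluginsLength']
```

Three short conditions are enough to flag the headless session here, and a real detection page checks far more than three properties.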

Since Selenium doesn't work, what about Puppeteer or Pyppeteer? Let's run the same experiment with Pyppeteer: start it directly in headless mode and take a screenshot. The result is as follows:

image.png

It's no different from Selenium.

So, do you still have the nerve to keep writing crawlers with these two tools? Crawling small websites with no security awareness is fine. Crawl companies with strong security teams and legal teams, ...
