Question

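For my project, I need Google search results. I am using Python requests and BeautifulSoup. I get results, but they differ from what I see in my browser, and I need exactly what the browser displays. I also tried urllib, but its output differs from the browser results too. Can anyone help me solve this?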

import requests
import bs4

link = 'https://www.google.com/'
# Browser-like headers copied from Firefox so the request resembles a real browser.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
}
# Fetch the raw HTML; no JavaScript is executed at this point.
response = requests.get(link, headers=headers)
# Parse the returned markup (the 'lxml' parser requires the lxml package).
soup = bs4.BeautifulSoup(response.text, 'lxml')

Answer 1

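Most websites run JavaScript to update the page after it loads, so the HTML returned by a plain HTTP request can differ from what the browser ends up displaying. Some sites also try to detect crawlers.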

Use a headless browser for crawling instead.
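As a rough illustration of the headless-browser approach, here is a minimal sketch. It assumes Selenium 4 with a recent Chrome installed, which the answer does not specify; any headless browser would serve. The helper names `build_search_url` and `fetch_rendered_html` are hypothetical and only for illustration.

```python
from urllib.parse import urlencode


def build_search_url(query):
    """Return a Google search URL for the given query string."""
    return "https://www.google.com/search?" + urlencode({"q": query})


def fetch_rendered_html(url):
    """Load url in headless Chrome and return the page source after loading."""
    # Assumption: Selenium 4 and Chrome are installed. Imported here so the
    # module can still be loaded on machines without Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)  # the browser executes the page's JavaScript
        return driver.page_source  # HTML of the document as the browser holds it
    finally:
        driver.quit()  # always shut the browser down
```

The returned string can be passed to `bs4.BeautifulSoup(html, 'lxml')` in place of `response.text`, for example with `html = fetch_rendered_html(build_search_url("web scraping"))`. Sites that detect crawlers may still serve different content to an automated browser.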

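As mentioned in the comments, some sites also use cookies. For example, Google search results are as good as they are largely because they are customized for the user.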
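If cookies are part of the difference, one option is to keep them between requests so that later requests carry whatever the site set earlier. The following is a minimal sketch using only the standard library (the question mentions urllib); the helper name `make_cookie_opener` is hypothetical. It will not reproduce the personalization tied to a logged-in browser profile.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener(user_agent):
    """Build a urllib opener that stores and resends cookies automatically."""
    jar = http.cookiejar.CookieJar()  # in-memory cookie storage
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Send a browser-like User-Agent with every request made by this opener.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar
```

Calling `opener.open(url)` repeatedly then reuses the cookies collected in `jar`. With the requests library, a `requests.Session()` object persists cookies across calls in a similar way.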
