Question

我目前正在学习网页抓取。今天，我尝试通过网络进行google.com搜索。当我尝试使用python请求库发出get请求时，它不能为我提供完整的响应。

例如，如果我调用此URL https://www.google.com/search?q=tea+meaning来获取tea单词的含义，则在结果响应中，它仅显示名词内容，而不显示动词内容。

from bs4 import BeautifulSoup as bs
import requests as req

headers_Get = {
    'Host': 'www.google.com',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/74.0.3729.169 Chrome/74.0.3729.169 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}

url = "https://www.google.com/search?q=tea+meaning"
response = req.get(url, headers=headers_Get)

data = response.text
soup = bs(data, "html.parser")

这里的问题是这种汤。它不包含动词内容。为什么会这样？

谢谢。

Answer 1

问题在于Google没有将搜索结果作为一页发回。在浏览器中，搜索结果中您看到的大多数内容都是单独的AJAX请求。您可能会在初始请求中得到一些部分数据，但不一定与常规浏览器中看到的数据匹配。

要了解“ Beautiful Soup and Requests”的内容，请尝试在关闭JavaScript的浏览器中打开链接。

Answer 2

您应该选择要打印的<div>。您将获得整个页面。

import requests
from bs4 import BeautifulSoup
url = "https://www.google.com/search?q=tea+meaning"
header={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36'}
page=requests.get(url,headers=header)

soup=BeautifulSoup(page.content,'html.parser')
result = soup.select_one('div.vmod').get_text()
print(result)

此代码显示所有内容，包括动词。嘿，如果您想获取含义，可以在https://developer.oxforddictionaries.com/尝试使用它

Python请求无法提供完整的响应

2 个答案: