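Unable to find text in a page using Python BS4

Time: 2019-08-15 00:06:05

Tags: python beautifulsoup python-requests

I'm trying to learn how to use BS4, but I ran into this problem. I'm trying to find the text on the Google search results page that shows the number of search results, but I can't find the text "results" in either html_page or the soup HTML parser. Here is the code: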

from bs4 import BeautifulSoup
import requests

url = 'https://www.google.com/search?q=stack'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')

print(b'results' in html_page)
print('results' in soup)

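Both prints return False. What am I doing wrong, and how can I fix it?

Edit:

It turned out that the language of the web page was the problem; adding &hl=en to the URL almost fixes it.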

url = 'https://www.google.com/search?q=stack&hl=en'

The first print is now True, but the second is still False.

1 answer:

Answer 0: (score: 1)

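When the requests library returns the response as response.content, it is in raw form (bytes), whereas res.text is the decoded string. So, to answer the second question, replace res.content with res.text. Note also that 'results' in soup only checks the direct children of the soup object, not the text nested inside them, so run the test against soup.get_text() instead:

from bs4 import BeautifulSoup
import requests

url = 'https://www.google.com/search?q=stack'
res = requests.get(url)
html_page = res.text
soup = BeautifulSoup(html_page, 'html.parser')

# Prints True when the text of the returned page contains 'results'
# (for example, when Google serves the English results page).
print('results' in soup.get_text())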
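
The difference between these membership tests can be checked offline. The sketch below uses a made-up HTML snippet (it is not real Google markup, and the `stats` id is only an assumption for illustration) to show that bytes and str each need a pattern of the same type, that `in soup` only looks at the soup's direct children, and that searching `soup.get_text()` or the text nodes finds the nested text.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page (not real Google markup).
html_bytes = b"<html><body><div id='stats'>About 1,000 results</div></body></html>"
html_text = html_bytes.decode("utf-8")

# Raw bytes need a bytes pattern; decoded text needs a str pattern.
bytes_hit = b"results" in html_bytes
text_hit = "results" in html_text

soup = BeautifulSoup(html_text, "html.parser")

# Membership on the soup only inspects soup.contents (its direct children),
# which here is a single <html> tag, so the nested text is not found.
direct_child_hit = "results" in soup

# Searching the extracted text, or the individual text nodes, does find it.
page_text_hit = "results" in soup.get_text()
node = soup.find(string=re.compile("results"))

print(bytes_hit, text_hit, direct_child_hit, page_text_hit)
print(node)
```

BeautifulSoup accepts either bytes or str as input, so the choice between res.content and res.text matters for the plain `in` checks on the page source rather than for the parsed soup.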

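Keep in mind that Google is usually very active in dealing with scrapers. To avoid being blocked or asked for verification, you can add a User-Agent header to simulate a browser: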

# This is a standard user-agent of Chrome browser running on Windows 10 
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' } 

Example:

from bs4 import BeautifulSoup
import requests 
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
resp = requests.get('https://www.amazon.com', headers=headers).text 
soup = BeautifulSoup(resp, 'html.parser') 
# ... <your code here>

In addition, you can add another set of headers to pass as a legitimate browser. Add a few more headers, like this:

headers = { 
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36', 
'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 
'Accept-Language' : 'en-US,en;q=0.5',
'Accept-Encoding' : 'gzip', 
'DNT' : '1', # Do Not Track Request Header 
'Connection' : 'close'
}
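
As a rough sketch of how the &hl=en parameter from the question and the headers from this answer might be combined, the snippet below prepares a request without sending it, so the final URL and headers can be inspected first. The helper name `build_search_request` and the `BROWSER_HEADERS` constant are hypothetical, introduced only for this illustration; `requests.Request(...).prepare()` is the library's own way to build a request without performing it.

```python
import requests

# Header values copied from the answer above (a subset, for brevity).
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.5",
}


def build_search_request(query, lang="en"):
    """Hypothetical helper: prepare (but do not send) a Google search request."""
    request = requests.Request(
        "GET",
        "https://www.google.com/search",
        # requests URL-encodes these and appends them as ?q=...&hl=...
        params={"q": query, "hl": lang},
        headers=BROWSER_HEADERS,
    )
    return request.prepare()


prepared = build_search_request("stack")
print(prepared.url)
print(prepared.headers["User-Agent"])
```

To actually perform the request, the prepared object can be passed to `requests.Session().send(prepared, timeout=10)`, and the response's `text` can then be handed to BeautifulSoup as in the earlier examples. Whether Google serves the normal results page to such a request is not guaranteed.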