Parsing a Google g-inner-card with Beautiful Soup

Asked: 2018-02-13 09:15:59

Tags: python html css python-3.x beautifulsoup

I am trying to parse the g-inner-card (class = "_KBh"), but for some reason the selector returns an empty result:

linkElems = soup.select('._KBh a')

print(linkElems)

This prints an empty list [].

import webbrowser, sys, pyperclip, requests, bs4

# Use the command line arguments as the search term, otherwise fall back to the clipboard.
if len(sys.argv) > 1:
    term = ' '.join(sys.argv[1:])
else:
    term = pyperclip.paste()

res = requests.get("https://www.google.com/search?q=" + term)
try:
    res.raise_for_status()
except Exception as ex:
    print('There was a problem: %s' % (ex), '\nSorry!!')

# Parse the results page and collect the links inside the g-inner-card elements.
soup = bs4.BeautifulSoup(res.text, "html.parser")
linkElems = soup.select('._KBh a')
print(linkElems)

# Open up to 3 of the matched results in the browser.
numOpen = min(3, len(linkElems))
for i in range(numOpen):
    print(linkElems[i].get('href'))
    webbrowser.open('https://google.com/' + linkElems[i].get('href'))

Given a command line argument (the term to search for), this snippet tries to open up to 3 Google search results in 3 separate browser windows. It is specifically meant to show the results from the Google inner card (g-inner-card).

1 Answer:

Answer 0 (score: 0)

If you print res.text, you can see that you are not getting the full/correct data from the page. This happens because Google blocks Python scripts.

To fix this, you can pass a User-Agent header so the script looks like a real browser.
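(As an aside, not from the original answer: you can check what requests sends by default; the exact version string will vary with your installed requests version.)

>>> import requests
>>> requests.utils.default_headers()['User-Agent']
'python-requests/2.18.4'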

Result with the default User-Agent:

>>> URL = 'https://www.google.co.in/search?q=federer'
>>> res = requests.get(URL)
>>> '_KBh' in res.text
False

After adding a custom User-Agent:

>>> headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
>>> res = requests.get(URL, headers=headers)
>>> '_KBh' in res.text
True

Adding the headers to your code produces the following output (the first 3 links you are looking for):

https://www.express.co.uk/sport/tennis/918251/Roger-Federer-Felix-Auger-Aliassime-practice
https://sports.yahoo.com/breaks-lighter-schedules-help-players-improve-says-federer-092343458--ten.html
http://www.news18.com/news/sports/rafael-nadal-stays-atop-atp-rankings-roger-federer-aims-to-overtake-1658665.html
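
For completeness, here is a minimal sketch of how the headers dict could be folded into the original script. It reuses the example User-Agent string from above (any recent browser User-Agent should work) and leaves the rest of the question's code unchanged apart from the headers= argument:

import webbrowser, sys, pyperclip, requests, bs4

# Example browser User-Agent; any recent browser UA string should work.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

if len(sys.argv) > 1:
    term = ' '.join(sys.argv[1:])
else:
    term = pyperclip.paste()

# Passing the headers makes Google return the full results page.
res = requests.get("https://www.google.com/search?q=" + term, headers=headers)
res.raise_for_status()

soup = bs4.BeautifulSoup(res.text, "html.parser")
linkElems = soup.select('._KBh a')

# Open up to 3 of the matched links in the browser.
for elem in linkElems[:3]:
    print(elem.get('href'))
    webbrowser.open('https://google.com/' + elem.get('href'))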