Question

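I'm working through the first web scraping tutorial example in the book "Automate the Boring Stuff with Python". The project consists of typing a search term on the command line and having my computer automatically open a browser with all the top search results in new tabs.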

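It mentions that I need to find the

<h3 class="r">

element in the page source, which holds the link for each search result. The r class is used only for search result links.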

The problem is that I can't find it anywhere, even with Chrome DevTools. Any help on where to find it would be much appreciated.

Note: for reference, this is the complete program as it appears in the book.

# lucky.py - Opens several Google search results.

import requests, sys, webbrowser, bs4

print('Googling..')  # display text while downloading the Google page
res = requests.get('http://google.com/search?q=' + ' '.join(sys.argv[1:]))
res.raise_for_status()

# Retrieve top search result links.
soup = bs4.BeautifulSoup(res.text)

# Open a browser tab for each result.
linkElems = soup.select('.r a')
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    webbrowser.open('http://google.com' + linkElems[i].get('href'))

Answer 1

This should work for you:

>>> import requests
>>> from lxml import html
>>> r = requests.get("https://www.google.co.uk/search?q=how+to+do+web+scraping&num=10")
>>> source = html.fromstring((r.text).encode('utf-8'))
>>> links = source.xpath('//h3[@class="r"]//a//@href')
>>> for link in links:
        print link.replace("/url?q=","").split("&sa=")[0]

Output:

http://newcoder.io/scrape/intro/
https://www.analyticsvidhya.com/blog/2015/10/beginner-guide-web-scraping-beautiful-soup-python/
http://docs.python-guide.org/en/latest/scenarios/scrape/
http://webscraper.io/
https://blog.hartleybrody.com/web-scraping/
https://first-web-scraper.readthedocs.io/
https://www.youtube.com/watch%3Fv%3DE7wB__M9fdw
http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/
http://analystcave.com/web-scraping-tutorial/
https://en.wikipedia.org/wiki/Web_scraping
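
One detail in the output: the YouTube link still contains %3F and %3D, because replace and split only trim the string and do not decode it. A minimal Python 3 sketch of an alternative is to read the q parameter with the standard library's urllib.parse, which also percent-decodes the value. The helper name clean_google_href and the sample hrefs are hypothetical, and the sketch assumes the hrefs have the '/url?q=<target>&sa=...' shape handled above.

```python
from urllib.parse import urlsplit, parse_qs


def clean_google_href(href):
    """Return the target URL from a '/url?q=...' redirect href.

    Hrefs that do not look like such a redirect are returned unchanged.
    """
    parts = urlsplit(href)
    if parts.path == "/url":
        # parse_qs splits the query string and percent-decodes each value.
        targets = parse_qs(parts.query).get("q")
        if targets:
            return targets[0]
    return href


# Hypothetical hrefs in the '/url?q=<target>&sa=...' shape (assumed, not fetched).
sample_hrefs = [
    "/url?q=http://newcoder.io/scrape/intro/&sa=U&ved=0ahUKEw",
    "/url?q=https://www.youtube.com/watch%3Fv%3DE7wB__M9fdw&sa=U&ved=0ahUKEw",
]
for href in sample_hrefs:
    print(clean_google_href(href))
```

With this approach the YouTube link should come out with a plain ?v= query instead of the encoded form.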

Note: I'm using Python 2.7.X in the lxml session. For Python 3.X you only need to wrap the print argument in parentheses, like this: print(link.replace("/url?q=","").split("&sa=")[0])
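
If you want to check the selection logic without hitting Google at all, here is a rough Python 3 sketch that uses only the standard library's html.parser in place of lxml. It collects the href of every a tag nested inside an h3 whose class list contains r, which is the same idea as the XPath, except that the XPath matches class="r" exactly. The class name ResultLinkParser and the SAMPLE_HTML string are made up for illustration; real Google markup may differ and changes over time.

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <h3 class="r">."""

    def __init__(self):
        super().__init__()
        self.in_result_heading = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h3" and "r" in (attrs.get("class") or "").split():
            # Entered a search-result heading.
            self.in_result_heading = True
        elif tag == "a" and self.in_result_heading and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_result_heading = False


# Hypothetical page fragment (an assumption, not real Google output).
SAMPLE_HTML = """
<div><h3 class="r"><a href="/url?q=http://newcoder.io/scrape/intro/&amp;sa=U">Intro</a></h3></div>
<div><a href="/preferences">Settings</a></div>
<div><h3 class="r"><a href="/url?q=http://webscraper.io/&amp;sa=U">Web Scraper</a></h3></div>
"""

parser = ResultLinkParser()
parser.feed(SAMPLE_HTML)
for link in parser.links:
    print(link)
```

The /preferences link is skipped because it is not inside an h3 with class r. Note that the parser unescapes &amp; in attribute values, so the collected hrefs contain a plain &sa=.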

Webscraping with Python (beginner)
