Get all <a> tags from <div> tags with a specific class

Time: 2019-07-31 04:18:52

Tags: python web-scraping beautifulsoup

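I am debugging and updating the lucky.py code from "Automate the Boring Stuff with Python". The main problem is that the author's code does not work (it is probably outdated). The script is meant to take command-line arguments and open the first five (or fewer) Google search results for those arguments in new browser tabs. The original code extracts all <a> tags with class 'r'. However, Google no longer puts the 'r' class on the search-result hyperlinks; instead it wraps those same <a> tags in a div with class 'r'.

This is what the original code does: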

res = requests.get('http://google.com/search?q=' +' '.join(sys.argv[1:]))
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'lxml')

linkElems = soup.select('.r a')
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    webbrowser.open('http://google.com' + linkElems[i].get('href'))
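
A side note on the snippet above: the search terms are concatenated straight into the URL, so characters such as & or # in the terms would change the meaning of the URL. A minimal sketch, assuming a hypothetical helper name build_search_url (not from the book) and using only the standard library, of how the query could be percent-encoded first:

```python
from urllib.parse import urlencode


def build_search_url(terms):
    """Join the search terms and encode them into a Google search URL.

    `build_search_url` is a hypothetical helper name used for illustration.
    """
    query = " ".join(terms)
    # urlencode escapes spaces, '&', '+', '#' and other reserved characters.
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    # In lucky.py the terms would come from sys.argv[1:].
    print(build_search_url(["python", "c++ & rust"]))
```

The returned string could then be passed to requests.get in place of the hand-built URL.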

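I have tried extracting all <a> tags that are directly contained in divs, but I could not find any way to extract all <a> tags that are directly contained in tags with class 'r'.

Here are a few things I have thought of, but they do not work properly.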

linkElems = soup.select('.r div > a')

Or, since all the <a> tags I want have a ping attribute that starts with '\url':

linkElems = soup.select('a')
for link in linkElems:
    if link.attrs.get('ping').startswith('\\url'):
        ...

3 answers:

Answer 0 (score: 1)

TL;DR: Google sends a different HTML response when the request comes from a Python script.

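Well, if you actually print the linkElems variable, you will see that it is empty. I think this is because Google changes its HTML depending on a number of HTTP headers. In layman's terms, the HTML you see in your browser is not the HTML you get when you make a GET request from Python.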
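
To illustrate the point about headers, here is a sketch. It assumes that a browser-like User-Agent is one of the headers that matters (the answer does not say which headers Google looks at), and the header values are made-up examples. The code only builds the request without sending it, so the outgoing URL and headers can be inspected offline.

```python
import requests

# Assumption: illustrative browser-like header values, not taken from the answer.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/75.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}


def prepare_search(query):
    """Build (but do not send) a search request that carries custom headers."""
    request = requests.Request(
        "GET",
        "https://www.google.com/search",
        params={"q": query},
        headers=HEADERS,
    )
    return request.prepare()


if __name__ == "__main__":
    prepared = prepare_search("speed test")
    print(prepared.url)
    print(prepared.headers["User-Agent"])
    # To actually send it: requests.Session().send(prepared)
```

Passing the same dict as headers=HEADERS to requests.get sets the same headers on a request that is sent immediately. Whether the returned HTML then matches what a browser sees is not guaranteed.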

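For now you can use linkElems = soup.select('.jfp3ef > a'), and it will work. It selects all <a> tags that are direct children of elements with the class jfp3ef. When the request is made from Python, jfp3ef appears to be the class Google uses instead of r. I would not put this into production, though, because the class name may change at any time.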
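
For anyone unsure how the child combinator (>) differs from the selectors tried in the question, here is a small sketch on hypothetical markup. The HTML is invented for illustration and is not what Google returns.

```python
import bs4

# Hypothetical markup, loosely modeled on the structure described in the question.
HTML = """
<div class="r">
  <a href="/direct">direct child</a>
  <div><a href="/nested">nested link</a></div>
</div>
<div class="other"><a href="/elsewhere">unrelated</a></div>
"""

soup = bs4.BeautifulSoup(HTML, "html.parser")


def hrefs(selector):
    """Return the href of every element matched by a CSS selector."""
    return [a["href"] for a in soup.select(selector)]


print(hrefs(".r a"))        # descendant combinator: any <a> inside .r
print(hrefs(".r > a"))      # child combinator: <a> directly under .r
print(hrefs(".r div > a"))  # <a> directly under a <div> that is inside .r
```

On this markup, '.r > a' matches only the link whose parent is the .r element itself, while '.r div > a' matches only the link inside the inner div.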

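A better and more reliable solution would be to use the Google Search API. But since you are doing this for learning purposes, the hack I mentioned above should be fine.

Code: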

import bs4
import requests

res = requests.get('http://google.com/search?q=test')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
linkElems = soup.select('.jfp3ef > a')
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    print('http://google.com' + linkElems[i].get('href'))

Output:

http://google.com/url?q=https://www.speedtest.net/&sa=U&ved=2ahUKEwjP9eumr97jAhX2GLkGHbGoDuoQFjAKegQIChAB&usg=AOvVaw0mhIK0jUq5fUfhEJTuA90h
http://google.com/url?q=https://fast.com/&sa=U&ved=2ahUKEwjP9eumr97jAhX2GLkGHbGoDuoQFjALegQICRAB&usg=AOvVaw3WERIy0Wo_UNyqmNAVBCeZ
http://google.com/url?q=https://openspeedtest.com/Get-widget.php&sa=U&ved=2ahUKEwjP9eumr97jAhX2GLkGHbGoDuoQFjAMegQICBAB&usg=AOvVaw1161mhQBhD75gfmsIzzg4n
http://google.com/url?q=https://www.meter.net/&sa=U&ved=2ahUKEwjP9eumr97jAhX2GLkGHbGoDuoQFjANegQIBxAB&usg=AOvVaw2Z3xTSmhoxz6VS7MYAaS2x
http://google.com/url?q=https://speedtest.telstra.com/&sa=U&ved=2ahUKEwjP9eumr97jAhX2GLkGHbGoDuoQFjAOegQIARAB&usg=AOvVaw36SosexF66e8fQUWIG14mZ

Answer 1 (score: 0)

This code works for me:

from bs4 import BeautifulSoup

soup = BeautifulSoup(res.text, "html.parser")
# "class name" is a placeholder for the class of the wrapping <div>
for div in soup.find_all("div", {"class": "class name"}):
    for a in div.find_all("a", {"class": "r"}):
        print(a.attrs['href'])

You can get all tags with a given name using find_all(), and if you want all tags with a specific attribute, pass a dict as a second argument to find_all().
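
A small self-contained sketch of both forms of find_all() described above. The markup and the class names (r, result, ad) are hypothetical placeholders chosen for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class names are placeholders.
HTML = """
<div class="r"><a href="/one" class="result">one</a></div>
<div class="r"><a href="/two">two</a></div>
<div class="ad"><a href="/three" class="result">three</a></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Every <a> inside a <div class="r">, regardless of the link's own attributes.
links_in_r = [
    a["href"]
    for div in soup.find_all("div", {"class": "r"})
    for a in div.find_all("a")
]

# Only <a> tags that carry class="result", anywhere in the document.
result_links = [a["href"] for a in soup.find_all("a", {"class": "result"})]

print(links_in_r)
print(result_links)
```

The first query filters on the wrapping div and takes any link inside it. The second filters on the link's own class, which is why it also picks up the link inside the "ad" div.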

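Answer 2 (score: 0)

Yes, the article seems to be outdated. There are no tags with class r (at least as far as I can see), but you can still select the links by their href attribute.

To select all <a> tags that have an href attribute starting with /url, you can use the CSS selector a[href^="/url"]: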

import bs4
import requests

search_term = 'tree'

res = requests.get('http://google.com/search?q=' + search_term)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'lxml')

for link in soup.select('a[href^="/url"]'):
    print(link['href'])

Prints:

/url?q=https://en.wikipedia.org/wiki/Tree&sa=U&ved=2ahUKEwj4iMW3rN7jAhWJxMQBHag1Cr4QFjAGegQIBxAB&usg=AOvVaw3paXH3cMIxBpu9X0bAY3mR
/url?q=https://en.wikipedia.org/wiki/Tree_line&sa=U&ved=2ahUKEwj4iMW3rN7jAhWJxMQBHag1Cr4Q0gIwBnoECAcQAg&usg=AOvVaw3ynJgH_Bbw1mSqAL8ovO7e
/url?q=https://en.wikipedia.org/wiki/Tree_(disambiguation)&sa=U&ved=2ahUKEwj4iMW3rN7jAhWJxMQBHag1Cr4Q0gIwBnoECAcQAw&usg=AOvVaw1Dcz4l8mkB9jZHqeJKT9B9
/url?q=https://en.wikipedia.org/wiki/Portal:Trees&sa=U&ved=2ahUKEwj4iMW3rN7jAhWJxMQBHag1Cr4Q0gIwBnoECAcQBA&usg=AOvVaw0mZS3EU93_a96SpiqfFG-R
/url?q=https://en.wikipedia.org/wiki/I-Tree&sa=U&ved=2ahUKEwj4iMW3rN7jAhWJxMQBHag1Cr4Q0gIwBnoECAcQBQ&usg=AOvVaw2lq87vNdcDmw0tCZxeIs_E

... and so on.

Edit: To filter out image links and internal account links, you can do this:

for link in soup.select('a[href^="/url"]'):
    if link.find('img'):
        continue
    if 'accounts.google.com' in link['href']:
        continue
    print(link['href'])

Prints:

/url?q=https://en.wikipedia.org/wiki/Tree&sa=U&ved=2ahUKEwj9m9KPsN7jAhXwxcQBHb7eDcIQFjAGegQIAxAB&usg=AOvVaw213y4pDofhSr3-AzbeN6Xe
/url?q=https://en.wikipedia.org/wiki/Tree_line&sa=U&ved=2ahUKEwj9m9KPsN7jAhXwxcQBHb7eDcIQ0gIwBnoECAMQAg&usg=AOvVaw0qQCjrcrP6YHGLeeSvYkN1
/url?q=https://en.wikipedia.org/wiki/Tree_(disambiguation)&sa=U&ved=2ahUKEwj9m9KPsN7jAhXwxcQBHb7eDcIQ0gIwBnoECAMQAw&usg=AOvVaw2OSqEJ_jRM_ByhjfvMSzjC
/url?q=https://en.wikipedia.org/wiki/Portal:Trees&sa=U&ved=2ahUKEwj9m9KPsN7jAhXwxcQBHb7eDcIQ0gIwBnoECAMQBA&usg=AOvVaw1Xh2A4mp3beT6zQNzS8aJD
/url?q=https://en.wikipedia.org/wiki/I-Tree&sa=U&ved=2ahUKEwj9m9KPsN7jAhXwxcQBHb7eDcIQ0gIwBnoECAMQBQ&usg=AOvVaw1ARsOn-3cMHsILu_-1AF-Q
/url?q=https://simple.wikipedia.org/wiki/Tree&sa=U&ved=2ahUKEwj9m9KPsN7jAhXwxcQBHb7eDcIQFjAHegQICBAB&usg=AOvVaw3J9VoAcyvn01DK6VQjQOcJ
/url?q=https://simple.wikipedia.org/wiki/Tree%23Parts_of_trees&sa=U&ved=2ahUKEwj9m9KPsN7jAhXwxcQBHb7eDcIQ0gIwB3oECAgQAg&usg=AOvVaw3uiAZjYQTYR02__Da6xkHi
/url?q=https://simple.wikipedia.org/wiki/Tree%23Records&sa=U&ved=2ahUKEwj9m9KPsN7jAhXwxcQBHb7eDcIQ0gIwB3oECAgQAw&usg=AOvVaw2jexFkOqkPQ3bHZ1q1KdKj
/url?q=https://simple.wikipedia.org/wiki/Tree%23Tree_value_estimation&sa=U&ved=2ahUKEwj9m9KPsN7jAhXwxcQBHb7eDcIQ0gIwB3oECAgQBA&usg=AOvVaw3URu63Yk-j0o-G75SSaeW3
/url?q=https://simple.wikipedia.org/wiki/Tree%23Tree_climbing&sa=U&ved=2ahUKEwj9m9KPsN7jAhXwxcQBHb7eDcIQ0gIwB3oECAgQBQ&usg=AOvVaw2YmeOvTuDS2cacWiM7Fzj6
/url?q=https://www.royalparks.org.uk/parks/the-regents-park/things-to-see-and-do/gardens-and-landscapes/tree-map/why-trees-are-important&sa=U&ved=2ahUKEwj9m9KPsN7jAhXwxcQBHb7eDcIQFjAIegQIARAB&usg=AOvVaw0uk4ZAk22_zyuVRPmGGEae
/url?q=https://www.homedepot.com/b/Outdoors-Garden-Center-Trees-Bushes/N-5yc1vZc8rq&sa=U&ved=2ahUKEwj9m9KPsN7jAhXwxcQBHb7eDcIQFjAJegQIAhAB&usg=AOvVaw1v36Vzsvx9s-0BPWGp3QrH
/url?q=https://www.britannica.com/plant/tree&sa=U&ved=2ahUKEwj9m9KPsN7jAhXwxcQBHb7eDcIQFjAKegQIABAB&usg=AOvVaw101wIJj19V4TEj57BCA7Xe
/url?q=https://www.nparks.gov.sg/trees&sa=U&ved=2ahUKEwj9m9KPsN7jAhXwxcQBHb7eDcIQFjALegQIBBAB&usg=AOvVaw3CDs1obwYNKnMwtMK2RBbG
/url?q=https://en.wiktionary.org/wiki/tree&sa=U&ved=2ahUKEwj9m9KPsN7jAhXwxcQBHb7eDcIQFjAMegQIBxAB&usg=AOvVaw3AJJuZ5vY3I8TqOSfKtVa4
/url?q=https://www.bbc.com/news/uk-england-47541491&sa=U&ved=2ahUKEwj9m9KPsN7jAhXwxcQBHb7eDcIQFjANegQIBRAB&usg=AOvVaw1d2QTAZ5JYAB9t6f11VY-_
/url?q=https://www.theguardian.com/world/2019/jul/29/ethiopia-plants-250m-trees-in-a-day-to-help-tackle-climate-crisis&sa=U&ved=2ahUKEwj9m9KPsN7jAhXwxcQBHb7eDcIQFjAOegQIBhAB&usg=AOvVaw0c6bDr70Km_E8v3wmey124
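
The hrefs printed above are Google redirect links, with the real destination stored in the q query parameter. A minimal sketch, assuming a hypothetical helper name target_url and using only the standard library, of how the destination could be pulled out. The example href is a shortened form of one of the lines above (ved and usg dropped).

```python
from urllib.parse import parse_qs, urlsplit


def target_url(href):
    """Return the destination stored in the q parameter of a /url?q=... href.

    Returns None when the href has no q parameter.
    `target_url` is a hypothetical helper name used for illustration.
    """
    query = urlsplit(href).query
    # parse_qs also percent-decodes the values (e.g. %23 becomes '#').
    values = parse_qs(query).get("q")
    return values[0] if values else None


if __name__ == "__main__":
    href = "/url?q=https://simple.wikipedia.org/wiki/Tree%23Records&sa=U"
    print(target_url(href))
```

The result could be passed directly to webbrowser.open instead of prefixing the href with http://google.com.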