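Getting "None" for an attribute when parsing with Beautiful Soup

Time: 2018-11-19 19:52:22

Tags: python beautifulsoup

I'm just getting started with Python and BeautifulSoup. I want to scrape a website with BS, but I don't understand the result of my code or how find and find_all are meant to be used. I want to get the URL from the href attribute.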

<div class="xBRiJc">
<a href="https://play.google.com/store/apps/collection/cluster? 
 clp=igNLChkKEzc4NDcxODQ2MTE5MjkxMDc4NTgQCBgDEiwKJmFhZGVtby5zdXBlcmF3ZXNvbWUudHYuYXdlc29tZWFkc2RlbW8yEAEYAxgB:S:ANO1ljKZ36s&amp;gsr=Ck6KA0sKGQoTNzg0NzE4NDYxMTkyOTEwNzg1OBAIGAMSLAomYWFkZW1vLnN1cGVyYXdlc29tZS50di5hd2Vzb21lYWRzZGVtbzIQARgDGAE%3D:S:ANO1ljKKOPI"> .   
 <h2 class="C7Bf8e bs3Xnd">SuperAwesome LTD</h2></a></div>

Here is my Python code:

    developer_link = bs.find("div",{"class":"xBRiJc"})
    print(developer_link.get('href'))

Why does my print call output "None" instead of the URL in the href attribute?

1 answer:

Answer 0 (score: 1)

You defined developer_link as the <div> tag that contains the link, not the link itself. Since the div tag itself has no "href" attribute, developer_link.get('href') returns None. So you just need to go one step further: look up the <a> tag inside the div, for example with developer_link.find("a"), and call .get('href') on that tag instead.
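To see why the original call prints None, it can help to inspect which attributes the div actually carries. The snippet below is only a minimal sketch: the markup is a shortened stand-in for the one in the question, the example.com URL is a placeholder, and the built-in "html.parser" is used just to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the markup in the question; the URL is a placeholder.
html = (
    '<div class="xBRiJc">'
    '<a href="https://example.com/dev"><h2>SuperAwesome LTD</h2></a>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
developer_link = soup.find("div", {"class": "xBRiJc"})

# The div only carries a class attribute, so .get("href") has nothing to return.
div_attrs = developer_link.attrs
div_href = developer_link.get("href")

# The href lives on the nested <a> tag.
a_href = developer_link.find("a").get("href")

print(div_attrs)
print(div_href)
print(a_href)
```

Tag.get behaves like dict.get here: a missing attribute yields None rather than raising an error, which is why the original code fails silently.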

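That said, looking at this example, I suspect the div's class is dynamically generated. If so, the class may no longer be "xBRiJc" when you revisit the page, which makes it an unreliable identifier for the link. If you just want the first link whose text contains "SuperAwesome LTD", you could match on that text alone with a bit of regex. And if you know the link directly contains an H2 tag whose exact text is "SuperAwesome LTD", you can anchor the lookup on that heading instead. For reference, here is the class-based lookup described above, run end to end: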

>>> pagecode = """
... <div class="xBRiJc">
... <a href="https://play.google.com/store/apps/collection/cluster?
...  clp=igNLChkKEzc4NDcxODQ2MTE5MjkxMDc4NTgQCBgDEiwKJmFhZGVtby5zdXBlcmF3ZXNvbWUudHYuYXdlc29tZWFkc2RlbW8yEAEYAxgB:S:ANO1ljKZ36s&amp;gsr=Ck6KA0sKGQoTNzg0NzE4NDYxMTkyOTEwNzg1OBAIGAMSLAomYWFkZW1vLnN1cGVyYXdlc29tZS50di5hd2Vzb21lYWRzZGVtbzIQARgDGAE%3D:S:ANO1ljKKOPI"> .
...  <h2 class="C7Bf8e bs3Xnd">SuperAwesome LTD</h2></a></div>
... """
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(pagecode, 'lxml')
>>> # Find the wrapping div first, then the <a> tag nested inside it.
>>> div = soup.find("div", class_="xBRiJc")
>>> link = div.find("a")
>>> # The href attribute belongs to the <a> tag, not to the div.
>>> print(link.get('href'))
https://play.google.com/store/apps/collection/cluster?
 clp=igNLChkKEzc4NDcxODQ2MTE5MjkxMDc4NTgQCBgDEiwKJmFhZGVtby5zdXBlcmF3ZXNvbWUudHYuYXdlc29tZWFkc2RlbW8yEAEYAxgB:S:ANO1ljKZ36s&gsr=Ck6KA0sKGQoTNzg0NzE4NDYxMTkyOTEwNzg1OBAIGAMSLAomYWFkZW1vLnN1cGVyYXdlc29tZS50di5hd2Vzb21lYWRzZGVtbzIQARgDGAE%3D:S:ANO1ljKKOPI
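
The heading-based lookup mentioned in the answer is not shown in the thread, so the following is only a rough sketch of one way it might be done. The helper name find_link_by_heading, the shortened markup, and the example.com URL are all assumptions made for illustration; it relies on the string argument of find and on find_parent from Beautiful Soup 4.

```python
import re

from bs4 import BeautifulSoup

# Shortened stand-in for the page; the URL is a placeholder, not the real one.
html = """
<div class="xBRiJc">
<a href="https://example.com/developer"> .
<h2 class="C7Bf8e bs3Xnd">SuperAwesome LTD</h2></a></div>
"""


def find_link_by_heading(soup, pattern):
    """Hypothetical helper: return the href of the <a> wrapping a matching <h2>.

    `pattern` is a regular expression searched against the heading text.
    Returns None when no heading matches or the heading is not inside a link.
    """
    heading = soup.find("h2", string=re.compile(pattern))
    if heading is None:
        return None
    link = heading.find_parent("a")
    if link is None:
        return None
    return link.get("href")


soup = BeautifulSoup(html, "html.parser")
href = find_link_by_heading(soup, r"SuperAwesome\s+LTD")
missing = find_link_by_heading(soup, r"Some Other Developer")

print(href)
print(missing)
```

Because the search keys on the visible heading text rather than the generated class name, it should keep working if the class changes, though it will break if the developer name or the heading level changes.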