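I'm just starting out with Python and BeautifulSoup. I want to scrape a website with BS, but I don't understand the result of my code or how find and find_all are meant to be used. I want to get the URL in the href attribute of this markup: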
<div class="xBRiJc">
<a href="https://play.google.com/store/apps/collection/cluster?
clp=igNLChkKEzc4NDcxODQ2MTE5MjkxMDc4NTgQCBgDEiwKJmFhZGVtby5zdXBlcmF3ZXNvbWUudHYuYXdlc29tZWFkc2RlbW8yEAEYAxgB:S:ANO1ljKZ36s&gsr=Ck6KA0sKGQoTNzg0NzE4NDYxMTkyOTEwNzg1OBAIGAMSLAomYWFkZW1vLnN1cGVyYXdlc29tZS50di5hd2Vzb21lYWRzZGVtbzIQARgDGAE%3D:S:ANO1ljKKOPI"> .
<h2 class="C7Bf8e bs3Xnd">SuperAwesome LTD</h2></a></div>
This is my Python code:
developer_link = bs.find("div",{"class":"xBRiJc"})
print(developer_link.get('href'))
Why does my print call output None instead of the URL in the href attribute?
Answer 0 (score: 1)
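You are defining developer_link as the <div> tag that contains the link, not as the link itself. Since the div tag has no href attribute, developer_link.get('href') returns None. So you just need to go one step further: look up the <a> tag inside the div, then read href from that tag.

Looking at this example, though, I'm guessing the div's class is dynamically generated. If so, the class may no longer be "xBRiJc" the next time you visit the page, which means it is not a reliable identifier for the link. If you just want the first link whose text contains "SuperAwesome LTD", you can use a regular expression to match on that text alone. And if you know the link has an H2 tag directly inside it whose exact text is "SuperAwesome LTD", you can locate that H2 and take its parent link instead.

The basic lookup, going from the div to the link inside it, looks like this: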
>>> pagecode = """
... <div class="xBRiJc">
... <a href="https://play.google.com/store/apps/collection/cluster?
... clp=igNLChkKEzc4NDcxODQ2MTE5MjkxMDc4NTgQCBgDEiwKJmFhZGVtby5zdXBlcmF3ZXNvbWUudHYuYXdlc29tZWFkc2RlbW8yEAEYAxgB:S:ANO1ljKZ36s&gsr=Ck6KA0sKGQoTNzg0NzE4NDYxMTkyOTEwNzg1OBAIGAMSLAomYWFkZW1vLnN1cGVyYXdlc29tZS50di5hd2Vzb21lYWRzZGVtbzIQARgDGAE%3D:S:ANO1ljKKOPI"> .
... <h2 class="C7Bf8e bs3Xnd">SuperAwesome LTD</h2></a></div>
... """
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(pagecode, 'lxml')
>>> div = soup.find("div", class_="xBRiJc")
>>> link = div.find("a")
>>> print(link.get('href'))
https://play.google.com/store/apps/collection/cluster?
clp=igNLChkKEzc4NDcxODQ2MTE5MjkxMDc4NTgQCBgDEiwKJmFhZGVtby5zdXBlcmF3ZXNvbWUudHYuYXdlc29tZWFkc2RlbW8yEAEYAxgB:S:ANO1ljKZ36s&gsr=Ck6KA0sKGQoTNzg0NzE4NDYxMTkyOTEwNzg1OBAIGAMSLAomYWFkZW1vLnN1cGVyYXdlc29tZS50di5hd2Vzb21lYWRzZGVtbzIQARgDGAE%3D:S:ANO1ljKKOPI
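
The session above still relies on the div's class. The two text-based lookups the answer describes could look roughly like the sketch below. It is only an illustration: the helper names link_by_heading and link_by_pattern are made up here, the long query string is shortened to a placeholder, and the built-in html.parser is used so that lxml is not required.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question. The query string is a
# placeholder, not the real value.
html = """
<div class="xBRiJc">
<a href="https://play.google.com/store/apps/collection/cluster?clp=PLACEHOLDER">
<h2 class="C7Bf8e bs3Xnd">SuperAwesome LTD</h2></a></div>
"""

soup = BeautifulSoup(html, "html.parser")


def link_by_heading(soup, text):
    """Find an <h2> whose text is exactly `text` and return the href of
    the <a> that encloses it, or None if there is no such heading/link."""
    h2 = soup.find("h2", string=text)
    if h2 is None:
        return None
    a = h2.find_parent("a")
    return a.get("href") if a is not None else None


def link_by_pattern(soup, pattern):
    """Return the href of the first <a> whose visible text matches the
    regular expression `pattern`, or None if nothing matches."""
    regex = re.compile(pattern)
    # find_all returns a list of every matching tag; find returns only
    # the first match (or None).
    for a in soup.find_all("a", href=True):
        if regex.search(a.get_text()):
            return a["href"]
    return None


print(link_by_heading(soup, "SuperAwesome LTD"))
print(link_by_pattern(soup, r"SuperAwesome"))
```

Neither helper depends on the class name, so both keep working if "xBRiJc" changes between page loads. Both return None rather than raising when nothing matches, which mirrors how find itself behaves.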