Question

Can anyone help me traverse an HTML tree with Beautiful Soup?

I'm trying to parse HTML output, collect each value, and then insert it into a table named Tld using Python/Django.

<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>
</h3>

I only want to extract the value of the href attribute of the <a> element, that is, only this part:

https://billing.anapp.com/

of:

<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>

I currently have:

for url in urls:
    mb.open(url)
    beautifulSoupObj = BeautifulSoup(mb.response().read())
    beautifulSoupObj.find_all('h3',attrs={'class': 'r'})

The problem with the above is that find_all only gets as far as the h3 elements; it doesn't go deep enough to reach the a elements.

Any help is much appreciated. Thanks.

Answer 1

from bs4 import BeautifulSoup

html = """
<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>
</h3>
"""

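# Name the parser explicitly so bs4 does not have to guess (and warn).
bs = BeautifulSoup(html, "html.parser")
# "h3.r a" matches every <a> nested inside an <h3 class="r">.
elms = bs.select("h3.r a")
for i in elms:
    print(i.attrs["href"])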

Prints:

https://billing.anapp.com/

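h3.r a is a CSS selector.

You can use CSS selectors (I prefer them), XPath (through lxml, since Beautiful Soup itself has no XPath support), or searching within elements. The selector h3.r a finds every h3 with class r and takes the a elements inside it. A selector can be more complex, for example #an_id table tr.the_tr_class td.the_td_class, which finds each td with the given class inside a tr with the given class, inside a table, all under the element with the given id.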
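To make the longer selector concrete, here is a small sketch. The markup is made up purely for illustration (only the id and class names come from the selector above), and it assumes bs4 with the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented only to exercise the longer selector.
sample_html = """
<div id="an_id">
  <table>
    <tr class="the_tr_class">
      <td class="the_td_class">match</td>
      <td class="other">wrong td class</td>
    </tr>
    <tr class="other_row">
      <td class="the_td_class">wrong tr class</td>
    </tr>
  </table>
</div>
<table>
  <tr class="the_tr_class"><td class="the_td_class">outside the id</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Each space in the selector means "descendant of", so every condition
# (id, table, tr class, td class) has to hold along the ancestor chain.
cells = soup.select("#an_id table tr.the_tr_class td.the_td_class")
matched = [td.get_text(strip=True) for td in cells]
print(matched)
```

Only the first cell should survive, since the other candidates each break one link in the chain.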

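The following also gives you the same result. find_all returns a list of bs4.element.Tag objects. find_all also has a recursive parameter, but I'm not sure whether this can be done in one line. Personally I prefer CSS selectors because they are simple and clear.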

for elm in bs.find_all('h3', attrs={'class': 'r'}):
    for a_elm in elm.find_all("a"):
        print(a_elm.attrs["href"])
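On the one-line question: the nested loop above can probably be collapsed into a single list comprehension. The sketch below is one possible way, not the only one. The HTML is a trimmed copy of the snippet from the question plus one extra, hypothetical h3 without an href, added to show why a.get("href") may be safer than a.attrs["href"]:

```python
from bs4 import BeautifulSoup

html = """
<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/">Billing: Portal Home</a>
</h3>
<h3 class="r"><a>no href here</a></h3>
</div>
"""

bs = BeautifulSoup(html, "html.parser")

# Nested find_all calls collapsed into one comprehension.
# a.get("href") returns None instead of raising KeyError when href is missing,
# and the trailing "if" drops those anchors.
hrefs = [
    a.get("href")
    for h3 in bs.find_all("h3", attrs={"class": "r"})
    for a in h3.find_all("a")
    if a.get("href")
]

# The same idea as a CSS selector: [href] keeps only anchors that have the attribute.
hrefs_css = [a["href"] for a in bs.select("h3.r a[href]")]

print(hrefs)
print(hrefs_css)
```

Both lists should contain just the billing URL. Either one could then be looped over to create the Tld rows mentioned in the question.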

Parse href attribute value from element with Beautifulsoup and Mechanize
