Question

I'm new to Python and am trying to parse some simple HTML. One thing is blocking me, though. For example, I have this HTML:

<div class = "quote">
<div class = "whatever"> 
some unnecessary text here 
</div>
<div class = "text">
Here's the desired text!
</div>
</div>

I need to extract the text from the second div (the one with class "text"). This is how I get to it:

print repr(link.find('div').findNextSibling())

However, this returns the whole div, including the "div" tags: <div class="text">Here's the desired text!</div>

And I don't know how to get just the text.

Adding .text yields strings like \u043a\u0430\u043a \u0440\u0430\u0437\u0440\u0430\u0431
Adding .strings returns "None"
Adding .string returns both "None" and strings like \u042f\u0445\u0438\u043a\u043e - \u0435\u0441\u043b\u0438

Perhaps repr is the problem?

P.S. I also need to keep the tags inside the div.

Answer 1

Why not search for the <div> element by its class attribute? The following seems to work for me:

from bs4 import BeautifulSoup

html = '''<div class = "quote">
<div class = "whatever"> 
some unnecessary text here 
</div>
<div class = "text">
Here's the desired text!
</div>
</div>'''


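# Name the parser explicitly so the result does not depend on which parsers are installed.
link = BeautifulSoup(html, 'html.parser')
# Find the div by its class, then take its text with surrounding whitespace removed.
print(link.find('div', class_="text").text.strip())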

This produces:

Here's the desired text!
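
The P.S. in the question asks about keeping the tags inside the div. A minimal sketch of one way to do that is below, assuming the goal is the div's inner markup without the enclosing <div> itself. The <b> tag in the sample markup and the variable names are hypothetical, added only to make the difference between the two calls visible.

```python
from bs4 import BeautifulSoup

# Hypothetical variant of the question's markup: a <b> tag is added inside
# the target div so that "text only" and "inner markup" differ.
html = '''<div class="quote">
<div class="whatever">some unnecessary text here</div>
<div class="text">Here's the <b>desired</b> text!</div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
target = soup.find('div', class_='text')

# Text only: nested tags are dropped, the text inside them is kept.
plain = target.get_text().strip()

# Inner markup: the div's children serialized as a string,
# without the enclosing <div> tag.
inner = target.decode_contents().strip()

print(plain)
print(inner)
```

As for the \uXXXX sequences in the question: in Python 2, repr() of a unicode string escapes non-ASCII characters, so printing the string itself rather than its repr() should show the actual characters.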

How do I correctly get an element with BeautifulSoup?
