Question

Given the following code:

<html>
<body>
<div class="category1" id="foo">
      <div class="category2" id="bar">
            <div class="category3">
            </div>
            <div class="category4">
                 <div class="category5"> test
                 </div>
            </div>
      </div>
</div>
</body>
</html>

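How can I use BeautifulSoup to extract the word test from <div class="category5"> test, i.e. how do I deal with nested divs? I searched the Internet but could not find an easy-to-grasp example that covers this, so I set up this one. Thanks.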

Answer 1

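XPath would be the straightforward answer, but BeautifulSoup does not support it.

Update: a BeautifulSoup solution

Given that you know the class and the element (div) in this case, you can use a for loop with attrs to get what you want: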

from bs4 import BeautifulSoup

html = '''
<html>
<body>
<div class="category1" id="foo">
      <div class="category2" id="bar">
            <div class="category3">
            </div>
            <div class="category4">
                 <div class="category5"> test
                 </div>
            </div>
      </div>
</div>
</body>
</html>'''

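# Name the parser explicitly; html.parser is bundled with Python.
content = BeautifulSoup(html, 'html.parser')

# find_all returns every matching div, however deeply it is nested.
for div in content.find_all('div', attrs={'class': 'category5'}):
    print(div.text)

test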

I have no problem extracting the text from your HTML sample. As @MartijnPieters suggested, you need to find out why your div element is missing.
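If the nesting itself matters (for example, when the same class appears elsewhere in the page), the path can also be spelled out level by level. The following is a minimal sketch, assuming a recent bs4 where `select_one` is available; it uses the ids and classes from the question's sample and the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

html = """
<div class="category1" id="foo">
  <div class="category2" id="bar">
    <div class="category3"></div>
    <div class="category4">
      <div class="category5"> test
      </div>
    </div>
  </div>
</div>
"""

# html.parser ships with Python, so no extra parser install is needed.
soup = BeautifulSoup(html, "html.parser")

# Option 1: a CSS selector that spells out each level of nesting.
by_selector = soup.select_one("#foo > #bar > div.category4 > div.category5")

# Option 2: chain find() calls, one per level of nesting.
by_chain = (
    soup.find("div", id="foo")
    .find("div", id="bar")
    .find("div", class_="category4")
    .find("div", class_="category5")
)

# get_text(strip=True) drops the whitespace around the word.
word = by_selector.get_text(strip=True)
print(word)
```

Both options reach the same element; the selector form is shorter, while the chained form makes it easier to see which level failed if one of the find() calls returns None.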

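Another update

You are missing lxml as a parser for BeautifulSoup, which is why None was returned: nothing was parsed in the first place. Installing lxml should solve your problem.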
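A small sketch of how to make this kind of failure easier to spot: name the parser explicitly and check the result of find() before using it, since find() returns None when nothing matches. The helper name `text_of_first` is hypothetical, and `html.parser` is used here only because it needs no extra install.

```python
from bs4 import BeautifulSoup

html = '<div class="category4"><div class="category5"> test\n</div></div>'

# Naming the parser makes it obvious which one is actually in use.
soup = BeautifulSoup(html, "html.parser")


def text_of_first(soup, css_class):
    """Return the stripped text of the first div with this class, or None."""
    div = soup.find("div", class_=css_class)
    if div is None:
        # Nothing matched: report that instead of failing on None.text.
        return None
    return div.get_text(strip=True)


found = text_of_first(soup, "category5")
missing = text_of_first(soup, "no-such-class")
print(found, missing)
```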

You might also consider lxml or a similar library that supports XPath, which is dead easy if you ask me.

from lxml import etree

tree = etree.fromstring(html) # or etree.parse from source
tree.xpath('.//div[@class="category5"]/text()')
[' test\n                 ']
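If installing lxml is not an option, the standard library's `xml.etree.ElementTree` understands a limited XPath subset. This is only a sketch under a strong assumption: the markup must be well-formed XML, as the question's sample is, and real-world HTML often is not. ElementTree has no `text()` step, so the text is read from each matched element instead.

```python
import xml.etree.ElementTree as ET

html = """<html>
<body>
<div class="category1" id="foo">
  <div class="category2" id="bar">
    <div class="category3"></div>
    <div class="category4">
      <div class="category5"> test
      </div>
    </div>
  </div>
</div>
</body>
</html>"""

root = ET.fromstring(html)

# .// searches all descendants; [@class='...'] filters on the attribute value.
matches = root.findall(".//div[@class='category5']")

# Strip the surrounding whitespace and newline from each element's text.
texts = [el.text.strip() for el in matches]
print(texts)
```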
