Question

I have some HTML documents and I want to extract a very specific piece of text from them. This text is always located inside

<div class = "fix">text </div>

However, sometimes there are other divs opened inside it, something like:

 <div class = "fix"> part of text <div something> other text </div> some more text </div>

Now, I want to extract all of the text that corresponds to the

 <div class = "fix">                     </div>

markup.

How can I do this?

Answer 1

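I would use the BeautifulSoup library. It is built for pretty much exactly this, and as long as your data is valid HTML it should find what you are looking for. The documentation is reasonably good and very straightforward, even for beginners. If your files live on a website where you cannot get at the HTML directly, fetch the HTML with urllib.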

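from bs4 import BeautifulSoup
# html_doc is assumed to hold the HTML source as a string
soup = BeautifulSoup(html_doc, "html.parser")
# class is a Python keyword, so pass it through attrs (or use class_="fix")
fix_div = soup.find("div", attrs={"class": "fix"})
text = fix_div.get_text()  # text of the div, nested tags included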

If there are multiple matches, use find_all. This should give you (roughly) what you are looking for.

Edit: fixed the example (class is a keyword, so you can't use the usual attr="blah" form).
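
For the case with several matching divs, here is a minimal sketch of the find_all approach. It assumes the bs4 package is installed and uses the built-in "html.parser" backend. The html_doc sample is made up from the snippets in the question, not taken from real data.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two "fix" divs, the second one containing a nested div.
html_doc = """
<div class = "fix">text </div>
<div class = "fix"> part of text <div something> other text </div> some more text </div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns every matching tag. get_text() concatenates the text of the
# tag and all of its descendants, so text inside nested divs is included.
texts = [div.get_text() for div in soup.find_all("div", class_="fix")]

for t in texts:
    print(repr(t))
```

Calling .strip() on each result, or passing strip=True to get_text(), removes the surrounding whitespace if it is not wanted.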

Answer 2

Here is a very simple solution that uses a non-greedy regular expression to strip all HTML tags:

import re
s =  "<div class = \"fix\"> part of text <div something> other text </div> some more text </div>"
s_text = re.sub(r'<.*?>', '', s)

The resulting values:

print(s)
<div class = "fix"> part of text <div something> other text </div> some more text </div>
print(s_text)
 part of text  other text  some more text
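
Note that the regex above strips tags from the whole string, so if the document contains anything outside the "fix" div, that text is kept too. As a rough sketch of one way to keep only the text inside the "fix" div using just the standard library, the following tracks div nesting depth with html.parser. The class name FixDivTextExtractor and the sample string are assumptions for illustration, and the sketch assumes the divs are properly closed.

```python
from html.parser import HTMLParser


class FixDivTextExtractor(HTMLParser):
    """Collects the text inside each <div class="fix">, nested divs included."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # div nesting depth inside the current "fix" div (0 = outside)
        self.chunks = []   # text pieces of the "fix" div currently being read
        self.results = []  # one joined string per completed "fix" div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # a div nested inside a "fix" div
        elif "fix" in (dict(attrs).get("class") or "").split():
            self.depth = 1   # entering a "fix" div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:  # the "fix" div itself just closed
                self.results.append("".join(self.chunks))
                self.chunks = []

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


# Hypothetical noisy input: the question's snippet surrounded by unrelated markup.
s = ('<p>noise</p><div class = "fix"> part of text <div something> other text </div>'
     ' some more text </div><p>more noise</p>')
parser = FixDivTextExtractor()
parser.feed(s)
parser.close()
print(parser.results)
```

Only the text between the opening "fix" div and its matching close ends up in parser.results; the "noise" paragraphs are skipped.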

Extracting text from a noisy string (Python)
