Question

Given some HTML code, how can I remove all tags while keeping the text and the img and a tags? For example, I have

<div><script bla bla></script><p>Hello all <a href ="xx"></a> <img rscr="xx"></img></p></div>

I want to keep

Hello all <a href ="xx"></a> <img rscr="xx"></img>

Is there something implemented in BeautifulSoup or Python that does this?

Thanks

Answer 1

You can use find_next to get the next element in the DOM. When the next element is a descendant of the tag, there is a shortcut:

soup.div.text, soup.div.a, soup.div.img

out:

('Hello all  ', <a href="xx"></a>, <img rscr="xx"/>)

When you use bs4's parser, the 'img' tag becomes a self-closing tag.
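As a rough illustration of the find_next approach mentioned above, the sketch below parses the HTML from the question and fetches the first text node, the first a tag, and the first img tag that follow the div. The parser choice ("html.parser") and the variable names are assumptions made for this example, not part of the original answer.

```python
from bs4 import BeautifulSoup

# HTML taken from the question; "html.parser" is an assumed parser choice.
html = '<div><script bla bla></script><p>Hello all <a href ="xx"></a> <img rscr="xx"></img></p></div>'
soup = BeautifulSoup(html, "html.parser")

div = soup.div

# find_next walks forward through the document from the div
# and returns the first node that matches the filter.
first_text = div.find_next(string=True)  # first text node after the div's start tag
link = div.find_next("a")                # first <a> tag
image = div.find_next("img")             # first <img> tag

print(repr(first_text), link, image)
```

Because all three nodes here are descendants of the div, the attribute shortcut (soup.div.a, soup.div.img) reaches the same tags; find_next also works when the target is not a descendant.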

Answer 2

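You can select all descendant nodes by accessing the .descendants property.

From there, you can iterate over all the descendants and filter them based on the name property. If a node has no name property, it is most likely a text node, which you want to keep. If the name property is a or img, you can keep it as well.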

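from bs4 import BeautifulSoup

html = '<div><script bla bla></script><p>Hello all <a href ="xx"></a> <img rscr="xx"></img></p></div>'
soup = BeautifulSoup(html, 'html.parser')

# This should be the wrapper that you are targeting
container = soup.find('div')
keep = []

# Text nodes have no name; also keep <a> and <img> tags
for node in container.descendants:
    if not node.name or node.name in ('a', 'img'):
        keep.append(node)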

Here is an alternative where the filtered elements are used to build the list directly:

# This should be the wrapper that you are targeting
container = soup.find('div')

keep = [node for node in container.descendants
        if not node.name or node.name == 'a' or node.name == 'img']

Also, if you don't want whitespace-only strings to be returned, you can strip the whitespace and check what remains:

keep = [node for node in container.descendants
        if (not node.name and len(node.strip())) or
           (node.name == 'a' or node.name == 'img')]

Based on the HTML you provided, the following is returned:

> ['Hello all ', <a href="xx"></a>, <img rscr="xx"/>]
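If the goal is a cleaned HTML string rather than a list of nodes, one possible alternative is to modify the tree in place: drop script and style elements with decompose() and replace every other non-whitelisted tag by its children with unwrap(). This is a different technique from the .descendants filter above; the helper name strip_tags and its whitelist parameter are hypothetical, chosen only for this sketch.

```python
from bs4 import BeautifulSoup


def strip_tags(markup, whitelist=("a", "img")):
    """Hypothetical helper: keep text and whitelisted tags, drop all other tags."""
    soup = BeautifulSoup(markup, "html.parser")

    # Remove script/style elements together with their contents,
    # so that no JavaScript or CSS leaks into the resulting text.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()

    # Replace every remaining non-whitelisted tag with its children.
    for tag in soup.find_all(True):
        if tag.name not in whitelist:
            tag.unwrap()

    return str(soup)


html = '<div><script bla bla></script><p>Hello all <a href ="xx"></a> <img rscr="xx"></img></p></div>'
result = strip_tags(html)
print(result)
```

Unlike the list built from .descendants, this keeps text nested inside a whitelisted tag only once, as part of that tag.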

BeautifulSoup: remove all HTML tags except those in a whitelist, such as the "img" and "a" tags, with Python
