Question

I have an XML dataset whose tags are formatted like this:

<catchphrase "id=c0">unconscionable conduct</catchphrase>

I think that when the dataset was created, the id attribute was not formatted the way it should have been:

<catchphrase id="c0">unconscionable conduct</catchphrase>

However, when I run it through the Beautiful Soup library in Python, it comes out as follows:

 soup = BeautifulSoup(content, 'xml')

Result:

 <catchphrase>
   "id=c0"&gt;application for leave to appeal
  </catchphrase>

or

soup = BeautifulSoup(content, 'lxml')

Result:

<html>
   <body>
    ...
     <catchphrase>
         application for leave to appeal
     </catchphrase>
    ....

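I would like the output to look like the second one, but without the html and body tags (this is an XML document). I don't need the id attribute. I also call soup.prettify('utf-8') before writing it to a file, but I think the markup has already been mangled by that point.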

Answer 1

There is no standard way to do this, but what you can do is replace the faulty part with the correct form, like so:

from bs4 import BeautifulSoup

content = '<catchphrase "id=c0">unconscionable conduct</catchphrase>'

# Move the misplaced opening quote so that "id=c0" becomes id="c0" before parsing.
soup = BeautifulSoup(content.replace('"id=', 'id="'), 'xml')
print(soup)

This results in:

<catchphrase id="c0">unconscionable conduct</catchphrase>

This is definitely a hack, since there is no standard way to handle this, mainly because the XML is expected to be well-formed before BeautifulSoup parses it.
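If a plain string replace feels too broad (it would also touch any `"id=` that happens to appear in the text content), a regular expression limited to the start of a tag is one possible alternative. The following is only a sketch under assumptions: the names `MALFORMED_ATTR`, `repair_quoted_attribute` and `drop_quoted_attribute` are made up for illustration, the pattern handles only a single malformed attribute directly after the tag name, and the standard-library `xml.etree.ElementTree` is used just to check that the repaired string is well-formed.

```python
import re
import xml.etree.ElementTree as ET

# Matches an opening tag name followed by an attribute written as "name=value"
# (quotes around the whole pair) instead of name="value".
# Assumption: only the first attribute after the tag name is malformed.
MALFORMED_ATTR = re.compile(r'(<[A-Za-z_][\w.-]*\s+)"([A-Za-z_][\w.-]*)=([^"<>]*)"')


def repair_quoted_attribute(text):
    """Rewrite <tag "name=value"> as <tag name="value">."""
    return MALFORMED_ATTR.sub(r'\1\2="\3"', text)


def drop_quoted_attribute(text):
    """Remove the malformed attribute entirely, keeping only the tag name."""
    return MALFORMED_ATTR.sub(lambda m: m.group(1).rstrip(), text)


content = '<catchphrase "id=c0">unconscionable conduct</catchphrase>'

fixed = repair_quoted_attribute(content)
stripped = drop_quoted_attribute(content)

# Parsing succeeds only if the repaired markup is well-formed XML.
element = ET.fromstring(fixed)

print(fixed)
print(stripped)
print(element.attrib)
```

Either result can then be passed to `BeautifulSoup(..., 'xml')` exactly as in the snippet above. Since the question says the id attribute is not needed, the `drop_quoted_attribute` variant may be the more direct route.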
