Why does Beautiful Soup add an extra XML declaration to the document, and how do I remove it?

Time: 2016-01-12 12:15:03

Tags: python xml beautifulsoup

I am trying to parse a simple XML document that has a header. Here is the code:

str(BeautifulSoup("""
<?xml version="1.0" encoding="UTF-8"?>
<data/>
""", features='xml'))

The output is:

<?xml version="1.0" encoding="utf-8"?>
<?xml version="1.0" encoding="UTF-8"><data/>

As we can see, there is an extra header, and it is also malformed. Is this a bug, or am I doing something wrong?

2 answers:

Answer 0: (score: 1)

When you pass 'xml' as the features argument, lxml builds the XML tree itself, so you do not need to supply the header yourself.

>>> from bs4 import BeautifulSoup
>>> str(BeautifulSoup("""
... <data/>
... """, features='xml'))
'<?xml version="1.0" encoding="utf-8"?>\n<data/>'

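Answer 1: (score: 0)

Is this a bug, or am I doing something wrong?

Short answer: yes, you are doing something wrong.

How?

The reason you get two XML declarations is that you are passing the features argument, which Beautiful Soup uses to build the tree: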

if builder is None:
    if isinstance(features, basestring):
        features = [features]
    if features is None or len(features) == 0:
        features = self.DEFAULT_BUILDER_FEATURES
    builder_class = builder_registry.lookup(*features)
    if builder_class is None:
        raise FeatureNotFound(
            "Couldn't find a tree builder with the features you "
            "requested: %s. Do you need to install a parser library?"
            % ",".join(features))
    builder = builder_class()
self.builder = builder
self.is_xml = builder.is_xml
self.builder.soup = self

But that is not the whole story. self.is_xml is used in .decode(), which returns a string or Unicode representation of the document and, when self.is_xml is true, adds an XML declaration to the tree:

if self.is_xml:
    # Print the XML declaration
    encoding_part = ''
    if eventual_encoding != None:
        encoding_part = ' encoding="%s"' % eventual_encoding
    prefix = u'<?xml version="1.0"%s?>\n' % encoding_part
    ...

In the end, you end up with two XML declarations.
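
The effect of that prefix can be reproduced without Beautiful Soup. The following is only a minimal sketch modelled on the excerpt above, not the library's actual code: xml_declaration, render, and tree_markup are hypothetical names, and tree_markup stands in for a serialized tree that still contains the document's own declaration.

```python
def xml_declaration(eventual_encoding=None):
    """Build the declaration prefix the same way the excerpt above does."""
    encoding_part = ""
    if eventual_encoding is not None:
        encoding_part = ' encoding="%s"' % eventual_encoding
    return '<?xml version="1.0"%s?>\n' % encoding_part


def render(tree_markup, is_xml, eventual_encoding="utf-8"):
    """Hypothetical stand-in for decode(): prepend the prefix for XML output."""
    prefix = xml_declaration(eventual_encoding) if is_xml else ""
    return prefix + tree_markup


# Assumption: the tree kept the original declaration as an ordinary node.
tree_markup = '<?xml version="1.0" encoding="UTF-8"?><data/>'
output = render(tree_markup, is_xml=True)

# The prefix is added unconditionally, so the declaration appears twice.
declaration_count = output.count("<?xml")
print(output)
print(declaration_count)
```

Because the prefix is added whenever the XML flag is set, any declaration that survives in the tree itself is duplicated in the output.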

How do I fix this?

You need to pass the parser name, 'xml', as the second argument to the BeautifulSoup constructor, as described in the documentation.

>>> from bs4 import BeautifulSoup
>>> doc = '''<?xml version="1.0" encoding="UTF-8"?>
... <data/>'''
>>> soup = BeautifulSoup(doc, 'xml')
>>> str(soup)
'<?xml version="1.0" encoding="utf-8"?>\n<data/>'
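
Note that doc above starts directly with the XML declaration, whereas the string in the question starts with a newline. The XML specification requires the declaration to be the very first thing in the document, so that difference may matter. The sketch below illustrates the rule with the standard library's xml.etree.ElementTree rather than Beautiful Soup. How Beautiful Soup and lxml react to the leading newline is an assumption here, not something verified for the versions in the question, and normalize_xml_source is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def normalize_xml_source(doc: bytes) -> bytes:
    """Hypothetical helper: drop whitespace that precedes the XML declaration."""
    return doc.lstrip()


# Same shape as the string in the question: a newline comes before the declaration.
raw = b'\n<?xml version="1.0" encoding="UTF-8"?>\n<data/>\n'

try:
    ET.fromstring(raw)
    strict_parse_ok = True
except ET.ParseError:
    # A strict parser rejects a declaration that is not at the very start.
    strict_parse_ok = False

# After stripping the leading whitespace, the same document parses cleanly.
root = ET.fromstring(normalize_xml_source(raw))
print(strict_parse_ok, root.tag)
```

If the leading whitespace is indeed the cause, stripping it before handing the string to BeautifulSoup would be one way to avoid the second declaration.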