I am trying to parse a simple XML document that has a header (an XML declaration). This is the code:
str(BeautifulSoup("""
<?xml version="1.0" encoding="UTF-8"?>
<data/>
""", features='xml'))
The output comes out as:
<?xml version="1.0" encoding="utf-8"?>
<?xml version="1.0" encoding="UTF-8"><data/>
As we can see, there is an extra header, and the second one is not well formed either. Is this a bug, or am I doing something wrong?
Answer 0 (score: 1)
When you pass xml to the features parameter, lxml builds the XML tree itself. So you do not need to set the header yourself.
>>> str(BeautifulSoup("""
... <data/>
... """, features='xml'))
'<?xml version="1.0" encoding="utf-8"?>\n<data/>'
>>>
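If the input already carries a declaration (for example because it was read from a file), one option is to drop it before parsing. The following is a minimal sketch using only the standard library. `strip_xml_declaration` is a hypothetical helper name, not part of Beautiful Soup, and the regular expression assumes the declaration sits at the very start of the string, optionally preceded by whitespace.

```python
import re

# Matches an XML declaration at the start of the text, e.g.
# <?xml version="1.0" encoding="UTF-8"?>, plus any whitespace around it.
_XML_DECL = re.compile(r'^\s*<\?xml\b[^>]*\?>\s*')


def strip_xml_declaration(text):
    """Return text without a leading XML declaration (hypothetical helper)."""
    return _XML_DECL.sub('', text, count=1)


doc = '''
<?xml version="1.0" encoding="UTF-8"?>
<data/>
'''
body = strip_xml_declaration(doc)
print(repr(body))
```

The resulting `body` could then be passed to `BeautifulSoup(body, features='xml')` in the same way as the declaration-free input in the example above.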
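Answer 1 (score: 0)
"Is this a bug or am I doing something wrong?"
Short answer: yes, you are doing it wrong.
The reason you get two XML declarations is that you are passing the features parameter, which Beautiful Soup uses to build the tree.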
if builder is None:
    if isinstance(features, basestring):
        features = [features]
    if features is None or len(features) == 0:
        features = self.DEFAULT_BUILDER_FEATURES
    builder_class = builder_registry.lookup(*features)
    if builder_class is None:
        raise FeatureNotFound(
            "Couldn't find a tree builder with the features you "
            "requested: %s. Do you need to install a parser library?"
            % ",".join(features))
    builder = builder_class()
self.builder = builder
self.is_xml = builder.is_xml
self.builder.soup = self
But that is not the whole story. self.is_xml is used in .decode(), which returns a string or Unicode representation of the document and, when self.is_xml is true, adds an XML declaration to the tree.
if self.is_xml:
    # Print the XML declaration
    encoding_part = ''
    if eventual_encoding != None:
        encoding_part = ' encoding="%s"' % eventual_encoding
    prefix = u'<?xml version="1.0"%s?>\n' % encoding_part
...
In the end, you get two XML declarations: the one from your input and the one added on output.
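The mechanism can be imitated without Beautiful Soup. The sketch below is not the library's code: `render` and its parameters are hypothetical names that only reproduce the prefix logic quoted above, under the assumption that the parsed tree still contains the declaration from the input.

```python
def render(tree_text, is_xml, eventual_encoding="utf-8"):
    """Imitate the quoted prefix logic (hypothetical, not the bs4 API)."""
    prefix = ''
    if is_xml:
        # Same idea as the quoted snippet: always prepend a declaration.
        encoding_part = ''
        if eventual_encoding is not None:
            encoding_part = ' encoding="%s"' % eventual_encoding
        prefix = '<?xml version="1.0"%s?>\n' % encoding_part
    return prefix + tree_text


# The input's own declaration is assumed to survive as part of the tree text ...
with_decl = render('<?xml version="1.0" encoding="UTF-8"?><data/>', is_xml=True)
# ... versus an input that had no declaration of its own.
without_decl = render('<data/>', is_xml=True)

print(with_decl.count('<?xml'))     # 2
print(without_decl.count('<?xml'))  # 1
```

Because the prefix is added unconditionally in XML mode, any declaration that is still present in the serialized tree ends up duplicated.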
You need to pass the parser name, 'xml', as the second argument to the BeautifulSoup constructor, as described in the documentation.
>>> from bs4 import BeautifulSoup
>>> doc = '''<?xml version="1.0" encoding="UTF-8"?>
... <data/>'''
>>> soup = BeautifulSoup(doc, 'xml')
>>> str(soup)
'<?xml version="1.0" encoding="utf-8"?>\n<data/>'