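I'm new to lxml and have an XML file containing 40,000 rows of data. I have read the lxml tutorial, but I'm not sure which built-in function is best suited to my goal: extracting the text of a certain element, depending on whether that element exists. The XML file is structured as follows: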
<main>
<header>
<row type="info">
<field name="description"><![CDATA[Results]]></field>
<field name="created"><![CDATA[6/12/2014 6:45:00 PM]]></field>
</row>
<row>
<field name="profile"><![CDATA[Intel]]></field>
</row>
</header>
<sections>
<section name="Results">
<description />
<parameters />
<header />
<content>
<row>
# A row-dependend number of fields exist before the Full Content field
<field name="Full Content"><![CDATA[ I am the text of interest]]></field>
# A row-dependend number of fields follow here
</row>
# There are 40,000 of these row elements
</content>
<footer>
<row type="content_count">
<field name="count"><![CDATA[9981]]></field>
</row>
</footer>
</section>
</sections>
</main>
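I'd like to extract the text of the Full Content field from each of the 40,000 rows and store the data in a dictionary.
I'm not sure how to traverse the XML tree using the field name (which seems to be the way to go, since the index of the child element changes from row to row).
At the moment, I store these XML files in a list that I create as follows: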
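from lxml import etree

files = get_files("P:\\Data\\files")  # get_files is the asker's own helper (not shown)
xmls = []
for file in files:
    # recover=True lets the parser continue past malformed markup
    parser = etree.XMLParser(ns_clean=True, recover=True)
    tree = etree.parse(file, parser=parser)
    root = tree.getroot()
    xmls.append(root)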
Thanks for any suggestions, Matthias
Answer 0 (score: 0)
A simple example of how to use cssselect() with lxml.html:
import lxml
import lxml.html
data = ''' <main>
<header>
<row type="info">
<field name="description"><![CDATA[Results]]></field>
<field name="created"><![CDATA[6/12/2014 6:45:00 PM]]></field>
</row>
<row>
<field name="profile"><![CDATA[Intel]]></field>
</row>
</header>
<sections>
<section name="Results">
<description />
<parameters />
<header />
<content>
<row>
# A row-dependend number of fields exist before the Full Content field
<field name="Full Content"><![CDATA[ I am the text of interest]]></field>
# A row-dependend number of fields follow here
</row>
# There are 40,000 of these row elements
</content>
<footer>
<row type="content_count">
<field name="count"><![CDATA[9981]]></field>
</row>
</footer>
</section>
</sections>
</main>'''
html = lxml.html.fromstring(data)
fields = html.cssselect('field')
for x in fields:
    print(lxml.etree.tostring(x, encoding="unicode"))
<field name="description"/>
<field name="created"/>
<field name="profile"/>
<field name="Full Content"/>
# A row-dependend number of fields follow here
<field name="count"/>
Edit:
A version that gets the text inside the CDATA sections:
import lxml
import lxml.etree
data = ''' ... same XML as in the previous example ... '''
et = lxml.etree.fromstring(data)
fields = et.xpath('//field')
for x in fields:
    print(x.text)
Results
6/12/2014 6:45:00 PM
Intel
I am the text of interest
9981
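Since the position of a field varies from row to row, it may help to key each field by its name attribute instead of its index. The following is only a sketch: the "Author" field and the row_to_dict helper are made up for illustration, and the fallback import assumes the standard library's ElementTree is acceptable when lxml is not installed (both support the calls used here).

```python
try:
    from lxml import etree
except ImportError:  # the calls below also exist in the standard library
    import xml.etree.ElementTree as etree

# A single hypothetical row; "Author" is an invented field name.
ROW_XML = """<row>
  <field name="Author"><![CDATA[someone]]></field>
  <field name="Full Content"><![CDATA[ I am the text of interest]]></field>
</row>"""


def row_to_dict(row):
    """Return {field name: field text} for the direct <field> children of a row."""
    return {field.get("name"): field.text for field in row.findall("field")}


row = etree.fromstring(ROW_XML)
record = row_to_dict(row)
print(record.get("Full Content"))  # text of the field, if present
print(record.get("Missing"))       # None when the row has no such field
```

With a dictionary per row, a missing field shows up as None from .get() rather than as an IndexError.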
Edit:
Find fields by name:
named_fields = et.xpath('//field[@name="count"]')
for x in named_fields:
    print(x.text)
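Putting the name lookup together with the original goal, here is a rough sketch of collecting the Full Content text of every content row into a dictionary and skipping rows that lack the field. The sample data is a cut-down stand-in for the real file, the "Id" and "Extra" fields and the extract_full_content helper are hypothetical, and keying the dictionary by row position is just one possible choice.

```python
try:
    from lxml import etree
except ImportError:  # the calls below also exist in the standard library
    import xml.etree.ElementTree as etree

# Cut-down sample in the shape of the question's file; "Id" and "Extra" are invented.
SAMPLE = """<main>
  <sections>
    <section name="Results">
      <content>
        <row>
          <field name="Id"><![CDATA[1]]></field>
          <field name="Full Content"><![CDATA[first text]]></field>
        </row>
        <row>
          <field name="Id"><![CDATA[2]]></field>
        </row>
        <row>
          <field name="Full Content"><![CDATA[third text]]></field>
          <field name="Extra"><![CDATA[x]]></field>
        </row>
      </content>
    </section>
  </sections>
</main>"""


def extract_full_content(root, field_name="Full Content"):
    """Map row position -> text of the named field, skipping rows without it."""
    result = {}
    for index, row in enumerate(root.findall(".//content/row")):
        field = row.find("field[@name='%s']" % field_name)
        if field is not None:  # only rows where the field exists
            result[index] = field.text
    return result


root = etree.fromstring(SAMPLE)
texts = extract_full_content(root)
print(texts)
```

The same function could be applied to each root stored in the xmls list from the question, for example to build one dictionary per file.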