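Retrieving tags and sub-tag categories with unicode encoding using lxml

Asked: 2015-07-30 14:41:08

Tags: python xml unicode lxml

I am trying to use collections.Counter and lxml to count all nouns and adjectives in an XML file, where the noun and adjective tags are written like this: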

<word id="2" start="7400" end="7411" tag="NN">Ministerien</word>

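tag="NN" marks a noun. I want to pull out only those words and add them to the counter, but I am having trouble doing so. Right now I can pull out all the words and count them, but I can't find a way in lxml to get only the elements with certain tag values.

Here is the relevant code so far.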

import collections
from lxml import etree

# xmlFile is the path to the XML file
context = etree.iterparse(xmlFile)
counter = collections.Counter()
for action, elem in context:
    if elem.tag == "word":
        counter[elem.text] += 1
print(counter.most_common(10))

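1 Answer:

Answer 0: (score: 0)

elem.attrib returns a dict-like mapping of key:value pairs, where each key is the name of an attribute of that XML element and each value is that attribute's value.

You can use it to check whether the tag attribute marks a noun or an adjective.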
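To make the attribute lookup concrete, here is a minimal sketch that parses the single <word> element from the question and reads its tag attribute. The try/except import is an assumption added so the snippet also runs where lxml is not installed; it falls back to the standard library's xml.etree.ElementTree, which shares this part of the API.

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib module if lxml is missing
    import xml.etree.ElementTree as etree

# One <word> element in the shape shown in the question.
elem = etree.fromstring(
    '<word id="2" start="7400" end="7411" tag="NN">Ministerien</word>'
)

attributes = dict(elem.attrib)     # plain dict copy of the attribute mapping
pos_tag = elem.attrib.get("tag")   # value of the tag attribute, or None if absent
is_noun = pos_tag == "NN"

print(attributes)
print(pos_tag, is_noun)
```

Using .get() rather than elem.attrib["tag"] means elements without a tag attribute yield None instead of raising a KeyError.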

Example -

import collections
from lxml import etree

context = etree.iterparse(xmlFile)
counter = collections.Counter()
for action, elem in context:
    # 'AD' is only a placeholder for the adjective tag; use whatever is correct.
    if elem.tag == "word" and elem.attrib.get('tag') in ['NN', 'AD']:
        counter[elem.text] += 1
print(counter.most_common(10))
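For larger files and non-ASCII words, the same filter could be wrapped in a function roughly like the sketch below. The function name count_words, the wanted_tags parameter, the sample data, and the 'VV' tag are all hypothetical; 'AD' is still just a placeholder for the adjective tag. The sketch assumes Python 3, where element text comes back as str, so words such as "größer" work as Counter keys without extra decoding. The import falls back to xml.etree.ElementTree if lxml is unavailable.

```python
import collections
import io

try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib module if lxml is missing
    import xml.etree.ElementTree as etree


def count_words(source, wanted_tags=("NN", "AD")):
    """Count the text of <word> elements whose tag attribute is in wanted_tags."""
    wanted = set(wanted_tags)
    counter = collections.Counter()
    # "end" events fire once an element, including its text, is fully parsed.
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == "word":
            if elem.get("tag") in wanted and elem.text:
                counter[elem.text] += 1
            elem.clear()  # release the element's content to limit memory use
    return counter


# Hypothetical sample data, UTF-8 encoded, with one non-ASCII word.
sample = """<root>
<word id="1" tag="NN">Ministerien</word>
<word id="2" tag="AD">größer</word>
<word id="3" tag="NN">Ministerien</word>
<word id="4" tag="VV">sagen</word>
</root>""".encode("utf-8")

counts = count_words(io.BytesIO(sample))
print(counts.most_common(10))
```

The extra elem.text check skips empty <word> elements, which would otherwise be counted under the key None.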

Example/Demo -

I have an XML file like this -

<root>
<word id="2" start="7400" end="7411" tag="NN">Ministerien1</word>
<word id="2" start="7400" end="7411" tag="AD">Ministerien5</word>
<word id="2" start="7400" end="7411" tag="NN">Ministerien2</word>
<word id="2" start="7400" end="7411" tag="NN">Ministerien1</word>
<word id="2" start="7400" end="7411" tag="AD">Ministerien4</word>
<word id="2" start="7400" end="7411" tag="AD">Ministerien4</word>
<word id="2" start="7400" end="7411" tag="HF">Ministerien6</word>
<word id="2" start="7400" end="7411" tag="AD">Ministerien4</word>
<word id="2" start="7400" end="7411" tag="NN">Ministerien2</word>
<word id="2" start="7400" end="7411" tag="NN">Ministerien3</word>
</root>

Code -

In [18]: context = etree.iterparse('a.xml')

In [19]: counter = Counter()

In [20]: for action, elem in context:
   ....:     if elem.tag == "word" and (elem.attrib.get('tag') in ['NN','AD']):
   ....:         counter[elem.text] += 1

In [21]: counter
Out[21]: Counter({'Ministerien4': 3, 'Ministerien2': 2, 'Ministerien1': 2, 'Ministerien3': 1, 'Ministerien5': 1})
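If the counts are also needed per category, one possible variation is to keep a separate Counter for each value of the tag attribute. This is only a sketch: the per_tag and combined names are made up for illustration, the XML is the demo file from above held in memory, and the import falls back to xml.etree.ElementTree if lxml is not installed.

```python
import collections
import io

try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib module if lxml is missing
    import xml.etree.ElementTree as etree

# The demo file from above, held in memory instead of a.xml.
demo = b"""<root>
<word id="2" start="7400" end="7411" tag="NN">Ministerien1</word>
<word id="2" start="7400" end="7411" tag="AD">Ministerien5</word>
<word id="2" start="7400" end="7411" tag="NN">Ministerien2</word>
<word id="2" start="7400" end="7411" tag="NN">Ministerien1</word>
<word id="2" start="7400" end="7411" tag="AD">Ministerien4</word>
<word id="2" start="7400" end="7411" tag="AD">Ministerien4</word>
<word id="2" start="7400" end="7411" tag="HF">Ministerien6</word>
<word id="2" start="7400" end="7411" tag="AD">Ministerien4</word>
<word id="2" start="7400" end="7411" tag="NN">Ministerien2</word>
<word id="2" start="7400" end="7411" tag="NN">Ministerien3</word>
</root>"""

# One Counter per tag value, created on first use.
per_tag = collections.defaultdict(collections.Counter)
for _event, elem in etree.iterparse(io.BytesIO(demo), events=("end",)):
    if elem.tag == "word":
        per_tag[elem.get("tag")][elem.text] += 1

# Adding Counters merges them, which gives the nouns-plus-adjectives total.
combined = per_tag["NN"] + per_tag["AD"]

print(per_tag["NN"].most_common())
print(per_tag["AD"].most_common())
print(combined)
```

Here combined holds the same counts as the Counter in Out[21], while per_tag["NN"] and per_tag["AD"] keep the nouns and adjectives apart.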