I'm trying to use collections.Counter and lxml to count all the nouns and adjectives in an XML file, where the nouns and adjectives are tagged like this:
<word id="2" start="7400" end="7411" tag="NN">Ministerien</word>
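tag="NN" marks a noun. I want to pull out only those words and add them to the counter, but I'm having trouble doing so. I can currently extract all the words and count them, but I can't find a way in lxml to pick out only the words with certain tags.
Here is the relevant code so far: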
import collections
from lxml import etree

context = etree.iterparse(xmlFile)
counter = collections.Counter()
for action, elem in context:
    if elem.tag == "word":
        counter[elem.text] += 1
print(counter.most_common(10))
Answer 0 (score: 0)
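elem.attrib returns a dictionary of key: value pairs, where each key is the name of an attribute on that XML element and each value is that attribute's value.
You can use it to check whether the element's tag attribute marks a noun or an adjective.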
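As a quick illustration of what elem.attrib holds, here is a minimal sketch that parses the single element from the question. The fallback import of xml.etree.ElementTree and the "lemma" attribute name are assumptions added for illustration only; they are not part of the original question.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the same calls used here.
    import xml.etree.ElementTree as etree

# The sample element from the question.
elem = etree.fromstring(
    '<word id="2" start="7400" end="7411" tag="NN">Ministerien</word>'
)

attrs = dict(elem.attrib)            # all attributes as a plain dict
pos = elem.attrib.get("tag")         # value of the tag attribute
missing = elem.attrib.get("lemma")   # hypothetical attribute that is absent, so None

print(attrs)
print(pos, missing)
```

Using .get() rather than indexing means a word element without a tag attribute yields None instead of raising KeyError.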
Example:
import collections
from lxml import etree

context = etree.iterparse(xmlFile)
counter = collections.Counter()
for action, elem in context:
    # 'AD' is only a placeholder for the adjective tag; use whatever your data uses.
    if elem.tag == "word" and elem.attrib.get('tag') in ['NN', 'AD']:
        counter[elem.text] += 1
print(counter.most_common(10))
Example/demo:
I have an XML file like this:
<root>
<word id="2" start="7400" end="7411" tag="NN">Ministerien1</word>
<word id="2" start="7400" end="7411" tag="AD">Ministerien5</word>
<word id="2" start="7400" end="7411" tag="NN">Ministerien2</word>
<word id="2" start="7400" end="7411" tag="NN">Ministerien1</word>
<word id="2" start="7400" end="7411" tag="AD">Ministerien4</word>
<word id="2" start="7400" end="7411" tag="AD">Ministerien4</word>
<word id="2" start="7400" end="7411" tag="HF">Ministerien6</word>
<word id="2" start="7400" end="7411" tag="AD">Ministerien4</word>
<word id="2" start="7400" end="7411" tag="NN">Ministerien2</word>
<word id="2" start="7400" end="7411" tag="NN">Ministerien3</word>
</root>
Code:
In [18]: context = etree.iterparse('a.xml')

In [19]: counter = Counter()

In [20]: for action, elem in context:
   ....:     if elem.tag == "word" and (elem.attrib.get('tag') in ['NN','AD']):
   ....:         counter[elem.text] += 1

In [21]: counter
Out[21]: Counter({'Ministerien4': 3, 'Ministerien2': 2, 'Ministerien1': 2, 'Ministerien3': 1, 'Ministerien5': 1})
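The demo above can also be wrapped in a small function that runs without a file on disk. This is only a sketch: the function name count_words_by_pos, the in-memory SAMPLE document, the fallback import, and the elem.clear() call (added to release elements already processed on large files) are assumptions, not part of the original answer.

```python
import io
from collections import Counter

try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library's iterparse behaves the same for this usage.
    import xml.etree.ElementTree as etree

# In-memory copy of the demo file, so the example is self-contained.
SAMPLE = b"""<root>
<word id="2" start="7400" end="7411" tag="NN">Ministerien1</word>
<word id="2" start="7400" end="7411" tag="AD">Ministerien5</word>
<word id="2" start="7400" end="7411" tag="NN">Ministerien2</word>
<word id="2" start="7400" end="7411" tag="NN">Ministerien1</word>
<word id="2" start="7400" end="7411" tag="AD">Ministerien4</word>
<word id="2" start="7400" end="7411" tag="AD">Ministerien4</word>
<word id="2" start="7400" end="7411" tag="HF">Ministerien6</word>
<word id="2" start="7400" end="7411" tag="AD">Ministerien4</word>
<word id="2" start="7400" end="7411" tag="NN">Ministerien2</word>
<word id="2" start="7400" end="7411" tag="NN">Ministerien3</word>
</root>"""


def count_words_by_pos(source, pos_tags):
    """Count the text of <word> elements whose tag attribute is in pos_tags."""
    wanted = set(pos_tags)
    counter = Counter()
    # "end" events fire once an element is fully parsed, so elem.text is complete.
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == "word" and elem.get("tag") in wanted:
            counter[elem.text] += 1
        elem.clear()  # drop content we no longer need
    return counter


# 'AD' is the answer's placeholder for the adjective tag.
counts = count_words_by_pos(io.BytesIO(SAMPLE), {"NN", "AD"})
print(counts.most_common(10))
```

Passing a file path instead of the BytesIO object should work the same way, since iterparse accepts either.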