I have a Python dictionary in which each key is a German word and each value is a list of grammatical information (suffixes and occurrence counts), e.g.:
example_dict = {
    'Abend': ['@Ø@', '5866@', '@s@', '5@'],
    'Spieler': ['@Ø@', '1075@'],
    'Schlacht': ['@en@', '336@', '@Ø@', '5275@']
}
The lists can contain any number of items, though they always follow the pattern:
['@suffix@', 'count@', ...]
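To make the pattern concrete, here is a minimal sketch that pairs each suffix with its count. The helper name parse_entries is hypothetical, and it assumes the list strictly alternates suffix and count as described above.

```python
def parse_entries(values):
    """Pair each '@suffix@' item with the 'count@' item that follows it."""
    # Even positions hold suffixes, odd positions hold counts.
    suffixes = values[0::2]
    counts = values[1::2]
    # Strip the '@' delimiters and convert the counts to integers.
    return [(s.strip('@'), int(c.strip('@'))) for s, c in zip(suffixes, counts)]


print(parse_entries(['@Ø@', '5866@', '@s@', '5@']))
# prints [('Ø', 5866), ('s', 5)]
```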
I also have a lexicon in XML format (which would have to be read into Python) that contains information about the words, including inflectional class:
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="lexicon-transform.xslt"?>
<smor>
<BaseStem>
<Lemma>Abend</Lemma>
<Stem>Abend</Stem>
<Pos>NN</Pos>
<Origin>nativ</Origin>
<InfClass>NMasc_s_e</InfClass>
</BaseStem>
<BaseStem>
<Lemma>Abend</Lemma>
<Stem>Abend</Stem>
<Pos>NPROP</Pos>
<Origin>nativ</Origin>
<InfClass>FamName_s</InfClass>
</BaseStem>
<BaseStem>
<Lemma>Abendschule</Lemma>
<Stem>Abendschule</Stem>
<Pos>NN</Pos>
<Origin>nativ</Origin>
<InfClass>NFem_0_n</InfClass>
</BaseStem>
<BaseStem>
<Lemma>3D</Lemma>
<Stem>3D</Stem>
<Pos>ABBR</Pos>
<Origin>nativ</Origin>
<InfClass>Abk_ADV</InfClass>
</BaseStem>
</smor>
What I would like to do is match my dictionary keys to their corresponding entries in the XML lexicon (if an entry in the XML exists), which are indicated by the lemma tags:
<Lemma>Word</Lemma>
And if the matching word is also a noun, as indicated by the part-of-speech tag:
<Pos>NN</Pos>
Extract the inflectional class of that noun and append it to the appropriate list in the dictionary values (preferably removing any dictionary entry that does not have a match in the XML at the same time), e.g.:
new_dict = {..., 'Abend': ['@Ø@', '5866@', '@s@', '5@', 'NMasc_s_e'], ...}
From the examples above, "Spieler" and "Schlacht" would be removed from new_dict as "Abend" is the only noun with a match in both example_dict and the XML.
I know the problem as I've described it calls for some for-loops, but I lack the experience with XML in general, and with the associated Python libraries for XML, to approach this intelligently, so I appreciate any help.
Answer 0 (score: 1)
I'm not quite sure I understand your XML document, but if you're just looking to find all instances of the <Lemma> elements, you can do this:
# Let's assume the document string is in docstring
import xml.etree.ElementTree as ET
docxml = ET.fromstring(docstring)
for node in docxml.findall("BaseStem/Lemma"):
    print(node.text)
Answer 1 (score: 1)
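For an XML library, you can try lxml.etree:
http://lxml.de/tutorial.html
First, create an XML root element from the given string or file (pass bytes rather than str if the document starts with an encoding declaration):
from lxml import etree
tree = etree.fromstring(xml_str)
To find multiple nodes, use findall (the tag is BaseStem, as in your lexicon):
base_stems = tree.findall('BaseStem')
To read the text of a child node, use findtext on each of those elements:
lemma = base_stem.findtext('Lemma')
To check whether a key exists in the dict (get returns None if it is missing):
example_dict.get(lemma)
I hope this helps you implement what you want.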
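Putting those steps together could look roughly like the sketch below. It uses the standard library's xml.etree.ElementTree, which also provides findall and findtext, so it runs without lxml installed. The xml_str value is a shortened stand-in for the lexicon from the question, and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the lexicon from the question.
xml_str = """<smor>
<BaseStem><Lemma>Abend</Lemma><Pos>NN</Pos><InfClass>NMasc_s_e</InfClass></BaseStem>
<BaseStem><Lemma>Abend</Lemma><Pos>NPROP</Pos><InfClass>FamName_s</InfClass></BaseStem>
<BaseStem><Lemma>Abendschule</Lemma><Pos>NN</Pos><InfClass>NFem_0_n</InfClass></BaseStem>
</smor>"""

example_dict = {
    'Abend': ['@Ø@', '5866@', '@s@', '5@'],
    'Spieler': ['@Ø@', '1075@'],
}

tree = ET.fromstring(xml_str)
new_dict = {}
for base_stem in tree.findall('BaseStem'):
    lemma = base_stem.findtext('Lemma')
    # Keep only noun entries whose lemma is also a key of example_dict.
    if base_stem.findtext('Pos') == 'NN' and lemma in example_dict:
        new_dict[lemma] = example_dict[lemma] + [base_stem.findtext('InfClass')]

print(new_dict)
# prints {'Abend': ['@Ø@', '5866@', '@s@', '5@', 'NMasc_s_e']}
```

Words without a noun entry in the XML never make it into new_dict, which covers the "remove unmatched entries" part. If a lemma has several NN entries, this sketch keeps the last one it sees.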
Answer 2 (score: 1)
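from lxml import etree
xml_dict = etree.parse('/path/to/xml_dict_path.xml')
for lemma, properties in example_dict.items():
    # xpath() evaluates full XPath predicates and returns a list of matching elements
    inf_class = xml_dict.xpath("//BaseStem[Lemma = '%s' and Pos = 'NN']/InfClass" % lemma)
    if inf_class:
        # append the text of the first matching InfClass element
        properties.append(inf_class[0].text)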
If the repeated XPath lookups in the loop turn out to be slow on your amount of data, you can cache the result of xml_dict.xpath("//BaseStem[Pos = 'NN']") in a dict, with Lemma as key and InfClass as value.
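A minimal sketch of that caching idea follows, written with the standard library's xml.etree.ElementTree instead of lxml so that it is self-contained. One pass over the lexicon builds a plain dict from noun lemma to inflectional class, and every word is then a single dictionary lookup. The names build_noun_index, add_inf_classes and lexicon_xml are made up for this example, and the XML is a shortened stand-in for the real lexicon.

```python
import xml.etree.ElementTree as ET


def build_noun_index(root):
    """Map each noun lemma to its InfClass (the first entry found wins)."""
    index = {}
    for stem in root.iter('BaseStem'):
        if stem.findtext('Pos') == 'NN':
            index.setdefault(stem.findtext('Lemma'), stem.findtext('InfClass'))
    return index


def add_inf_classes(words, index):
    """Return a new dict holding only the words that appear in the index."""
    return {w: props + [index[w]] for w, props in words.items() if w in index}


# Shortened stand-in for the lexicon from the question.
lexicon_xml = """<smor>
<BaseStem><Lemma>Abend</Lemma><Pos>NN</Pos><InfClass>NMasc_s_e</InfClass></BaseStem>
<BaseStem><Lemma>Abend</Lemma><Pos>NPROP</Pos><InfClass>FamName_s</InfClass></BaseStem>
<BaseStem><Lemma>Abendschule</Lemma><Pos>NN</Pos><InfClass>NFem_0_n</InfClass></BaseStem>
</smor>"""

example_dict = {
    'Abend': ['@Ø@', '5866@', '@s@', '5@'],
    'Spieler': ['@Ø@', '1075@'],
    'Schlacht': ['@en@', '336@', '@Ø@', '5275@'],
}

noun_classes = build_noun_index(ET.fromstring(lexicon_xml))
new_dict = add_inf_classes(example_dict, noun_classes)
print(new_dict)
# prints {'Abend': ['@Ø@', '5866@', '@s@', '5@', 'NMasc_s_e']}
```

To read the lexicon from a file rather than a string, ET.parse(path).getroot() yields the same kind of root element.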