Python: Appending values to existing dictionary entries by searching an XML tree

Date: 2015-07-28 15:56:43

Tags: python xml python-2.7 dictionary xml-parsing

I have a Python dictionary in which each key is a German word and each value is a list of grammatical information (suffixes and occurrence counts), e.g.:

example_dict = {
                'Abend': ['@Ø@', '5866@', '@s@', '5@'],
                'Spieler': ['@Ø@', '1075@'],
                'Schlacht': ['@en@', '336@', '@Ø@', '5275@']
               }

Each list can contain any number of items, though they always follow the pattern:

['@suffix@', 'count@', ...]

I also have a lexicon in XML format (which would have to be read into Python) that contains information about the words, including inflectional class:

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="lexicon-transform.xslt"?>
    <smor>
        <BaseStem>
            <Lemma>Abend</Lemma>
            <Stem>Abend</Stem>
            <Pos>NN</Pos>
            <Origin>nativ</Origin>
            <InfClass>NMasc_s_e</InfClass>
        </BaseStem>
        <BaseStem>
            <Lemma>Abend</Lemma>
            <Stem>Abend</Stem>
            <Pos>NPROP</Pos>
            <Origin>nativ</Origin>
            <InfClass>FamName_s</InfClass>
        </BaseStem>
        <BaseStem>
            <Lemma>Abendschule</Lemma>
            <Stem>Abendschule</Stem>
            <Pos>NN</Pos>
            <Origin>nativ</Origin>
            <InfClass>NFem_0_n</InfClass>
        </BaseStem>
        <BaseStem>
            <Lemma>3D</Lemma>
            <Stem>3D</Stem>
            <Pos>ABBR</Pos>
            <Origin>nativ</Origin>
            <InfClass>Abk_ADV</InfClass>
        </BaseStem>
    </smor>

What I would like to do is match my dictionary keys to their corresponding entries in the XML lexicon (if an entry in the XML exists), which are indicated by the lemma tags:

<Lemma>Word</Lemma>

And if the matching word is also a noun, as indicated by the part-of-speech tag:

<Pos>NN</Pos>

Extract the inflectional class of that noun and append it to the appropriate list in the dictionary values (preferably removing any dictionary entry that does not have a match in the XML at the same time), e.g.:

new_dict = {..., 'Abend': ['@Ø@', '5866@', '@s@', '5@', 'NMasc_s_e'], ...}

From the examples above, "Spieler" and "Schlacht" would be removed from new_dict as "Abend" is the only noun with a match in both example_dict and the XML.

I know the problem as I've described calls for some for-loops, but I lack the experience with XML in general, and with the associated Python libraries for XML, to approach this intelligently; so I appreciate any help.

3 answers:

Answer 0 (score: 1)

I'm not quite sure I understand your XML document, but if you're just looking to find all instances of the <Lemma> elements, you can do this:

# Let's assume the document string is in docstring

import xml.etree.ElementTree as ET
docxml = ET.fromstring(docstring)
# <smor> is the root, so the path is relative to it
for node in docxml.findall("BaseStem/Lemma"):
    print(node.text)
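
To go one step beyond printing lemmas, the loop can walk the <BaseStem> elements and keep only the noun entries. The following is a minimal sketch, assuming a shortened copy of the question's lexicon as input; the name noun_classes is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed sample input: a shortened copy of the lexicon from the question.
docstring = """<smor>
  <BaseStem><Lemma>Abend</Lemma><Pos>NN</Pos><InfClass>NMasc_s_e</InfClass></BaseStem>
  <BaseStem><Lemma>Abend</Lemma><Pos>NPROP</Pos><InfClass>FamName_s</InfClass></BaseStem>
  <BaseStem><Lemma>3D</Lemma><Pos>ABBR</Pos><InfClass>Abk_ADV</InfClass></BaseStem>
</smor>"""

docxml = ET.fromstring(docstring)

# Collect (lemma, inflectional class) pairs for entries tagged as nouns.
noun_classes = []
for stem in docxml.findall("BaseStem"):
    if stem.findtext("Pos") == "NN":
        noun_classes.append((stem.findtext("Lemma"), stem.findtext("InfClass")))

print(noun_classes)
```

Only the first "Abend" entry survives the filter, because the second one is tagged NPROP rather than NN.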

Answer 1 (score: 1)

For an XML library, you can try lxml.etree: http://lxml.de/tutorial.html

First, you need to create an XML root element from the given string/file:

from lxml import etree
tree = etree.fromstring(xml_str)

To find multiple nodes you can use

base_stems = tree.findall('BaseStem')

To read a child node's text from one of those elements (here base_stem is a single item of base_stems), you can use

lemma = base_stem.findtext('Lemma')

To check whether a key exists in the dict (get returns None when the key is missing)

example_dict.get(lemma)

Hope this will help you to implement what you want
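
Put together, the steps above might look like the sketch below. It is an illustration under assumptions, not a tested solution for the full lexicon: it uses the standard library's xml.etree.ElementTree (lxml.etree offers the same fromstring, findall and findtext calls), and xml_str and example_dict are shortened copies of the question's data.

```python
import xml.etree.ElementTree as etree  # lxml.etree offers the same calls used here

# Assumed sample data, shortened from the question.
xml_str = """<smor>
  <BaseStem><Lemma>Abend</Lemma><Pos>NN</Pos><InfClass>NMasc_s_e</InfClass></BaseStem>
  <BaseStem><Lemma>Abend</Lemma><Pos>NPROP</Pos><InfClass>FamName_s</InfClass></BaseStem>
  <BaseStem><Lemma>Abendschule</Lemma><Pos>NN</Pos><InfClass>NFem_0_n</InfClass></BaseStem>
</smor>"""

example_dict = {
    "Abend": ["@Ø@", "5866@", "@s@", "5@"],
    "Spieler": ["@Ø@", "1075@"],
}

tree = etree.fromstring(xml_str)

# Build a new dict so that entries without a noun match are dropped.
new_dict = {}
for base_stem in tree.findall("BaseStem"):
    lemma = base_stem.findtext("Lemma")
    if base_stem.findtext("Pos") == "NN" and lemma in example_dict:
        new_dict[lemma] = example_dict[lemma] + [base_stem.findtext("InfClass")]

print(new_dict)
```

Building new_dict instead of modifying example_dict in place avoids deleting keys while iterating. If a lemma has more than one NN entry in the lexicon, this sketch keeps the last one it sees.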

Answer 2 (score: 1)

from lxml import etree

xml_dict = etree.parse('/path/to/xml_dict_path.xml')

for lemma, properties in example_dict.items():
    # xpath() returns a list of the matching <InfClass> elements
    inf_class = xml_dict.xpath("//BaseStem[Lemma = '%s' and Pos = 'NN']/InfClass" % lemma)
    if inf_class:
        properties.append(inf_class[0].text)

You can cache the result of xml_dict.xpath("//BaseStem[Pos = 'NN']"), with Lemma as key and InfClass as value, if the repeated XPath lookups in the loop turn out to be slow on your amount of data.
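
One way such a cache could look is sketched below: a single pass over the lexicon builds a lemma-to-InfClass lookup, and the dictionary is then filtered against it. This is a sketch under assumptions. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies, and xml_text, noun_classes and the sample data are illustrative names and values rather than anything from the original answer.

```python
import xml.etree.ElementTree as ET

# Assumed sample data, shortened from the question.
xml_text = """<smor>
  <BaseStem><Lemma>Abend</Lemma><Pos>NN</Pos><InfClass>NMasc_s_e</InfClass></BaseStem>
  <BaseStem><Lemma>Abend</Lemma><Pos>NPROP</Pos><InfClass>FamName_s</InfClass></BaseStem>
  <BaseStem><Lemma>Abendschule</Lemma><Pos>NN</Pos><InfClass>NFem_0_n</InfClass></BaseStem>
</smor>"""

example_dict = {
    "Abend": ["@Ø@", "5866@", "@s@", "5@"],
    "Spieler": ["@Ø@", "1075@"],
    "Schlacht": ["@en@", "336@", "@Ø@", "5275@"],
}

root = ET.fromstring(xml_text)

# One pass over the lexicon: lemma -> inflectional class, nouns only.
# setdefault keeps the first class seen if a lemma occurs more than once.
noun_classes = {}
for stem in root.iter("BaseStem"):
    if stem.findtext("Pos") == "NN":
        noun_classes.setdefault(stem.findtext("Lemma"), stem.findtext("InfClass"))

# Keep only the entries that have a noun match, with the class appended.
new_dict = {
    lemma: props + [noun_classes[lemma]]
    for lemma, props in example_dict.items()
    if lemma in noun_classes
}

print(new_dict)
```

Each dictionary key then costs a single hash lookup instead of a search through the whole tree, which matters mostly when both the lexicon and the dictionary are large.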