Merging words in XML

Time: 2014-11-09 09:33:29

Tags: python xml python-2.7 lxml findall

In the following XML:

<w:body>
    <w:p w:rsidR="00912B30" w:rsidRPr="00912B30" w:rsidRDefault="00912B30" w:rsidP="00912B30">
        <w:pPr>
            <w:autoSpaceDE w:val="0"/>
            <w:autoSpaceDN w:val="0"/>
            <w:rPr>
                <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
                <w:snapToGrid w:val="0"/>
                <w:kern w:val="0"/>
                <w:szCs w:val="21"/>
            </w:rPr>
        </w:pPr>
        <w:r w:rsidRPr="00912B30">
            <w:rPr>
                <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
                <w:snapToGrid w:val="0"/>
                <w:kern w:val="0"/>
                <w:szCs w:val="21"/>
            </w:rPr>
            <w:t xml:space="preserve">Considering those situations, after 1970 The </w:t>
        </w:r>
        <w:r w:rsidRPr="00E155EC">
            <w:rPr>
                <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
                <w:strike/>
                <w:snapToGrid w:val="0"/>
                <w:kern w:val="0"/>
                <w:szCs w:val="21"/>
            </w:rPr>
            <w:t>Agricultural Land Law</w:t>
        </w:r>
        <w:r w:rsidRPr="00912B30">
            <w:rPr>
                <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
                <w:snapToGrid w:val="0"/>
                <w:kern w:val="0"/>
                <w:szCs w:val="21"/>
            </w:rPr>
            <w:t xml:space="preserve"> of 1952 was modified and changed the principle to permit renting and lending agricultural land. The way of thinking was as follows. If it was difficult to widen farmers’ size by buying agricultural land, expanding the size by renting would be possible. After that some positive framework to promote renting and lending agricultural land. For example, The </w:t>
        </w:r>
        <w:r w:rsidRPr="00E155EC">
            <w:rPr>
                <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
                <w:strike/>
                <w:snapToGrid w:val="0"/>
                <w:kern w:val="0"/>
                <w:szCs w:val="21"/>
            </w:rPr>
            <w:t>Agricultural Land Use Promotion Project</w:t>
        </w:r>
        <w:r w:rsidRPr="00912B30">
            <w:rPr>
                <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
                <w:snapToGrid w:val="0"/>
                <w:kern w:val="0"/>
                <w:szCs w:val="21"/>
            </w:rPr>
            <w:t xml:space="preserve"> had started in 1975 and The </w:t>
        </w:r>
        <w:r w:rsidRPr="00E155EC">
            <w:rPr>
                <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
                <w:strike/>
                <w:snapToGrid w:val="0"/>
                <w:kern w:val="0"/>
                <w:szCs w:val="21"/>
            </w:rPr>
            <w:t>Agricultural Land Use Promotion Law</w:t>
        </w:r>
        <w:r w:rsidRPr="00912B30">
            <w:rPr>
                <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
                <w:snapToGrid w:val="0"/>
                <w:kern w:val="0"/>
                <w:szCs w:val="21"/>
            </w:rPr>
            <w:t xml:space="preserve"> was established in 1980. Actually after that, area of agricultural land by transfer of ownership of owned agricultural land with compensation had been more than the area by transfer of rights for </w:t>
        </w:r>
        <w:r w:rsidRPr="00912B30">
            <w:rPr>
                <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
                <w:snapToGrid w:val="0"/>
                <w:kern w:val="0"/>
                <w:szCs w:val="21"/>
            </w:rPr>
            <w:lastRenderedPageBreak/>
            <w:t>lease.</w:t>
        </w:r>
    </w:p>
</w:body>  

I need to extract all the text marked with <w:strike>.

w = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

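The problem is that the struck-through words are not contiguous; they appear at arbitrary positions. When I extract and join them, the last word of one strike instance gets merged with the first word of the next strike instance.

My approach: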

from lxml import etree

w1 = '{%s}' % w  # namespace in Clark notation (assumed; w1 is not defined in the original post)
text = ""  # initialize empty string where all words will be stored
source = etree.parse(doc_xml)
for p in source.findall('.//' + w1 + 'p'):  # iterate over every p tag
    text += " "  # add a space to separate words in successive paragraphs
    for b in p.findall('.//{%(ns)s}strike/../..//{%(ns)s}t' % {'ns': w}):
        text += ''.join(b.text)  # append the text of each struck run

Output:

text =" Agricultural Land LawAgricultural Land Use Promotion ProjectAgricultural Land Use Promotion Law"

Expected output:

text = " Agricultural Land Law Agricultural Land Use Promotion Project Agricultural Land Use Promotion Law" 

Crude fix: replace the last line of code with:

text+=" " +''.join(b.text)

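This solves the problem above, but in many cases a single word is split across two strike instances, so the crude fix may output "he lp" instead of "help". That makes it a bit tricky. My idea is:
1. Extract the strike text.
2. Check the next text tag. If it has no strike tag, add a space to the text; if it does have a strike tag, join it directly.

Here is an example of a word that is split across different instances: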

<w:r w:rsidRPr:00C42D65>
   <w:rpr>
     <w:strike>
   <w:t>(IQR
<w:r w:rsidRPr:00C42D65>
  <w:rpr>
     <w:strike>
  <w:t>)

The closing parenthesis belongs with (IQR, but with the crude approach it becomes
(IQR )
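
The two steps described above can be sketched as follows. This is only a minimal illustration, not the asker's or the answerer's code: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies, and the names struck_text and SAMPLE, as well as the sample paragraph itself, are made up for the example. It also assumes that any <w:strike> element inside a run's <w:rPr> means the run is struck through (a w:val="0" attribute is not checked).

```python
import xml.etree.ElementTree as ET

W = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
NS = {'w': W}


def struck_text(paragraph):
    """Join the text of struck-through runs in one paragraph.

    Adjacent struck runs are merged directly; groups separated by at
    least one non-struck run are joined with a single space.
    """
    groups = []   # one string per contiguous group of struck runs
    current = []  # text pieces of the group being built
    for run in paragraph.findall('.//w:r', NS):
        if run.find('w:rPr/w:strike', NS) is not None:
            current.extend(t.text or '' for t in run.findall('w:t', NS))
        elif current:
            # A non-struck run closes the current group.
            groups.append(''.join(current))
            current = []
    if current:
        groups.append(''.join(current))
    return ' '.join(groups)


# Hypothetical sample: "(IQR" + ")" and "he" + "lp" are adjacent struck runs.
SAMPLE = (
    '<w:p xmlns:w="%s">'
    '<w:r><w:rPr><w:strike/></w:rPr><w:t>(IQR</w:t></w:r>'
    '<w:r><w:rPr><w:strike/></w:rPr><w:t>)</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> and </w:t></w:r>'
    '<w:r><w:rPr><w:strike/></w:rPr><w:t>he</w:t></w:r>'
    '<w:r><w:rPr><w:strike/></w:rPr><w:t>lp</w:t></w:r>'
    '</w:p>'
) % W

print(struck_text(ET.fromstring(SAMPLE)))  # prints: (IQR) help
```

Grouping the runs first and joining the groups afterwards avoids leading, trailing, or doubled spaces. One limitation of this sketch is that any non-struck run, even one without text, ends the current group.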

Update

Here is a new function I tried, but I think my XPath syntax is incorrect:

text=" "
for p in source.xpath('.//w:p.//w:r',namespaces={'w': w}): #iterate over each run instance
    for q in p.xpath('.//w:t',namespaces={'w': w}):  #check for text
        if q.xpath('/..//w:strike',namespaces={'w': w}):     #if it has strike tag
           text+=''.join(q.text)    #concatenate text
        else:
           text += " "         #else add a space if text has no strike tag  

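It looks like the problem is in the XPath expression of the if statement.

1 Answer:

Answer 0: (score: 1)

How about using xpath with namespaces, and joining the result (a list of strings) with ' '.join(..)?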

...

source = etree.parse(doc_xml)
text = ' '.join(
    source.xpath('.//w:p//w:strike/../..//w:t/text()', namespaces={'w': w})
)

UPDATE

text = ''
for t in source.xpath('.//w:p//w:r//w:t',namespaces={'w': w}):
    if t.xpath('..//w:strike',namespaces={'w': w}):
        text += t.text
    else:
        if text:  # To prevent space before the first text.
            text += ' '
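
The loop above appends one space for every non-struck run, so several consecutive non-struck runs (the sample paragraph ends with two of them) would leave repeated or trailing spaces in text. One possible way to tidy the result afterwards is sketched below. The helper name collapse_spaces and the raw string are made up for illustration, and the sketch assumes that whitespace inside the struck text itself does not need to be preserved exactly.

```python
def collapse_spaces(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return ' '.join(text.split())


# Hypothetical loop result with doubled, tripled, and trailing spaces.
raw = ('Agricultural Land Law  Agricultural Land Use Promotion Project'
       '   Agricultural Land Use Promotion Law  ')
print(collapse_spaces(raw))
```

Calling str.split() with no arguments splits on any run of whitespace and drops empty pieces, which is why no separate strip() call is needed.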