Question

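I suspect this is a trivial problem. I have spent several days looking at similar topics and questions on this and other sites and reading the lxml documentation, but I still cannot work out what I am doing wrong, so some help would be much appreciated.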

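I have a large number of XML files from which I only want to extract the text and write it to a .txt file. That part works fine. Each XML file also contains tags holding various information about the document: where it was taken from, the number of words, and so on. I do not need this information and do not want it in the final .txt file. In other words, I want all the text except the text inside any of the tags on the first line of each XML file. The problem is that when I try to use the strip_elements function, which I understand to be the solution, I get a TypeError saying "Type 'NoneType' cannot be serialized".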

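Here is a sample from one of the XML files (the first ten lines). It has been anonymized for copyright reasons, but the structure is unchanged: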

<bncDoc xml:id="K4E"><teiHeader><fileDesc><titleStmt><title>  This is the document title </title><respStmt><resp> Data capture and transcription </resp><name> University Press </name> </respStmt></titleStmt><editionStmt><edition>LNC XML Edition, December 2000</edition></editionStmt><extent> 21165 tokens; 21456 w-units; 1265 s-units </extent><publicationStmt><distributor>This is the legal terms of use</distributor><availability> This material is protected by international copyright laws and may not be copied or redistributed in any way. </availability><idno type="bnc">K4E</idno><idno type="old"> LgWldA </idno></publicationStmt><sourceDesc><bibl><title>Daily Post and Echo: Foreign news pages.</title> <imprint><publisher>u.p.</publisher> </imprint> </bibl></sourceDesc></fileDesc><encodingDesc><tagsDecl><namespace name=""><tagUsage gi="c" occurs="2618"/><tagUsage gi="div" occurs="129"/><tagUsage gi="gap" occurs="6"/><tagUsage gi="head" occurs="165"/><tagUsage gi="mw" occurs="106"/><tagUsage gi="p" occurs="923"/><tagUsage gi="s" occurs="1265"/><tagUsage gi="w" occurs="21456"/></namespace></tagsDecl></encodingDesc><profileDesc><creation date="0000">0000-00-00 Origination/creation date not known </creation><textClass><catRef targets="WRI ALLTIM3 ALLAVA0 ALLTYP3 WRIAAG0 WRIAD0 WRIASE0 WRIATY2 WRIAUD4 WRIDOM5 WRILEV2 WRIMED2 WRIPP0 WRISAM0 WRISTA0 WRITAS3"/><classCode scheme="DLEE">W newsp other: report</classCode><keywords><term> (none) </term></keywords></textClass></profileDesc><revisionDesc><change date="2006-10-21" who="#OUCS">Tag usage updated for LNC-XML</change><change date="2000-12-13" who="#OUCS">Last check for LNC World first release</change><change date="2000-09-01" who="#OUCS">Check all tagcounts</change><change date="2000-06-23" who="#OUCS">Resequenced s-units and added headers</change><change date="2000-01-21" who="#OUCS">Added date info</change><change date="2000-01-09" who="#OUCS">Updated all catrefs</change><change date="2000-01-08" who="#OUCS">Updated 
source title</change><change date="2000-01-08" who="#OUCS">Updated titles</change><change date="1999-12-25" who="#OUCS">corrected tagUsage</change><change date="1999-09-16" who="#UCREL">POS codes revised for LNC-2; header updated</change><change date="1994-11-26" who="#dominic">Initial accession to corpus</change></revisionDesc></teiHeader>
<wtext type="NEWS"><div level="1"><head type="MAIN">
<s n="1"><w c5="AJ0" hw="new" pos="ADJ">New </w><w c5="NN1" hw="plea" pos="SUBST">plea </w><w c5="PRP" hw="for" pos="PREP">for </w><w c5="NN2" hw="redundancy" pos="SUBST">redundancies</w></s></head><head type="BYLINE">
<s n="2"><w c5="PRP" hw="by" pos="PREP">By </w><w c5="NP0" hw="daily" pos="SUBST">Daily </w><w c5="NN1" hw="post" pos="SUBST">Post </w><w c5="NN1" hw="correspondent" pos="SUBST">Correspondent</w></s></head><p>
<s n="3"><w c5="AJ0-NN1" hw="cash-starved" pos="ADJ">CASH-starved </w><w c5="NP0" hw="clwyd" pos="SUBST">Clwyd </w><w c5="NN1" hw="county" pos="SUBST">county </w><w c5="NN1" hw="council" pos="SUBST">council </w><w c5="VHZ" hw="have" pos="VERB">has </w><w c5="VVN" hw="make" pos="VERB">made </w><w c5="AT0" hw="a" pos="ART">a </w><w c5="AJ0" hw="fresh" pos="ADJ">fresh </w><w c5="NN1" hw="appeal" pos="SUBST">appeal </w><w c5="PRP" hw="to" pos="PREP">to </w><w c5="DPS" hw="it" pos="PRON">its </w><w c5="NN0" hw="staff" pos="SUBST">staff </w><w c5="TO0" hw="to" pos="PREP">to </w><w c5="VVI" hw="volunteer" pos="VERB">volunteer </w><w c5="PRP" hw="for" pos="PREP">for </w><w c5="NN1" hw="redundancy" pos="SUBST">redundancy </w><w c5="CJC" hw="or" pos="CONJ">or </w><w c5="VVB" hw="take" pos="VERB">take </w><w c5="AJ0" hw="early" pos="ADJ">early </w><w c5="NN1" hw="retirement" pos="SUBST">retirement</w><c c5="PUN">.</c></s></p><p>
<s n="4"><w c5="PNP" hw="it" pos="PRON">It </w><w c5="VHZ" hw="have" pos="VERB">has </w><w c5="VVN" hw="give" pos="VERB">given </w><w c5="AT0" hw="a" pos="ART">a </w><w c5="NN1" hw="pledge" pos="SUBST">pledge </w><w c5="CJT-DT0" hw="that" pos="CONJ">that </w><w c5="AJ0" hw="compulsory" pos="ADJ">compulsory </w><w c5="NN2" hw="redundancy" pos="SUBST">redundancies </w><w c5="VM0" hw="will" pos="VERB">will </w><w c5="XX0" hw="not" pos="ADV">not </w><w c5="VBI" hw="be" pos="VERB">be </w><w c5="VVN" hw="introduce" pos="VERB">introduced </w><mw c5="PRP"><w c5="AVP" hw="up" pos="ADV">up </w><w c5="PRP" hw="to" pos="PREP">to </w></mw><w c5="NP0" hw="august" pos="SUBST">August </w><w c5="CRD" hw="31" pos="ADJ">31 </w><w c5="PRP" hw="in" pos="PREP">in </w><w c5="AT0" hw="the" pos="ART">the </w><w c5="NN1" hw="hope" pos="SUBST">hope </w><w c5="CJT" hw="that" pos="CONJ">that </w><w c5="DT0" hw="enough" pos="ADJ">enough </w><w c5="NN2" hw="volunteer" pos="SUBST">volunteers </w><w c5="VVB" hw="come" pos="VERB">come </w><w c5="AV0" hw="forward" pos="ADV">forward</w><c c5="PUN">.</c></s></p><p>
<s n="5"><w c5="CJC" hw="and" pos="CONJ">And </w><w c5="PNP" hw="it" pos="PRON">it </w><w c5="VVZ" hw="say" pos="VERB">says </w><w c5="CJT" hw="that" pos="CONJ">that </w><w c5="AJ0" hw="compulsory" pos="ADJ">compulsory </w><w c5="NN1" hw="redundancy" pos="SUBST">redundancy </w><w c5="VM0" hw="will" pos="VERB">will </w><w c5="AV0" hw="only" pos="ADV">only </w><w c5="VBI" hw="be" pos="VERB">be </w><w c5="VVN" hw="use" pos="VERB">used </w><w c5="PRP" hw="as" pos="PREP">as </w><w c5="AT0" hw="a" pos="ART">a </w><w c5="ORD" hw="last" pos="ADJ">last </w><w c5="NN1" hw="resort" pos="SUBST">resort</w><c c5="PUN">.</c></s></p><p>
<s n="6"><w c5="AT0" hw="the" pos="ART">The </w><w c5="NN1" hw="county" pos="SUBST">county </w><w c5="NN1" hw="council" pos="SUBST">council </w><w c5="VBZ" hw="be" pos="VERB">is </w><w c5="VVG" hw="face" pos="VERB">facing </w><w c5="AT0" hw="a" pos="ART">a </w><w c5="AJ0" hw="serious" pos="ADJ">serious </w><w c5="NN1" hw="cash" pos="SUBST">cash </w><w c5="NN1" hw="crisis" pos="SUBST">crisis </w><w c5="CJC" hw="and" pos="CONJ">and </w><w c5="VVZ-NN2" hw="need" pos="VERB">needs </w><w c5="TO0" hw="to" pos="PREP">to </w><w c5="VVI" hw="reduce" pos="VERB">reduce </w><w c5="NN0" hw="staff" pos="SUBST">staff </w><w c5="PRP-AVP" hw="by" pos="PREP">by </w><w c5="AV0" hw="about" pos="ADV">about </w><w c5="CRD" hw="400" pos="ADJ">400 </w><w c5="TO0" hw="to" pos="PREP">to </w><w c5="VVI" hw="meet" pos="VERB">meet </w><w c5="AT0" hw="a" pos="ART">a </w><w c5="NN1-AJ0" hw="spending" pos="SUBST">spending </w><w c5="NN1" hw="cut-back" pos="SUBST">cut-back </w><w c5="PRF" hw="of" pos="PREP">of </w><w c5="NN0" hw="£12m" pos="UNC">£12m </w><w c5="DT0" hw="this" pos="ADJ">this </w><w c5="AJ0" hw="financial" pos="ADJ">financial </w><w c5="NN1" hw="year" pos="SUBST">year</w><c c5="PUN">.</c></s></p><p>
<s n="7"><w c5="PRP" hw="in" pos="PREP">In </w><w c5="AT0" hw="a" pos="ART">a </w><w c5="NN1" hw="letter" pos="SUBST">letter </w><w c5="PRP" hw="to" pos="PREP">to </w><w c5="DT0" hw="all" pos="ADJ">all </w><w c5="AJ0" hw="white" pos="ADJ">white </w><w c5="NN1" hw="collar" pos="SUBST">collar </w><w c5="CJC" hw="and" pos="CONJ">and </w><w c5="AJ0" hw="manual" pos="ADJ">manual </w><w c5="NN2" hw="worker" pos="SUBST">workers</w><c c5="PUN">, </c><w c5="AJ0" hw="chief" pos="ADJ">chief </w><w c5="NN1" hw="executive" pos="SUBST">executive </w><w c5="NP0" hw="roger" pos="SUBST">Roger </w><w c5="NP0" hw="davies" pos="SUBST">Davies </w><w c5="VBZ" hw="be" pos="VERB">is </w><w c5="VVG" hw="ask" pos="VERB">asking </w><w c5="AT0" hw="the" pos="ART">the </w><w c5="NN2" hw="over-50" pos="SUBST">over-50s </w><w c5="TO0" hw="to" pos="PREP">to </w><w c5="VVI" hw="consider" pos="VERB">consider </w><w c5="AT0" hw="an" pos="ART">an </w><w c5="AJ0" hw="early" pos="ADJ">early </w><w c5="NN1" hw="retirement" pos="SUBST">retirement </w><w c5="NN1" hw="scheme" pos="SUBST">scheme </w><w c5="PRP" hw="under" pos="PREP">under </w><w c5="DTQ" hw="which" pos="PRON">which </w><mw c5="AV0"><w c5="AVP" hw="up" pos="ADV">up </w><w c5="PRP" hw="to" pos="PREP">to </w></mw><w c5="CRD" hw="ten" pos="ADJ">ten </w><w c5="AJ0" hw="added" pos="ADJ">added </w><w c5="NN2" hw="year" pos="SUBST">years </w><w c5="NN2" hw="benefit" pos="SUBST">benefits </w><w c5="VBB" hw="be" pos="VERB">are </w><w c5="VBG" hw="be" pos="VERB">being </w><w c5="VVN" hw="offer" pos="VERB">offered </w><w c5="CJS" hw="if" pos="CONJ">if </w><w c5="NN0" hw="people" pos="SUBST">people </w><w c5="VVB" hw="agree" pos="VERB">agree </w><w c5="TO0" hw="to" pos="PREP">to </w><w c5="VVI" hw="go" pos="VERB">go </w><w c5="PRP" hw="before" pos="PREP">before </w><w c5="NP0" hw="june" pos="SUBST">June </w><w c5="CRD" hw="25" pos="ADJ">25</w><c c5="PUN">.</c></s></p><p>
<s n="8"><w c5="AT0" hw="the" pos="ART">The </w><w c5="NN1" hw="council" pos="SUBST">council </w><w c5="VBZ" hw="be" pos="VERB">is </w><w c5="VVG" hw="offer" pos="VERB">offering </w><w c5="AJ0" hw="enhanced" pos="ADJ">enhanced </w><w c5="AJ0" hw="compensatory" pos="ADJ">compensatory </w><w c5="NN2" hw="benefit" pos="SUBST">benefits </w><w c5="TO0" hw="to" pos="PREP">to </w><w c5="VVI" hw="make" pos="VERB">make </w><w c5="AT0" hw="the" pos="ART">the </w><w c5="NN1" hw="scheme" pos="SUBST">scheme </w><w c5="AV0" hw="more" pos="ADV">more </w><w c5="AJ0" hw="attractive" pos="ADJ">attractive</w><c c5="PUN">.</c></s></p><p>

Here is the code I am using:

import os
from lxml import etree

f = open("lnc_all2.txt", "a")

directory = "my/path/to/the/XML/files"

for filename in os.listdir(directory):
    if filename.endswith(".xml"):
        tree = etree.parse(filename)

        tree = etree.strip_elements(tree, "fileDesc")
        text = etree.tostring(tree, method = "text", encoding = "unicode", pretty_print = True)
        f.write(text)
        f.write("\n" + "\n")

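Note that in the code I try to remove the 'fileDesc' tag, but the problem is the same with any other tag.

Again, apologies if the answer is already out there, but as a novice programmer who is also new to XML, I have not been able to piece together the information I found in various other threads.

Thank you very much for your help! :)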

Answer 1

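Instead of removing elements from the tree, you can select the text from the node you actually want. See the example below. As a side note, I believe the text is encoded as ISO-8859-1 rather than unicode.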

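import os
from lxml import etree

directory = "my/path/to/the/XML/files"

# Binary mode, because the text is encoded to bytes before it is written.
with open("lnc_all2.txt", "ab") as f:
    for filename in os.listdir(directory):
        if filename.endswith(".xml"):
            # os.listdir() returns bare names, so join them with the directory.
            tree = etree.parse(os.path.join(directory, filename))

            # Whitelist: keep only the text under <wtext>.
            wtext = tree.xpath('//wtext')[0]

            text = ''.join(wtext.itertext())
            text = text.encode('ISO-8859-1')

            f.write(text)
            f.write(b"\n\n")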
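
To check the whitelist approach without touching the corpus, the same selection can be tried on a small in-memory document. This is only a sketch: the sample XML is a made-up, shortened version of the file in the question, and `body_text` is a hypothetical helper name, not part of lxml.

```python
from lxml import etree

# Made-up, shortened sample in the shape of the file from the question.
SAMPLE = (
    b'<bncDoc xml:id="K4E">'
    b'<teiHeader><fileDesc><titleStmt><title>Header title</title>'
    b'</titleStmt></fileDesc></teiHeader>'
    b'<wtext type="NEWS"><div level="1"><head type="MAIN">'
    b'<s n="1"><w c5="AJ0" hw="new" pos="ADJ">New </w>'
    b'<w c5="NN1" hw="plea" pos="SUBST">plea</w></s></head></div></wtext>'
    b'</bncDoc>'
)


def body_text(xml_bytes):
    """Return the concatenated text found under the first <wtext> element."""
    root = etree.fromstring(xml_bytes)
    wtext = root.xpath('//wtext')[0]
    return ''.join(wtext.itertext())


print(body_text(SAMPLE))
```

Because only the `wtext` subtree is read, nothing from `teiHeader` can end up in the output, and the tree itself is left unmodified.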

Edit (for a blacklist approach rather than a whitelist)

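import os
from lxml import etree

directory = "my/path/to/the/XML/files"

# Binary mode, because the text is encoded to bytes before it is written.
with open("lnc_all2.txt", "ab") as f:
    for filename in os.listdir(directory):
        if filename.endswith(".xml"):
            # Join with the directory instead of calling os.chdir().
            tree = etree.parse(os.path.join(directory, filename))

            # Blacklist: collect the matches first, then remove them, so the
            # tree is not modified while it is being iterated.
            for elem in list(tree.iter('fileDesc', 'encodingDesc', 'profileDesc')):
                elem.getparent().remove(elem)

            text = ''.join(tree.getroot().itertext())
            text = text.encode('ISO-8859-1')

            f.write(text)
            f.write(b"\n\n")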
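
As for the TypeError in the question: `etree.strip_elements` modifies the tree in place and returns `None`, so `tree = etree.strip_elements(tree, "fileDesc")` replaces the tree with `None`, and `tostring` is then handed nothing it can serialize. A minimal sketch of calling it without the assignment, on a made-up sample document (passing `with_tail=False` keeps any text that directly follows a removed element):

```python
from lxml import etree

# Made-up sample; the real files have a much larger header.
root = etree.fromstring(
    b'<doc><teiHeader><fileDesc><title>Header title</title></fileDesc>'
    b'</teiHeader><wtext><s><w>New </w><w>plea</w></s></wtext></doc>'
)

# strip_elements works in place; its return value is None, so do not assign it.
result = etree.strip_elements(root, 'fileDesc', with_tail=False)

# The original tree object is still valid and can be serialized as text.
text = etree.tostring(root, method='text', encoding='unicode')
print(repr(result), repr(text))
```

With this change the original `tostring(..., method="text")` call from the question should work as intended, since it receives the stripped tree rather than `None`.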

Cannot use 'tostring' together with 'strip_elements' on XML files with lxml
