Question

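I have a large (~50 MB) file containing poorly formed XML that describes documents and their properties between <item> </item> tags, and I want to extract the text from all English documents.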

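Python's standard XML parsing utilities (dom, sax, expat) choke on the bad formatting, and the more forgiving libraries (sgmllib, BeautifulSoup) parse the entire file and take too long.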

<item>
  <title>some title</title>
  <author>john doe</author>
  <lang>en</lang>
  <document> .... </document>
</item>

Does anyone know of a way to extract the text between <document> </document> only where lang=en, without parsing the entire document?

Additional information: why "poorly formed"

Some of the documents contain <dc:link></dc:link> elements, which cause problems for the parsers. Python's xml.minidom complains:

ExpatError: unbound prefix: line 13, column 0
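
A minimal reproduction of that error, assuming the only defect is that the dc prefix is never declared with an xmlns:dc attribute (the snippet string is made up for illustration):

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# The dc prefix is used but never bound to a namespace URI.
snippet = "<item><dc:link></dc:link></item>"

try:
    minidom.parseString(snippet)
    error_message = None
except ExpatError as exc:
    # minidom parses in namespace-aware mode, so the undeclared prefix is rejected.
    error_message = str(exc)

print(error_message)
```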

Answer 1

If you have gawk:

gawk 'BEGIN{
 RS="</item>"            # treat everything up to each </item> as one record
 startpat="<document>"
 endpat="</document>"
 lpat=length(startpat)
}
/<lang>en<\/lang>/{      # only records whose language is English
    match($0,startpat)
    start=RSTART
    match($0,endpat)
    end=RSTART
    # print the text between <document> and </document>
    print substr($0,start+lpat,end-(start+lpat))
}' file

Output

$ more file
Junk
Junk
<item>
  <title>some title</title>
  <author>john doe</author>
  <lang>en</lang>
  <document> text
         i want blah ............  </document>
</item>
junk
junk
<item>
  <title>some title</title>
  <author>jane doe</author>
  <lang>ch</lang>
  <document> junk text
           ..       ............ </document>
</item>
junk
blahblah..
<item>
  <title>some title</title>
  <author>GI joe</author>
  <lang>en</lang>
  <document>  text i want ..... in one line  </document>
</item>
aksfh
aslkfj
dflkas

$ ./shell.sh
 text
         i want blah ............
  text i want ..... in one line
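
Since the question asks about Python, the same record-splitting idea can be sketched there too. This is not part of the gawk answer, just a rough equivalent: it reads the input in chunks, splits on </item>, and pulls out the <document> text of records containing <lang>en</lang>. The function name, the chunk size and the sample string are assumptions made for illustration.

```python
import io
import re

ITEM_END = "</item>"
DOCUMENT = re.compile(r"<document>(.*?)</document>", re.DOTALL)

def iter_english_documents(stream, chunk_size=1 << 16):
    """Yield the <document> text of every record that contains <lang>en</lang>."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last "</item>" is a run of complete records;
        # the remainder stays in the buffer until more input arrives.
        *records, buffer = buffer.split(ITEM_END)
        for record in records:
            if "<lang>en</lang>" in record:
                match = DOCUMENT.search(record)
                if match:
                    yield match.group(1)

# Small in-memory demo; a real run would pass an open text file instead.
sample = io.StringIO(
    "junk<item><lang>en</lang><document> text i want </document></item>"
    "junk<item><lang>ch</lang><document>skip me</document></item>"
    "<item><lang>en</lang><document>second</document></item>tail"
)
english_texts = list(iter_english_documents(sample, chunk_size=16))
```

Like the gawk script, this never builds a tree and never checks well-formedness, so an undeclared prefix such as dc: does not matter; it does assume that each item holds at most one <document> element.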

Answer 2

You need some event-oriented parser, such as SAX or, in .NET, System.Xml.XmlReader.
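
As a rough sketch of the event-oriented approach in Python: xml.sax does no namespace processing by default, so an undeclared prefix such as dc: should pass through as part of the tag name instead of raising "unbound prefix". The handler class, the helper function and the sample input below are made up for illustration, and the sketch assumes the file otherwise has a single root element and that <document> holds plain text with no child elements.

```python
import xml.sax

class EnglishDocumentHandler(xml.sax.ContentHandler):
    """Collect <document> text from <item> elements whose <lang> is 'en'."""

    def __init__(self):
        super().__init__()
        self.documents = []   # extracted English texts
        self._tag = None      # name of the element currently being read
        self._lang = []
        self._doc = []

    def startElement(self, name, attrs):
        if name == "item":
            # A new item begins: forget what the previous one contained.
            self._lang, self._doc = [], []
        self._tag = name

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self._tag == "lang":
            self._lang.append(content)
        elif self._tag == "document":
            self._doc.append(content)

    def endElement(self, name):
        # Decide at the end of the item, so the order of <lang> and <document> does not matter.
        if name == "item" and "".join(self._lang).strip() == "en":
            self.documents.append("".join(self._doc))
        self._tag = None

def extract_english(xml_bytes):
    handler = EnglishDocumentHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.documents

# Hypothetical input with an undeclared dc: prefix.
sample = (
    b"<root>"
    b"<item><lang>en</lang><dc:link>x</dc:link><document>hello</document></item>"
    b"<item><lang>ch</lang><document>nope</document></item>"
    b"</root>"
)
english_texts = extract_english(sample)
```

For a file on disk, xml.sax.parse(path, handler) feeds the same handler incrementally instead of loading everything into memory.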

Answer 3

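Depending on how (and how badly) the document is "broken", it may be possible to write a simple filter in perl/python that fixes it up enough to pass an XML well-formedness test and then feed it into a DOM or XSLT.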

Can you add some examples of what is wrong with the input?
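
For the unbound dc: prefix reported in the question, such a filter might look like the sketch below. It assumes the prefixes on element names are the only defect, rewrites <dc:link> to <dc_link>, and then hands the result to ElementTree. The function names and the sample input are hypothetical, and prefixed attributes are not handled.

```python
import re
import xml.etree.ElementTree as ET

# Matches the prefix in opening and closing tags such as <dc:link> or </dc:link>.
PREFIXED_TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*):")

def strip_tag_prefixes(text):
    """Rewrite <dc:link> as <dc_link> so the parser no longer sees an unbound prefix."""
    return PREFIXED_TAG.sub(r"<\1\2_", text)

def english_documents(text):
    """Return the <document> text of every <item> whose <lang> is 'en'."""
    root = ET.fromstring(strip_tag_prefixes(text))
    return [
        item.findtext("document")
        for item in root.iter("item")
        if item.findtext("lang") == "en"
    ]

# Hypothetical broken input: the dc prefix is never declared.
broken = (
    "<root>"
    "<item><lang>en</lang><dc:link>x</dc:link><document>hello</document></item>"
    "<item><lang>ch</lang><document>nope</document></item>"
    "</root>"
)
fixed_tag = strip_tag_prefixes("<dc:link>x</dc:link>")
english_texts = english_documents(broken)
```

This still builds the whole tree in memory, so it trades the speed concern in the question for the convenience of a DOM-style API.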

Answer 4

I think that if you are OK with Java, VTD-XML would work without any issues with undefined prefixes...

Extracting text from specific elements of a large, poorly formed XML file
