I am trying to search about 500 XML documents for certain phrases and output the ID of any element that contains any of them. This is my code so far:
from lxml import etree
import os
import re

files = os.listdir('C:/Users/Me/Desktop/xml')
search_words = ['House divided', 'Committee divided', 'on Division', 'Division List',
                'The Ayes and the Noes',]
for f in files:
    doc = etree.parse('C:/Users/Me/Desktop/xml/' + f)
    for elem in doc.iter():
        for word in search_words:
            # elem.text only covers the text before the element's first child
            if elem.text is not None and str(elem.attrib) != "{}" and word in elem.text and len(re.findall(r'\d+', elem.text)) > 1:
                votes = re.findall(r'\d+', elem.text)
                string = str(elem.attrib)[8:-2] + ","
                string += (str(votes[0]) + "," + str(votes[1]) + ",")
                string += word + ","
                string += str(elem.sourceline)
                print(string)
Input like this is picked up and output correctly:
<p id="S3V0001P0-01869">The House divided; Against the Motion 83; For it 23—Majority 60.</p>
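But input with nested elements like the following is missed, because the text inside the child elements is never searched for the phrases: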
<p id="S3V0141P0-01248"><member>THE CHANCELLOR OF THE EXCHEQUER</member><membercontribution> said, that the precedent occurred on the 8th of April, 1850, on a Motion for going into a Committee of Supply. An Amendment was moved by Captain Boldero on the subject of assistant-surgeons in the navy, when, on a division being called for, the Question was put that the words proposed to be left out stand part of the Question. The House divided, when the numbers were—Ayes, 40; Noes, 48. The Question, "That the proposed words be added" was put and agreed to; the main Question, as amended, was put and agreed to; and the Question being then put, "That Mr. Speaker do now leave the chair," that Motion was agreed to, and the House went into Committee of Supply.</membercontribution></p>
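Is there a way to read the text inside nested elements like this and return the ID?
Answer 0 (score: 1)
lxml provides an xpath method, and XPath has a contains function that you can use here. In contains(., ...), the dot stands for the element's string value, which includes the text of all its descendants, so nested text is searched as well. For example: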
from lxml import etree as ET

doc = ET.fromstring('<p id="S3V0141P0-01248"><member>THE CHANCELLOR OF THE EXCHEQUER</member><membercontribution> said, that the precedent occurred on the 8th of April, 1850, on a Motion for going into a Committee of Supply. An Amendment was moved by Captain Boldero on the subject of assistant-surgeons in the navy, when, on a division being called for, the Question was put that the words proposed to be left out stand part of the Question. The House divided, when the numbers were—Ayes, 40; Noes, 48. The Question, "That the proposed words be added" was put and agreed to; the main Question, as amended, was put and agreed to; and the Question being then put, "That Mr. Speaker do now leave the chair," that Motion was agreed to, and the House went into Committee of Supply.</membercontribution></p>')
# Match every element that has an id attribute and whose full text contains the phrase
result = doc.xpath('//*[@id and contains(., $word)]', word = 'House divided')
for elem in result:
    print(elem.get('id'))
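To connect this back to the loop in the question, here is a minimal sketch of how the XPath query might replace the elem.text check. The helper name find_divisions and the small sample document are made up for illustration; the per-file loop with etree.parse would stay as in the question.

```python
import re
from lxml import etree

SEARCH_WORDS = ['House divided', 'Committee divided', 'on Division',
                'Division List', 'The Ayes and the Noes']

def find_divisions(root, words=SEARCH_WORDS):
    """Return (id, first number, second number, phrase, line) tuples.

    find_divisions is a hypothetical helper, not part of lxml.
    """
    rows = []
    for word in words:
        # contains(., $word) tests the element's text including all descendants
        for elem in root.xpath('//*[@id and contains(., $word)]', word=word):
            text = ''.join(elem.itertext())  # same nested text, as a Python string
            votes = re.findall(r'\d+', text)
            if len(votes) > 1:
                rows.append((elem.get('id'), votes[0], votes[1], word, elem.sourceline))
    return rows

# Tiny made-up sample: one flat paragraph and one with nested elements
sample = etree.fromstring(
    '<debates>'
    '<p id="a1">The House divided; Against the Motion 83; For it 23.</p>'
    '<p id="a2"><member>A MEMBER</member><membercontribution> said so. '
    'The House divided, when the numbers were Ayes, 40; Noes, 48.'
    '</membercontribution></p>'
    '</debates>')

for row in find_divisions(sample):
    print(','.join(str(field) for field in row))
```

One caveat carried over from the original logic: taking the first two numbers in the text can pick up dates. In the real paragraph from the question, the first two numbers are 8 and 1850 rather than 40 and 48, so the vote extraction would probably need to look only at the text after the matched phrase.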
Answer 1 (score: 0)
You can use a bit of XPath to extract all the interesting text elements. I like Parsel: pip install parsel.
import parsel

data = ('<x><y><z><p id="S3V0141P0-01248"><member>THE CHANCELLOR OF THE EXCHEQUER'
        '</member><membercontribution> said, that the precedent occurred on the '
        '8th of April, 1850, on a Motion ...</membercontribution></p></z></y></x>')
selector = parsel.Selector(data)
for para in selector.xpath('//p'):
    para_id = para.xpath('@id').extract_first()
    # Text nodes of the paragraph's direct child elements
    texts = para.xpath('*/text()').extract()
    for text in texts:
        # do whatever search
        print(para_id, len(text), 'April' in text)
Output:
S3V0141P0-01248 31 False
S3V0141P0-01248 77 True