I have an XML file and I am looking for the untagged text in it.
<body>
<p>The prognosis of patients with rectal cancer has improved since the introduction of total mesorectal excision (TME) surgery [
<xref ref-type="bibr" rid="CR1">1</xref>–
<xref ref-type="bibr" rid="CR3">3</xref>]. Using this surgical technique the mesorectal compartment including the rectum and perirectal fat is completely excised by sharp dissection along the mesorectal fascia (MRF) [
<xref ref-type="bibr" rid="CR1">1</xref>]. Additionally, large randomized trials have shown that neo-adjuvant therapy improves local tumor control even further, regardless of optimized surgical techniques [
<xref ref-type="bibr" rid="CR3">3</xref>,
<xref ref-type="bibr" rid="CR4">4</xref>]. The advances in rectal cancer treatment have provoked differentiated neo-adjuvant treatment strategies based on anatomical preoperative identifiable risk factors for local tumor recurrence as can be visualized with magnetic resonance imaging (MRI) [
<xref ref-type="bibr" rid="CR5">5</xref>]. One of the most important risk factors is the tumor relationship to the MRF, which actually defines the surgical circumferential resection margin (CRM) in TME surgery [
<xref ref-type="bibr" rid="CR6">6</xref>,
<xref ref-type="bibr" rid="CR7">7</xref>]. Long courses of neo-adjuvant chemoradiation have emerged as the preferential treatment of patients with anticipated tumor invasion of the MRF on MRI in order to downstage/downsize the tumor and to obtain tumor free resection margins [
<xref ref-type="bibr" rid="CR5">5</xref>].
</p>
</body>
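So the body may contain multiple <p> tags. I want to extract text such as
"]. Using this surgical technique the mesorectal compartment including the rectum and perirectal fat is completely excised by sharp dissection along the mesorectal fascia (MRF) ["
which lies between CR3 and CR1, and so on (that is, the text between consecutive xref elements). I also need to add this text to a dictionary that maps each rid to the list of texts that follow that rid. How can I do this with BeautifulSoup and/or regular expressions?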
Answer 0 (score: 2)
The code below worked for me. It creates a dictionary (mapping):
from bs4 import BeautifulSoup
from collections import defaultdict
import re
d = defaultdict(str)
html ='''
<body>
<p>The prognosis of patients with rectal cancer has improved since the introduction of total mesorectal excision (TME) surgery [
<xref ref-type="bibr" rid="CR1">1</xref>–
<xref ref-type="bibr" rid="CR3">3</xref>]. Using this surgical technique the mesorectal compartment including the rectum and perirectal fat is completely excised by sharp dissection along the mesorectal fascia (MRF) [
<xref ref-type="bibr" rid="CR1">1</xref>]. Additionally, large randomized trials have shown that neo-adjuvant therapy improves local tumor control even further, regardless of optimized surgical techniques [
<xref ref-type="bibr" rid="CR3">3</xref>,
<xref ref-type="bibr" rid="CR4">4</xref>]. The advances in rectal cancer treatment have provoked differentiated neo-adjuvant treatment strategies based on anatomical preoperative identifiable risk factors for local tumor recurrence as can be visualized with magnetic resonance imaging (MRI) [
<xref ref-type="bibr" rid="CR5">5</xref>]. One of the most important risk factors is the tumor relationship to the MRF, which actually defines the surgical circumferential resection margin (CRM) in TME surgery [
<xref ref-type="bibr" rid="CR6">6</xref>,
<xref ref-type="bibr" rid="CR7">7</xref>]. Long courses of neo-adjuvant chemoradiation have emerged as the preferential treatment of patients with anticipated tumor invasion of the MRF on MRI in order to downstage/downsize the tumor and to obtain tumor free resection margins [
<xref ref-type="bibr" rid="CR5">5</xref>].
</p>
</body>
'''
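soup = BeautifulSoup(html, 'html.parser')
xrefs = soup.find_all('xref')
for xref in xrefs:
    # The first next_element is the text inside <xref>; the second is the text after </xref>.
    inner = xref.next_element
    txt = inner.next_element
    # Keep only text shaped like "]. ... [", skipping separators such as "," or "–".
    if re.match(r'\].+\[', txt) is not None:
        d[xref.attrs['rid'].strip()] = txt.strip()
for k, v in d.items():
    print("The value of {0} is>>>>> {1}".format(k, v))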
This prints:
The value of CR3 is>>>>> ]. Using this surgical technique the mesorectal compartment including the rectum and perirectal fat is completely excised by sharp dissection along the mesorectal fascia (MRF) [
The value of CR1 is>>>>> ]. Additionally, large randomized trials have shown that neo-adjuvant therapy improves local tumor control even further, regardless of optimized surgical techniques [
The value of CR7 is>>>>> ]. Long courses of neo-adjuvant chemoradiation have emerged as the preferential treatment of patients with anticipated tumor invasion of the MRF on MRI in order to downstage/downsize the tumor and to obtain tumor free resection margins [
The value of CR4 is>>>>> ]. The advances in rectal cancer treatment have provoked differentiated neo-adjuvant treatment strategies based on anatomical preoperative identifiable risk factors for local tumor recurrence as can be visualized with magnetic resonance imaging (MRI) [
The value of CR5 is>>>>> ]. One of the most important risk factors is the tumor relationship to the MRF, which actually defines the surgical circumferential resection margin (CRM) in TME surgery [
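Note that the dictionary above keeps only one text per rid, while the question asks for a list of texts for each rid. A minimal sketch of that variant follows. It assumes the text of interest is always the direct next sibling of an xref; the shortened sample markup and the helper name texts_after_xrefs are made up for illustration and are not part of the original answer.

```python
from collections import defaultdict

from bs4 import BeautifulSoup, NavigableString

# Shortened stand-in for the document in the question (assumed sample data).
xml = """<body>
<p>Intro [
<xref ref-type="bibr" rid="CR1">1</xref>–
<xref ref-type="bibr" rid="CR3">3</xref>]. First sentence [
<xref ref-type="bibr" rid="CR1">1</xref>]. Second sentence [
<xref ref-type="bibr" rid="CR3">3</xref>,
<xref ref-type="bibr" rid="CR4">4</xref>]. Third sentence [
<xref ref-type="bibr" rid="CR1">1</xref>]. Fourth sentence [
<xref ref-type="bibr" rid="CR5">5</xref>].
</p>
</body>"""


def texts_after_xrefs(markup):
    """Map each rid to the list of text runs that directly follow it."""
    soup = BeautifulSoup(markup, "html.parser")
    result = defaultdict(list)
    for xref in soup.find_all("xref"):
        following = xref.next_sibling
        # Only plain text counts; skip the case where a tag (or nothing) follows.
        if not isinstance(following, NavigableString):
            continue
        text = following.strip()
        # Keep runs shaped like "]. ... [" and drop separators such as "," or "–".
        if text.startswith("]") and text.endswith("[") and len(text) > 2:
            result[xref["rid"]].append(text)
    return dict(result)


rid_to_texts = texts_after_xrefs(xml)
for rid, texts in rid_to_texts.items():
    print(rid, texts)
```

Because find_all searches the whole tree, the same loop also covers a body with several <p> tags. A rid that is only ever followed by a separator or by closing punctuation (CR5 in this sample) does not appear in the result.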
Answer 1 (score: 1)
How about this?
html = """
<body>
<p>The prognosis of patients with rectal cancer has improved since the introduction of total mesorectal excision (TME) surgery [
<xref ref-type="bibr" rid="CR1">1</xref>–
<xref ref-type="bibr" rid="CR3">3</xref>]. Using this surgical technique the mesorectal compartment including the rectum and perirectal fat is completely excised by sharp dissection along the mesorectal fascia (MRF) [
<xref ref-type="bibr" rid="CR1">1</xref>]. Additionally, large randomized trials have shown that neo-adjuvant therapy improves local tumor control even further, regardless of optimized surgical techniques [
<xref ref-type="bibr" rid="CR3">3</xref>,
<xref ref-type="bibr" rid="CR4">4</xref>]. The advances in rectal cancer treatment have provoked differentiated neo-adjuvant treatment strategies based on anatomical preoperative identifiable risk factors for local tumor recurrence as can be visualized with magnetic resonance imaging (MRI) [
<xref ref-type="bibr" rid="CR5">5</xref>]. One of the most important risk factors is the tumor relationship to the MRF, which actually defines the surgical circumferential resection margin (CRM) in TME surgery [
<xref ref-type="bibr" rid="CR6">6</xref>,
<xref ref-type="bibr" rid="CR7">7</xref>]. Long courses of neo-adjuvant chemoradiation have emerged as the preferential treatment of patients with anticipated tumor invasion of the MRF on MRI in order to downstage/downsize the tumor and to obtain tumor free resection margins [
<xref ref-type="bibr" rid="CR5">5</xref>].
</p>
</body>
"""
import re
# Capture everything after the first CR3 reference up to the end of that line.
re.search(r'<xref ref-type="bibr" rid="CR3">3</xref>(.*)', html).group(1)
The output is:
']. Using this surgical technique the mesorectal compartment including the rectum and perirectal fat is completely excised by sharp dissection along the mesorectal fascia (MRF) ['
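The search above handles one hard-coded reference. As a rough sketch, the same idea can be generalized to every xref with a single pattern that captures the rid and the text up to the next tag. The pattern name XREF_TAIL, the helper tails_by_rid, and the shortened sample are assumptions for illustration. Regular expressions are brittle on XML (single-quoted attributes or markup nested inside the following text would break this), so treat it as a quick approach for input shaped like the sample.

```python
import re
from collections import defaultdict

# Shortened stand-in for the document in the question (assumed sample data).
xml = """<body>
<p>Intro [
<xref ref-type="bibr" rid="CR1">1</xref>–
<xref ref-type="bibr" rid="CR3">3</xref>]. First sentence [
<xref ref-type="bibr" rid="CR1">1</xref>]. Second sentence [
<xref ref-type="bibr" rid="CR3">3</xref>,
<xref ref-type="bibr" rid="CR1">1</xref>]. Third sentence [
<xref ref-type="bibr" rid="CR5">5</xref>].
</p>
</body>"""

# Group 1: the rid value. Group 2: everything after </xref> up to the next "<".
XREF_TAIL = re.compile(r'<xref\b[^>]*\brid="([^"]+)"[^>]*>.*?</xref>([^<]*)')


def tails_by_rid(markup):
    """Collect, per rid, the "]. ... [" text runs that follow its xref tags."""
    result = defaultdict(list)
    for rid, tail in XREF_TAIL.findall(markup):
        tail = tail.strip()
        # Skip separators ("," or "–") and the closing "]." after the last xref.
        if re.fullmatch(r"\].+\[", tail, flags=re.DOTALL):
            result[rid].append(tail)
    return dict(result)


tails = tails_by_rid(xml)
for rid, texts in tails.items():
    print(rid, texts)
```

The [^<]* group stops at the next tag, so each captured tail is exactly the text between two consecutive xref elements (or between the last xref and </p>).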
Answer 2 (score: 1)
Try this (assuming all of your rid values start with CR):
>>> from bs4 import BeautifulSoup as bs
>>> soup = bs(xml, 'html.parser')  # xml is your XML string
>>> xml_dict = {'CR' + x.next_element: x.next_sibling.strip() for x in soup.find_all('xref')}
>>> print(xml_dict)
{u'CR3': u',',
u'CR1': u']. Additionally, large randomized trials have shown that neo-adjuvant therapy improves local tumor control even further, regardless of optimized surgical techniques [',
u'CR6': u',',
u'CR7': u']. Long courses of neo-adjuvant chemoradiation have emerged as the preferential treatment of patients with anticipated tumor invasion of the MRF on MRI in order to downstage/downsize the tumor and to obtain tumor free resection margins [',
u'CR4': u']. The advances in rectal cancer treatment have provoked differentiated neo-adjuvant treatment strategies based on anatomical preoperative identifiable risk factors for local tumor recurrence as can be visualized with magnetic resonance imaging (MRI) [',
u'CR5': u'].'}