Using BeautifulSoup

Time: 2015-10-29 13:45:37

Tags: python regex xml beautifulsoup

I have an XML file in which I am looking for untagged text.

<body>
        <p>The prognosis of patients with rectal cancer has improved since the introduction of total mesorectal excision (TME) surgery [
            <xref ref-type="bibr" rid="CR1">1</xref>&#x02013;
            <xref ref-type="bibr" rid="CR3">3</xref>]. Using this surgical technique the mesorectal compartment including the rectum and perirectal fat is completely excised by sharp dissection along the mesorectal fascia (MRF) [
            <xref ref-type="bibr" rid="CR1">1</xref>]. Additionally, large randomized trials have shown that neo-adjuvant therapy improves local tumor control even further, regardless of optimized surgical techniques [
            <xref ref-type="bibr" rid="CR3">3</xref>, 
            <xref ref-type="bibr" rid="CR4">4</xref>]. The advances in rectal cancer treatment have provoked differentiated neo-adjuvant treatment strategies based on anatomical preoperative identifiable risk factors for local tumor recurrence as can be visualized with magnetic resonance imaging (MRI) [
            <xref ref-type="bibr" rid="CR5">5</xref>]. One of the most important risk factors is the tumor relationship to the MRF, which actually defines the surgical circumferential resection margin (CRM) in TME surgery [
            <xref ref-type="bibr" rid="CR6">6</xref>, 
            <xref ref-type="bibr" rid="CR7">7</xref>]. Long courses of neo-adjuvant chemoradiation have emerged as the preferential treatment of patients with anticipated tumor invasion of the MRF on MRI in order to downstage/downsize the tumor and to obtain tumor free resection margins [
            <xref ref-type="bibr" rid="CR5">5</xref>].
            </p>

</body>

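So the body may contain multiple <p> tags. I want to extract text like

"]. Using this surgical technique the mesorectal compartment including the rectum and perirectal fat is completely excised by sharp dissection along the mesorectal fascia (MRF) ["

which lies between CR3 and CR1, and so on (i.e. between consecutive xrefs). I also need to add this text to a dictionary that maps each rid to the list of texts that follow that rid. How can I do this with BeautifulSoup and/or a regexp?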

3 answers:

Answer 0: (score: 2)

The code below worked for me - it creates a dictionary (mapping)!

from bs4 import BeautifulSoup
import re

# Maps each rid to the text that follows it; values are only ever assigned,
# so a plain dict is enough
d = {}

html ='''
<body>
        <p>The prognosis of patients with rectal cancer has improved since the introduction of total mesorectal excision (TME) surgery [
            <xref ref-type="bibr" rid="CR1">1</xref>&#x02013;
            <xref ref-type="bibr" rid="CR3">3</xref>]. Using this surgical technique the mesorectal compartment including the rectum and perirectal fat is completely excised by sharp dissection along the mesorectal fascia (MRF) [
            <xref ref-type="bibr" rid="CR1">1</xref>]. Additionally, large randomized trials have shown that neo-adjuvant therapy improves local tumor control even further, regardless of optimized surgical techniques [
            <xref ref-type="bibr" rid="CR3">3</xref>, 
            <xref ref-type="bibr" rid="CR4">4</xref>]. The advances in rectal cancer treatment have provoked differentiated neo-adjuvant treatment strategies based on anatomical preoperative identifiable risk factors for local tumor recurrence as can be visualized with magnetic resonance imaging (MRI) [
            <xref ref-type="bibr" rid="CR5">5</xref>]. One of the most important risk factors is the tumor relationship to the MRF, which actually defines the surgical circumferential resection margin (CRM) in TME surgery [
            <xref ref-type="bibr" rid="CR6">6</xref>, 
            <xref ref-type="bibr" rid="CR7">7</xref>]. Long courses of neo-adjuvant chemoradiation have emerged as the preferential treatment of patients with anticipated tumor invasion of the MRF on MRI in order to downstage/downsize the tumor and to obtain tumor free resection margins [
            <xref ref-type="bibr" rid="CR5">5</xref>].
            </p>

</body>

'''

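soup = BeautifulSoup(html, 'html.parser')
for xref in soup.find_all('xref'):
    # next_element is the tag's own text (e.g. "1"); the element after it
    # is the text that follows the closing </xref>
    txt = xref.next_element.next_element
    # Keep only text shaped like "]. ... [" (skips separators such as ",")
    if re.match(r'\].+\[', txt) is not None:
        # A later match for the same rid overwrites the earlier one
        d[xref.attrs['rid'].strip()] = txt.strip()
# The order of the printed keys depends on the Python version
for k, v in d.items():
    print("The value of {0} is>>>>> {1} ".format(k, v))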

This prints:

The value of CR3 is>>>>> ]. Using this surgical technique the mesorectal compartment including the rectum and perirectal fat is completely excised by sharp dissection along the mesorectal fascia (MRF) [ 
The value of CR1 is>>>>> ]. Additionally, large randomized trials have shown that neo-adjuvant therapy improves local tumor control even further, regardless of optimized surgical techniques [ 
The value of CR7 is>>>>> ]. Long courses of neo-adjuvant chemoradiation have emerged as the preferential treatment of patients with anticipated tumor invasion of the MRF on MRI in order to downstage/downsize the tumor and to obtain tumor free resection margins [ 
The value of CR4 is>>>>> ]. The advances in rectal cancer treatment have provoked differentiated neo-adjuvant treatment strategies based on anatomical preoperative identifiable risk factors for local tumor recurrence as can be visualized with magnetic resonance imaging (MRI) [ 
The value of CR5 is>>>>> ]. One of the most important risk factors is the tumor relationship to the MRF, which actually defines the surgical circumferential resection margin (CRM) in TME surgery [ 
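
The question asks for a dictionary that maps each rid to a list of the texts following it, whereas the code above keeps a single text per rid. Below is a minimal sketch of a list-based variant, assuming Python 3 and bs4. The function name texts_after_xrefs and the shortened SAMPLE_XML are made up for illustration and are not part of the original answer.

```python
import re
from collections import defaultdict

from bs4 import BeautifulSoup, NavigableString

# Shortened, hypothetical stand-in for the XML in the question
SAMPLE_XML = '''<body><p>Intro [
<xref ref-type="bibr" rid="CR1">1</xref>&#x02013;
<xref ref-type="bibr" rid="CR3">3</xref>]. First sentence [
<xref ref-type="bibr" rid="CR1">1</xref>]. Second sentence [
<xref ref-type="bibr" rid="CR3">3</xref>,
<xref ref-type="bibr" rid="CR4">4</xref>]. Third sentence [
<xref ref-type="bibr" rid="CR1">1</xref>]. Fourth sentence [
<xref ref-type="bibr" rid="CR5">5</xref>].
</p></body>'''


def texts_after_xrefs(markup):
    """Map each rid to the list of '] ... [' texts that follow it."""
    soup = BeautifulSoup(markup, 'html.parser')
    result = defaultdict(list)
    for xref in soup.find_all('xref'):
        # First hop: the tag's own text ("1"); second hop: what follows </xref>
        inner = xref.next_element
        following = inner.next_element if inner is not None else None
        if not isinstance(following, NavigableString):
            continue  # nothing, or another tag, follows this xref
        text = following.strip()
        # Same filter as the answer: keep only text shaped like "]. ... ["
        if re.match(r'\].+\[', text):
            result[xref['rid']].append(text)
    return dict(result)


for rid, texts in texts_after_xrefs(SAMPLE_XML).items():
    print(rid, texts)
```

Appending to a list instead of assigning means a rid that is cited several times, like CR1 here, keeps every text that follows it rather than only the last one.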

Answer 1: (score: 1)

How about this?

html = """
<body>
        <p>The prognosis of patients with rectal cancer has improved since the introduction of total mesorectal excision (TME) surgery [
            <xref ref-type="bibr" rid="CR1">1</xref>&#x02013;
            <xref ref-type="bibr" rid="CR3">3</xref>]. Using this surgical technique the mesorectal compartment including the rectum and perirectal fat is completely excised by sharp dissection along the mesorectal fascia (MRF) [
            <xref ref-type="bibr" rid="CR1">1</xref>]. Additionally, large randomized trials have shown that neo-adjuvant therapy improves local tumor control even further, regardless of optimized surgical techniques [
            <xref ref-type="bibr" rid="CR3">3</xref>, 
            <xref ref-type="bibr" rid="CR4">4</xref>]. The advances in rectal cancer treatment have provoked differentiated neo-adjuvant treatment strategies based on anatomical preoperative identifiable risk factors for local tumor recurrence as can be visualized with magnetic resonance imaging (MRI) [
            <xref ref-type="bibr" rid="CR5">5</xref>]. One of the most important risk factors is the tumor relationship to the MRF, which actually defines the surgical circumferential resection margin (CRM) in TME surgery [
            <xref ref-type="bibr" rid="CR6">6</xref>, 
            <xref ref-type="bibr" rid="CR7">7</xref>]. Long courses of neo-adjuvant chemoradiation have emerged as the preferential treatment of patients with anticipated tumor invasion of the MRF on MRI in order to downstage/downsize the tumor and to obtain tumor free resection margins [
            <xref ref-type="bibr" rid="CR5">5</xref>].
            </p>

</body>
"""

import re
# Capture the rest of the line after the first CR3 xref
re.search('<xref ref-type="bibr" rid="CR3">3</xref>(.*)', html).group(1)

The output is:

']. Using this surgical technique the mesorectal compartment including the rectum and perirectal fat is completely excised by sharp dissection along the mesorectal fascia (MRF) ['
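
The search above finds the text after one hard-coded xref. A regex-only sketch that collects the text after every xref, grouped by rid, could look like the following. It assumes Python 3 and that each xref is written as a simple open/close pair with a double-quoted rid attribute; the names XREF_THEN_TEXT, texts_by_rid and SAMPLE_XML are hypothetical.

```python
import re
from collections import defaultdict
from html import unescape

# Shortened, hypothetical stand-in for the XML in the question
SAMPLE_XML = '''<body><p>Intro [
<xref ref-type="bibr" rid="CR1">1</xref>&#x02013;
<xref ref-type="bibr" rid="CR3">3</xref>]. First sentence [
<xref ref-type="bibr" rid="CR1">1</xref>]. Second sentence [
<xref ref-type="bibr" rid="CR3">3</xref>,
<xref ref-type="bibr" rid="CR4">4</xref>]. Third sentence [
<xref ref-type="bibr" rid="CR1">1</xref>]. Fourth sentence [
<xref ref-type="bibr" rid="CR5">5</xref>].
</p></body>'''

# Group 1: the rid value; group 2: everything up to the next "<"
XREF_THEN_TEXT = re.compile(
    r'<xref\b[^>]*\brid="([^"]+)"[^>]*>[^<]*</xref>([^<]*)'
)


def texts_by_rid(markup):
    """Map each rid to the list of non-empty texts that follow it."""
    found = defaultdict(list)
    for rid, following in XREF_THEN_TEXT.findall(markup):
        # A regex sees raw markup, so decode entities such as &#x02013; here
        text = unescape(following).strip()
        if text:
            found[rid].append(text)
    return dict(found)


for rid, texts in texts_by_rid(SAMPLE_XML).items():
    print(rid, texts)
```

Unlike the first answer, this keeps separators such as "," as well; add a check like re.match(r'\].+\[', text) if only the sentence-like pieces are wanted.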

Answer 2: (score: 1)

Check this out (assuming all of your rid values start with CR):

>>> from bs4 import BeautifulSoup as bs
>>> soup = bs(xml, 'html.parser')  # xml is your XML string
>>> xml_dict = {'CR' + x.next_element: x.next_sibling.strip() for x in soup.find_all('xref')}
>>> print(xml_dict)

{u'CR3': u',', 
 u'CR1': u']. Additionally, large randomized trials have shown that neo-adjuvant therapy improves local tumor control even further, regardless of optimized surgical techniques [', 
 u'CR6': u',', 
 u'CR7': u']. Long courses of neo-adjuvant chemoradiation have emerged as the preferential treatment of patients with anticipated tumor invasion of the MRF on MRI in order to downstage/downsize the tumor and to obtain tumor free resection margins [', 
 u'CR4': u']. The advances in rectal cancer treatment have provoked differentiated neo-adjuvant treatment strategies based on anatomical preoperative identifiable risk factors for local tumor recurrence as can be visualized with magnetic resonance imaging (MRI) [', 
 u'CR5': u'].'}
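
The "CR" prefix assumption can be dropped by reading the rid attribute itself instead of rebuilding it from the tag text. A minimal sketch of that idea, assuming Python 3 and bs4; the function name following_text_by_rid and the short SAMPLE_XML (which includes a rid that does not start with CR) are made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample; "B12" shows a rid without the "CR" prefix
SAMPLE_XML = (
    '<p>Alpha [<xref rid="CR1">1</xref>]. Beta ['
    '<xref rid="B12">12</xref>]. Gamma ['
    '<xref rid="CR1">1</xref>].</p>'
)


def following_text_by_rid(markup):
    """Map each rid to the text right after its last occurrence."""
    soup = BeautifulSoup(markup, 'html.parser')
    mapping = {}
    for xref in soup.find_all('xref'):
        rid = xref.get('rid')
        sibling = xref.next_sibling
        # Skip xrefs without a rid, and xrefs followed by a tag or by nothing
        if rid is None or not isinstance(sibling, NavigableString):
            continue
        mapping[rid] = sibling.strip()
    return mapping


print(following_text_by_rid(SAMPLE_XML))
```

As in the answer above, a rid that appears more than once keeps only the text after its last occurrence.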