I have to extract the text between </cons> and <cons using Notepad++. This pair occurs multiple times in the sentences of a text file.
My sample data looks like this:
<abstract>
<sentence>The <cons lex="CD4_coreceptor" sem="G#protein_molecule">CD4 coreceptor</cons> interacts with <cons lex="non-polymorphic_region" sem="G#protein_domain_or_region">non-polymorphic regions</cons> of <cons lex="major_histocompatibility_complex_class_II_molecule" sem="G#protein_family_or_group">major histocompatibility complex class II molecules</cons> on <cons lex="antigen-presenting_cell" sem="G#cell_type">antigen-presenting cells</cons> and contributes to <cons lex="T_cell_activation" sem="G#other_name">T cell activation</cons>.</sentence>
<sentence>We have investigated the effect of <cons lex="CD4_triggering" sem="G#other_name"><cons lex="CD4" sem="G#protein_molecule">CD4</cons> triggering</cons> on <cons lex="T_cell_activating_signal" sem="G#other_name">T cell activating signals</cons> in a <cons lex="lymphoma_model" sem="G#other_name">lymphoma model</cons> using <cons lex="monoclonal_antibody" sem="G#protein_family_or_group">monoclonal antibodies</cons> (<cons lex="mAb" sem="G#protein_domain_or_region">mAb</cons>) which recognize different <cons lex="CD4_epitope" sem="G#protein_family_or_group">CD4 epitopes</cons>.</sentence>
<sentence>We demonstrate that <cons lex="CD4_triggering" sem="G#other_name"><cons lex="CD4" sem="G#protein_molecule">CD4</cons> triggering</cons> delivers signals capable of activating the <cons lex="NF-AT_transcription_factor" sem="G#protein_molecule">NF-AT transcription factor</cons> which is required for <cons lex="interleukin-2_gene_expression" sem="G#other_name"><cons lex="interleukin-2" sem="G#protein_molecule">interleukin-2</cons> gene expression</cons>.</sentence>
<sentence>Whereas different <cons lex="anti-CD4_mAb" sem="G#protein_family_or_group">anti-CD4 mAb</cons> or <cons lex="HIV-1_gp120" sem="G#protein_molecule"><cons lex="HIV-1" sem="G#virus">HIV-1</cons> gp120</cons> could all trigger activation of the <cons lex="protein_tyrosine_kinase" sem="G#protein_family_or_group">protein tyrosine kinases</cons> <cons lex="p56lck" sem="G#protein_molecule">p56lck</cons> and <cons lex="p59fyn" sem="G#protein_molecule">p59fyn</cons> and phosphorylation of the <cons lex="Shc_adaptor_protein" sem="G#protein_molecule">Shc adaptor protein</cons>, which mediates signals to <cons lex="Ras" sem="G#protein_family_or_group">Ras</cons>, they differed significantly in their ability to activate <cons lex="NF-AT" sem="G#protein_molecule">NF-AT</cons>.</sentence>
<sentence>Lack of full activation of <cons lex="NF-AT" sem="G#protein_molecule">NF-AT</cons> could be correlated to a dramatically reduced capacity to induce <cons lex="calcium_flux" sem="G#other_name"><cons lex="calcium" sem="G#atom">calcium</cons> flux</cons> and could be complemented with a <cons lex="calcium_ionophore" sem="G#other_organic_compound">calcium ionophore</cons>.</sentence>
<sentence>The results identify functionally distinct <cons lex="epitope" sem="G#protein_family_or_group">epitopes</cons> on the <cons lex="CD4_coreceptor" sem="G#protein_molecule">CD4 coreceptor</cons> involved in activation of the <cons lex="Ras/protein_kinase_C_and_calcium_pathway" sem="G#other_name"><cons lex="Ras/protein_kinase_C" sem="G#protein_molecule"><cons lex="Ras/protein_kinase_C_pathway" sem="G#other_name"><cons lex="Ras" sem="G#protein_molecule">Ras</cons><cons lex="protein_kinase_C" sem="G#protein_molecule">/protein kinase C</cons></cons></cons> and <cons lex="calcium_pathway" sem="G#other_name">calcium pathways</cons></cons>.</sentence>
</abstract>
The output I want:
interacts with
of
on
and contributes to
on
in
using
which recognize different
triggering
delivers signals capable of activating the
which is required for
or
could all trigger activation of the
and
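I tried the regex
.*<\/cons>(.*?)<cons.* and replaced with $1
but it only gives the last occurrence of the text between </cons> and <cons in each sentence, whereas my sentences contain several of these tags. Can anyone help me?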
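The behaviour described above can be reproduced outside Notepad++. The following is a minimal sketch using Python's re module (an assumption: Notepad++ has its own regex engine, and the shortened sample line with its lex values is made up for illustration). The leading greedy .* consumes as much of the line as it can, so only the last gap survives the replacement.

```python
import re

# Hypothetical, shortened line in the same shape as the sample data.
line = ('The <cons lex="a">A</cons> interacts with '
        '<cons lex="b">B</cons> of <cons lex="c">C</cons>.')

# Pattern from the question: the greedy .* at the front swallows everything
# up to the LAST </cons> that is still followed by a <cons.
greedy = re.sub(r'.*</cons>(.*?)<cons.*', r'\1', line)

# Matching each gap separately returns every occurrence instead.
gaps = re.findall(r'</cons>([^<]*)<cons', line)

print(repr(greedy))
print(gaps)
```

Here greedy holds only the final gap, while gaps lists each piece of text that sits between a closing and an opening cons tag.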
Answer 0 (score: 0)
It replaces all XML tags with a space (you can also put a newline in the Replace field).
It leaves you with the string:
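The CD4 coreceptor interacts with non-polymorphic regions of major histocompatibility complex class II molecules on antigen-presenting cells and contributes to T cell activation. We have investigated the effect of CD4 triggering on T cell activating signals in a lymphoma model using monoclonal antibodies (mAb) which recognize different CD4 epitopes. We demonstrate that CD4 triggering delivers signals capable of activating the NF-AT transcription factor which is required for interleukin-2 gene expression. Whereas different anti-CD4 mAb or HIV-1 gp120 could all trigger activation of the protein tyrosine kinases p56lck and p59fyn and phosphorylation of the Shc adaptor protein, which mediates signals to Ras, they differed significantly in their ability to activate NF-AT. Lack of full activation of NF-AT could be correlated to a dramatically reduced capacity to induce calcium flux and could be complemented with a calcium ionophore. The results identify functionally distinct epitopes on the CD4 coreceptor involved in activation of the Ras/protein kinase C and calcium pathways.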
I hope it helps.
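The expression this answer used is not shown, so the following is only a sketch of the idea in Python. The pattern <[^>]+> is an assumption (it is a common way to match a tag), and the sample is a shortened version of the first sentence in the question.

```python
import re

sample = ('<sentence>The <cons lex="CD4_coreceptor" sem="G#protein_molecule">'
          'CD4 coreceptor</cons> interacts with '
          '<cons lex="non-polymorphic_region" sem="G#protein_domain_or_region">'
          'non-polymorphic regions</cons>.</sentence>')

# Assumed pattern: "<", then one or more characters that are not ">", then ">".
no_tags = re.sub(r'<[^>]+>', ' ', sample)

# Each tag became a space, so collapse the resulting runs of whitespace.
plain = re.sub(r'\s+', ' ', no_tags).strip()

print(plain)
```

Because every tag is replaced by a space, a stray space remains before the final period. Note that this approach yields the whole sentence as plain text, not only the pieces between the cons tags.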
Answer 1 (score: 0)
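Parsing XML with regular expressions is difficult; it is better to use an XML parser. The following Python 3 SAX content handler tracks when a </cons> end tag has been parsed (self.state = 1), whether text content immediately follows it (self.state = 2), and whether that text is in turn immediately followed by a cons start element. If so, it prints the content: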
import xml.sax
data = b'''\
<abstract>
<sentence>The <cons lex="CD4_coreceptor" sem="G#protein_molecule">CD4 coreceptor</cons> interacts with <cons lex="non-polymorphic_region" sem="G#protein_domain_or_region">non-polymorphic regions</cons> of <cons lex="major_histocompatibility_complex_class_II_molecule" sem="G#protein_family_or_group">major histocompatibility complex class II molecules</cons> on <cons lex="antigen-presenting_cell" sem="G#cell_type">antigen-presenting cells</cons> and contributes to <cons lex="T_cell_activation" sem="G#other_name">T cell activation</cons>.</sentence>
<sentence>We have investigated the effect of <cons lex="CD4_triggering" sem="G#other_name"><cons lex="CD4" sem="G#protein_molecule">CD4</cons> triggering</cons> on <cons lex="T_cell_activating_signal" sem="G#other_name">T cell activating signals</cons> in a <cons lex="lymphoma_model" sem="G#other_name">lymphoma model</cons> using <cons lex="monoclonal_antibody" sem="G#protein_family_or_group">monoclonal antibodies</cons> (<cons lex="mAb" sem="G#protein_domain_or_region">mAb</cons>) which recognize different <cons lex="CD4_epitope" sem="G#protein_family_or_group">CD4 epitopes</cons>.</sentence>
<sentence>We demonstrate that <cons lex="CD4_triggering" sem="G#other_name"><cons lex="CD4" sem="G#protein_molecule">CD4</cons> triggering</cons> delivers signals capable of activating the <cons lex="NF-AT_transcription_factor" sem="G#protein_molecule">NF-AT transcription factor</cons> which is required for <cons lex="interleukin-2_gene_expression" sem="G#other_name"><cons lex="interleukin-2" sem="G#protein_molecule">interleukin-2</cons> gene expression</cons>.</sentence>
<sentence>Whereas different <cons lex="anti-CD4_mAb" sem="G#protein_family_or_group">anti-CD4 mAb</cons> or <cons lex="HIV-1_gp120" sem="G#protein_molecule"><cons lex="HIV-1" sem="G#virus">HIV-1</cons> gp120</cons> could all trigger activation of the <cons lex="protein_tyrosine_kinase" sem="G#protein_family_or_group">protein tyrosine kinases</cons> <cons lex="p56lck" sem="G#protein_molecule">p56lck</cons> and <cons lex="p59fyn" sem="G#protein_molecule">p59fyn</cons> and phosphorylation of the <cons lex="Shc_adaptor_protein" sem="G#protein_molecule">Shc adaptor protein</cons>, which mediates signals to <cons lex="Ras" sem="G#protein_family_or_group">Ras</cons>, they differed significantly in their ability to activate <cons lex="NF-AT" sem="G#protein_molecule">NF-AT</cons>.</sentence>
<sentence>Lack of full activation of <cons lex="NF-AT" sem="G#protein_molecule">NF-AT</cons> could be correlated to a dramatically reduced capacity to induce <cons lex="calcium_flux" sem="G#other_name"><cons lex="calcium" sem="G#atom">calcium</cons> flux</cons> and could be complemented with a <cons lex="calcium_ionophore" sem="G#other_organic_compound">calcium ionophore</cons>.</sentence>
<sentence>The results identify functionally distinct <cons lex="epitope" sem="G#protein_family_or_group">epitopes</cons> on the <cons lex="CD4_coreceptor" sem="G#protein_molecule">CD4 coreceptor</cons> involved in activation of the <cons lex="Ras/protein_kinase_C_and_calcium_pathway" sem="G#other_name"><cons lex="Ras/protein_kinase_C" sem="G#protein_molecule"><cons lex="Ras/protein_kinase_C_pathway" sem="G#other_name"><cons lex="Ras" sem="G#protein_molecule">Ras</cons><cons lex="protein_kinase_C" sem="G#protein_molecule">/protein kinase C</cons></cons></cons> and <cons lex="calcium_pathway" sem="G#other_name">calcium pathways</cons></cons>.</sentence>
</abstract>'''
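class Handler(xml.sax.ContentHandler):
    """Print text that sits directly between a </cons> end tag and a <cons> start tag."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.state = 0  # 0: idle, 1: just saw </cons>, 2: saw </cons> followed by text
        self.content = ''

    def characters(self, content):
        if self.state == 1:
            # Text immediately after </cons>: remember it.
            self.content = content
            self.state = 2
        else:
            self.state = 0

    def startElement(self, name, attr):
        # A <cons> start tag right after the remembered text completes the pattern.
        if name == 'cons' and self.state == 2:
            print(self.content)
        self.state = 0

    def endElement(self, name):
        if name == 'cons':
            self.state = 1
        else:
            self.state = 0

xml.sax.parseString(data, Handler())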
This is the best I could do with a regular expression in Notepad++. It handles everything except the text after the last replacement.
Output:
interacts with
of
on
and contributes to
on
in a
using
(
) which recognize different
delivers signals capable of activating the
which is required for
or
could all trigger activation of the
and
and phosphorylation of the
, which mediates signals to
, they differed significantly in their ability to activate
could be correlated to a dramatically reduced capacity to induce
and could be complemented with a
on the
involved in activation of the
and
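As a rough alternative to a SAX handler, the same gaps can be read with xml.etree.ElementTree, where the text that follows an element's end tag is stored in that element's tail attribute. This is a sketch, not the answer's method: the function name gaps_between_cons is hypothetical, and the sample is a shortened version of the second sentence with the sem attributes dropped.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the same shape as the question's data.
xml_text = '''<abstract>
<sentence>We have investigated the effect of <cons lex="CD4_triggering"><cons lex="CD4">CD4</cons> triggering</cons> on <cons lex="T_cell_activating_signal">T cell activating signals</cons> in a <cons lex="lymphoma_model">lymphoma model</cons>.</sentence>
</abstract>'''

def gaps_between_cons(root):
    """Return the text found between a </cons> end tag and the next <cons> start tag."""
    found = []
    for parent in root.iter():
        children = list(parent)
        # Pair each child with the sibling that immediately follows it.
        for current, following in zip(children, children[1:]):
            # tail is the text after current's end tag, up to the next tag.
            if current.tag == 'cons' and following.tag == 'cons' and current.tail:
                found.append(current.tail)
    return found

gaps = gaps_between_cons(ET.fromstring(xml_text))
for gap in gaps:
    print(repr(gap))
```

The nested " triggering" is skipped because the inner cons element has no following sibling, which matches the behaviour of checking for a cons start element right after the text.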
Answer 2 (score: 0)
There is a simple way to extract the data in Notepad++, as mentioned above:
Search: .*?</cons>([^<]*?)<cons
Replace: \1\r\n
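To see what this search and replace does to a single line, here is a rough re-creation with Python's re module. The assumptions: Python's engine stands in for Notepad++'s, \n is used instead of \r\n, and the shortened sample line is made up for illustration.

```python
import re

# Hypothetical, shortened line in the same shape as the sample data.
line = ('The <cons lex="a">A</cons> interacts with '
        '<cons lex="b">B</cons> of <cons lex="c">C</cons>.')

# Same search pattern as above: lazily skip to a </cons>, capture the text
# up to the next <cons, and replace the whole match with the captured gap.
result = re.sub(r'.*?</cons>([^<]*?)<cons', r'\1\n', line)

print(result)
```

Each gap ends up on its own line, but whatever follows the last <cons on the line is never matched and stays in the output, so it has to be cleaned up in a second pass.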