Regex to select an entire paragraph by matching a word in its first line

Time: 2017-08-31 04:40:54

Tags: regex

I am using the following regex to match the paragraph that begins with the word "Summary":

([^\']*(?=Summary)[^\']*)

But it matches all of the text: regex101a

I also tried

(?<=Summary).*?(?=]\.)

which does not match anything: regex101b

I think this has something to do with the formatting of the text file.

Here is an example:

COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
            reference sequence was derived from AC105339.9 and FJ695193.1.
            This sequence is a reference standard in the RefSeqGene project.

        Summary: Adaptor protein complex 3 (AP-3 complex) is a
        heterotrimeric protein complex involved in the formation of
        clathrin-coated synaptic vesicles. The protein encoded by this gene
        represents the beta subunit of the neuron-specific AP-3 complex and
        was first identified as the target antigen in human paraneoplastic
        neurologic disorders. The encoded subunit binds clathrin and is
        phosphorylated by a casein kinase-like protein, which mediates
        synaptic vesicle coat assembly. Defects in this gene are a cause of
        early-onset epileptic encephalopathy. [provided by RefSeq, Feb
        2017].
PRIMARY     REFSEQ_SPAN         PRIMARY_IDENTIFIER PRIMARY_SPAN        COMP
            1-35060             AC105339.9         88079-123138
            35061-35259         FJ695193.1         1-199               c
            35260-57628         AC105339.9         123337-145705

This is what I am trying to capture:

    Summary: Adaptor protein complex 3 (AP-3 complex) is a
    heterotrimeric protein complex involved in the formation of
    clathrin-coated synaptic vesicles. The protein encoded by this gene
    represents the beta subunit of the neuron-specific AP-3 complex and
    was first identified as the target antigen in human paraneoplastic
    neurologic disorders. The encoded subunit binds clathrin and is
    phosphorylated by a casein kinase-like protein, which mediates
    synaptic vesicle coat assembly. Defects in this gene are a cause of
    early-onset epileptic encephalopathy. [provided by RefSeq, Feb
    2017].
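
A likely explanation for both results can be sketched in Python. This is an assumption, since the regex101 links may have used a different flavor or flags, and the `sample` string is a shortened, hypothetical stand-in for the record above. The class `[^\']` excludes only the apostrophe, so it runs across newlines and swallows the whole text, while `.` does not cross newlines unless a dot-all (single-line) flag is set.

```python
import re

# Shortened, hypothetical stand-in for the record shown above.
sample = (
    "COMMENT     REVIEWED REFSEQ: This record has been curated.\n"
    "\n"
    "        Summary: Adaptor protein complex 3 is a\n"
    "        heterotrimeric protein complex. [provided by RefSeq, Feb\n"
    "        2017].\n"
    "PRIMARY     REFSEQ_SPAN\n"
)

# Attempt 1: [^\']* accepts every character except an apostrophe, newlines
# included, so on apostrophe-free text the match spans the entire input.
first = re.search(r"([^\']*(?=Summary)[^\']*)", sample)
matches_everything = first is not None and first.group(1) == sample

# Attempt 2: without DOTALL, '.' stops at the end of the "Summary" line,
# so the lookahead for "]." on a later line can never succeed.
second_default = re.search(r"(?<=Summary).*?(?=]\.)", sample)

# With DOTALL, '.' also matches newlines and the same pattern finds a match.
second_dotall = re.search(r"(?<=Summary).*?(?=]\.)", sample, re.DOTALL)

print(matches_everything)
print(second_default)
print(second_dotall.group(0) if second_dotall else None)
```

Even with the dot-all flag, the second pattern starts after the word "Summary" and stops before "].", so it would still not return the paragraph exactly as shown above.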

1 answer:

Answer 0: (score: 2)

I think this is a robust pattern that matches your paragraph (using the Multiline flag):

^\s+$\n^([ \t]+)Summary.*(?:\n\1[ \t]*\S.*)+

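Working example: https://regex101.com/r/P6KlBa/2

  • "Summary" can presumably also appear as the first word of a line in the middle of a paragraph. We first match a blank line to make sure "Summary" is at the beginning of a paragraph.
  • ([ \t]+) captures the leading whitespace of the first line, so that the following lines can be required to start with the same indentation. Some flavors support \h for horizontal whitespace.
  • Summary.* - the first line begins with "Summary".
  • (?:\n\1[ \t]*\S.*)+ - matches the following non-empty lines.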
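
The pattern above can be tried in Python as a minimal sketch. The helpers `make_sample` and `extract_summary`, the shortened sample text, and the `RELAXED` variant are all assumptions made for illustration and are not part of the answer. Note that `^\s+$` needs at least one whitespace character on the blank line itself (spaces, or the `\r` of a CRLF line ending), so the sketch also checks a completely empty blank line.

```python
import re

# The answer's pattern, copied verbatim. re.MULTILINE lets ^ and $ match
# at the start and end of every line.
PATTERN = re.compile(
    r"^\s+$\n^([ \t]+)Summary.*(?:\n\1[ \t]*\S.*)+", re.MULTILINE
)

# Hypothetical variation (not from the answer): \s* instead of \s+ so that
# a completely empty blank line is accepted as well.
RELAXED = re.compile(
    r"^\s*$\n^([ \t]+)Summary.*(?:\n\1[ \t]*\S.*)+", re.MULTILINE
)


def make_sample(blank_line):
    """Build a shortened, hypothetical record with the given blank line."""
    lines = [
        "COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff.",
        blank_line,
        "        Summary: Adaptor protein complex 3 (AP-3 complex) is a",
        "        heterotrimeric protein complex. [provided by RefSeq, Feb",
        "        2017].",
        "PRIMARY     REFSEQ_SPAN         PRIMARY_IDENTIFIER PRIMARY_SPAN        COMP",
    ]
    return "\n".join(lines) + "\n"


def extract_summary(text, pattern=PATTERN):
    """Return the Summary paragraph with its indentation, or None."""
    match = pattern.search(text)
    if match is None:
        return None
    # The match begins with the blank line; drop everything up to the last
    # leading newline but keep the paragraph's own indentation.
    return re.sub(r"\A\s*\n", "", match.group(0))


# Blank line made of spaces: the verbatim pattern finds the paragraph.
paragraph = extract_summary(make_sample("            "))
print(paragraph)

# Completely empty blank line with \n endings: the verbatim pattern fails,
# while the relaxed variation still matches.
no_match = extract_summary(make_sample(""))
relaxed_result = extract_summary(make_sample(""), RELAXED)
print(no_match)
```

The back-reference `\1` is what ends the match at the `PRIMARY` line, since that line does not start with the captured indentation.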