Question

I am using the following regular expression to match a paragraph that starts with the word "Summary":

([^\']*(?=Summary)[^\']*)

but it matches all of the text: regex101a

I also tried

(?<=Summary).*?(?=]\.)

which does not match anything: regex101b

I think this has something to do with the formatting of the text file.
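
Both results can be reproduced outside regex101 with a short Python sketch. The `text` string is a shortened, hypothetical stand-in for the record shown below, and Python's `re` module is an assumption here, since no regex flavor is named:

```python
import re

# Shortened, hypothetical stand-in for the record shown below.
text = (
    "COMMENT     REVIEWED REFSEQ: curated by NCBI staff.\n"
    "            \n"
    "        Summary: Adaptor protein complex 3 is a\n"
    "        heterotrimeric protein complex. [provided by RefSeq, Feb\n"
    "        2017].\n"
    "PRIMARY     REFSEQ_SPAN\n"
)

# Attempt 1: [^']* may consume any character except an apostrophe,
# newlines included, so the match spans the whole text.
first = re.search(r"([^\']*(?=Summary)[^\']*)", text)
print(first.group(1) == text)

# Attempt 2: without re.DOTALL, "." stops at a newline, and "]." is not
# on the same line as "Summary", so nothing is found.
second = re.search(r"(?<=Summary).*?(?=]\.)", text)
print(second is None)

# With re.DOTALL the lazy ".*?" can cross lines and reach "].".
second_dotall = re.search(r"(?<=Summary).*?(?=]\.)", text, re.DOTALL)
print(second_dotall.group(0).endswith("2017"))
```

So the first pattern is too permissive because the text contains no apostrophe, and the second one fails because the paragraph spans several lines.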

Here is an example:

COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
            reference sequence was derived from AC105339.9 and FJ695193.1.
            This sequence is a reference standard in the RefSeqGene project.

        Summary: Adaptor protein complex 3 (AP-3 complex) is a
        heterotrimeric protein complex involved in the formation of
        clathrin-coated synaptic vesicles. The protein encoded by this gene
        represents the beta subunit of the neuron-specific AP-3 complex and
        was first identified as the target antigen in human paraneoplastic
        neurologic disorders. The encoded subunit binds clathrin and is
        phosphorylated by a casein kinase-like protein, which mediates
        synaptic vesicle coat assembly. Defects in this gene are a cause of
        early-onset epileptic encephalopathy. [provided by RefSeq, Feb
        2017].
PRIMARY     REFSEQ_SPAN         PRIMARY_IDENTIFIER PRIMARY_SPAN        COMP
            1-35060             AC105339.9         88079-123138
            35061-35259         FJ695193.1         1-199               c
            35260-57628         AC105339.9         123337-145705

This is what I am aiming for:

    Summary: Adaptor protein complex 3 (AP-3 complex) is a
    heterotrimeric protein complex involved in the formation of
    clathrin-coated synaptic vesicles. The protein encoded by this gene
    represents the beta subunit of the neuron-specific AP-3 complex and
    was first identified as the target antigen in human paraneoplastic
    neurologic disorders. The encoded subunit binds clathrin and is
    phosphorylated by a casein kinase-like protein, which mediates
    synaptic vesicle coat assembly. Defects in this gene are a cause of
    early-onset epileptic encephalopathy. [provided by RefSeq, Feb
    2017].

Answer 1

I think this is a robust pattern that matches your paragraph (using the Multiline flag):

^\s+$\n^([ \t]+)Summary.*(?:\n\1[ \t]*\S.*)+

Working example: https://regex101.com/r/P6KlBa/2

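"Summary" may also appear as the first word of a line in the middle of a paragraph. We therefore start by matching a blank line (^\s+$\n), to make sure "Summary" is at the beginning of a paragraph.
([ \t]+) - captures the whitespace at the beginning of the line, so that \1 can require the same indentation on the following lines. Some flavors have \h for horizontal whitespace.
Summary.* - the first line, which starts with "Summary".
(?:\n\1[ \t]*\S.*)+ - matches one or more following non-empty lines that start with the same indentation.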
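
The pieces above can be tried together in a minimal Python sketch. The sample `text` is a shortened, hypothetical stand-in for the record in the question, and it assumes that the separator line before "Summary" contains spaces, because `^\s+$` needs at least one whitespace character on that line:

```python
import re

# The pattern from this answer, compiled with the multiline flag.
PARAGRAPH = re.compile(
    r"^\s+$\n^([ \t]+)Summary.*(?:\n\1[ \t]*\S.*)+",
    re.MULTILINE,
)

# Shortened, hypothetical stand-in for the record in the question.
# Assumption: the separator line before "Summary" contains spaces,
# because ^\s+$ needs at least one whitespace character on that line.
text = (
    "COMMENT     REVIEWED REFSEQ: curated by NCBI staff.\n"
    "            \n"
    "        Summary: Adaptor protein complex 3 is a\n"
    "        heterotrimeric protein complex. [provided by RefSeq, Feb\n"
    "        2017].\n"
    "PRIMARY     REFSEQ_SPAN\n"
)

match = PARAGRAPH.search(text)
indent = match.group(1)  # indentation captured from the "Summary" line

# The overall match starts at the separator line, so drop it and keep
# only the lines of the paragraph itself.
paragraph_lines = match.group(0).splitlines()[1:]
print(len(indent))
print("\n".join(paragraph_lines))
```

The match stops before the `PRIMARY` line because that line does not begin with the captured indentation. If the separator line is completely empty, the pattern as written finds nothing in Python; relaxing `^\s+$` to `^\s*$` is one possible adjustment, offered here as a suggestion and not as part of the tested regex101 example.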

Regular expression to select an entire paragraph by matching a word in its first line
