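What regular expression matches this pattern?

Time: 2019-03-07 01:35:15

Tags: python regex

I want to match the following text. The pattern is an item that starts on a new line with a number (e.g. 2.1), followed by one or more such items. Some items may span multiple lines, e.g. 2.1. I want to match a whole block of such items.

The pattern is:

(a new line starting with a number like 2.1, possibly followed by one or more lines that do not start with such a number), then one or more such patterns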

2.1 [ii] Agreement and Plan of Reorganization, by and among the Company,
Force Acq. Corp. and Force Computers, Inc. as amended.
3.1 [viii] Articles of Incorporation of Company, as amended.
3.2 [viii] Bylaws of Company.
10.1 [I] Preferred Stock Purchase Agreement dated September 29, 1983,
together with amendments thereto dated February 28, 1984 and
10.2 [I] Form of Indemnification Agreement between Company and its
officers, directors and certain other key employees.
10.3 [I] Amendment to form of Indemnification Agreement.
10.4 [iv] 1983 Incentive Stock Option Plan, as amended August 13, 1991.
10.5 [vi] 1988 Employee Stock Purchase Plan, as amended October 1992.
10.6 [v] Amended and Restated 1992 Stock Option Plan.

Here is my regular expression:

pattern = r"(?:\n\d{1,2}\.\d{1,2}.{1,200}){2,}\n"

text = re.sub(pattern,"", text, re.S)

It does not work yet. DOTALL did not help. Thanks!
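A side observation on the `re.sub` call above, not something either answer states: the fourth positional parameter of `re.sub` is `count`, not `flags`, so passing `re.S` in that position does not enable DOTALL. The sketch below uses a made-up three-line string to show the difference; `count=int(re.S)` stands in for the positional call.

```python
import re

sample = "a\nb\nc"  # illustrative input, not from the question

# re.S has the integer value 16. Passed as the 4th positional argument it is
# read as count=16 ("replace at most 16 times"), so DOTALL stays off.
as_count = re.sub(r"a.b", "X", sample, count=int(re.S))

# Passed by keyword, the flag takes effect and "." can match "\n".
as_flag = re.sub(r"a.b", "X", sample, flags=re.S)

print(repr(as_count))  # 'a\nb\nc'  (no match, text unchanged)
print(repr(as_flag))   # 'X\nc'
```

Writing `flags=re.S` explicitly avoids this mix-up.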

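As an intermediate step, how can I match a line that does not start with \d{1,2}\.\d{1,2}? Negative lookbehind does not work with variable-length patterns.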
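One way to approach this intermediate step is a negative lookahead anchored at the start of each line, which avoids the lookbehind length restriction. This is a minimal sketch on a shortened copy of the sample text; the name `continuation` is only illustrative.

```python
import re

sample = """2.1 [ii] Agreement and Plan of Reorganization, by and among the Company,
Force Acq. Corp. and Force Computers, Inc. as amended.
3.1 [viii] Articles of Incorporation of Company, as amended."""

# With re.M, ^ and $ anchor at every line boundary.
# (?!...) rejects any line that begins with a number such as 2.1.
continuation = re.compile(r"^(?!\d{1,2}\.\d{1,2}).*$", re.M)

lines = continuation.findall(sample)
print(lines)
# ['Force Acq. Corp. and Force Computers, Inc. as amended.']
```

Note that this pattern also matches blank lines, since an empty line does not start with a number either.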

Here is some sample text:

2.01 Acquisition Agreement dated as of March 26, 1997 by and between
registrant and ISAR-Vermogensverwaltung Gbr mbH ("ISAR")(1)

3.01 Registrant's Amended and Restated Articles of Incorporation, as
amended(2)

3.02 Registrant's Certificate of Amendment of Articles of
Incorporation filed prior to the closing of registrant's initial
public offering(2)

3.03 Registrant's Amended and Restated Articles of Incorporation
filed following the closing of registrant's initial public
offering(2)

3.04 Registrant's Bylaws(2)
3.05 Registrant's Amended and Restated Bylaws adopted prior to the
closing of registrant's initial public offering(2)
3.06 Certificate of Amendment of Amended and Restated Articles of
Versant Object Technology Corporation(7)

3.07 Registrant's Certificate of Determination dated July 12, 1999,
incorporated by reference to the Company's current report on
Form 8-K (Exhibit 3.01) filed July 12, 1999.

4.01 [intentionally omitted]
4.02 Preferred Stock Purchase Agreement, dated as of April 27, 1994,
as amended(2)

10.01 Registrant's 1989 Stock Option Plan, as amended, and related
documents(2)**

10.02 Registrant's 1996 Equity Incentive Plan, as amended, and related
documents(3)**

10.03 Registrant's 1996 Directors Stock Option Plan, as amended, and
related documents(4)**

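The distinguishing features are:
(1) they start with a number such as 2.01 or 10.03;
(2) many of them (at least 2) are clustered together.

The irregularities are:
(1) some span multiple lines, e.g. 2.01, and some fit on one line, e.g. 2.04;
(2) there may or may not be blank lines between them: there is one between 2.01 and 3.01, but none between 3.04 and 3.05.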

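I want to match the whole block of such text and remove it. The rest of the text is ordinary sentences. Some of them may start with a number, e.g. 2.1 for a heading, but they are not clustered together as described above.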
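The requirement described above could be sketched as follows. This is one possible approach under stated assumptions, not code from either answer: an item is a numbered line, plus any non-blank lines that do not start with a number, plus any blank lines after it, and a cluster is two or more consecutive items. The names `NUM`, `ITEM`, `CLUSTER` and `remove_clusters` are hypothetical.

```python
import re

# Assumed shape of an item number at the start of a line, e.g. "2.01" or "10.3".
NUM = r"[ \t]*\d{1,2}\.\d{1,2}\b"

ITEM = (
    r"^" + NUM + r".*\n?"                    # numbered first line
    r"(?:^(?!" + NUM + r")(?=.*\S).*\n?)*"   # non-blank lines without a number
    r"(?:^[ \t]*\n)*"                        # blank lines after the item
)
# A cluster is at least two items in a row; re.M makes ^ match at each line.
CLUSTER = re.compile(r"(?:" + ITEM + r"){2,}", re.M)


def remove_clusters(text: str) -> str:
    """Delete every run of two or more numbered items."""
    return CLUSTER.sub("", text)


text = """Some ordinary sentence before the list.

2.01 Acquisition Agreement dated as of March 26, 1997 by and between
registrant and ISAR-Vermogensverwaltung Gbr mbH ("ISAR")(1)

3.04 Registrant's Bylaws(2)
3.05 Registrant's Amended and Restated Bylaws adopted prior to the
closing of registrant's initial public offering(2)

Another ordinary sentence after the list.

2.1 A lone heading that should stay
followed by ordinary prose.
"""

cleaned = remove_clusters(text)
print(cleaned)
```

A caveat of this sketch: prose that follows the last item with no blank line in between is treated as a continuation of that item, and two numbered headings separated only by their own paragraphs would also count as a cluster. Whether that is acceptable depends on the real documents.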

2 Answers:

Answer 0: (score: 1)

If you want to capture each component, you can use a group for each one. Check here
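As an illustration of the grouping idea, here is a hedged sketch, not the answer's original code. It assumes each item's first line looks like the first sample, with a number, a bracketed reference and a title. The group names `number`, `ref` and `title` are invented for this example.

```python
import re

line = "10.4 [iv] 1983 Incentive Stock Option Plan, as amended August 13, 1991."

# One named group per component of the line.
item = re.compile(
    r"^(?P<number>\d{1,2}\.\d{1,2})\s+"   # item number, e.g. 10.4
    r"\[(?P<ref>[^\]]+)\]\s+"             # bracketed reference, e.g. [iv]
    r"(?P<title>.+)$"                     # rest of the line
)

m = item.match(line)
print(m.group("number"))  # 10.4
print(m.group("ref"))     # iv
print(m.group("title"))   # 1983 Incentive Stock Option Plan, as amended August 13, 1991.
```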


Answer 1: (score: 1)

If you just want each paragraph as one item, I suggest the following:

import re
text = """ 2.1 [ii] Agreement and Plan of Reorganization, by and among the Company,
Force Acq. Corp. and Force Computers, Inc. as amended.
3.1 [viii] Articles of Incorporation of Company, as amended.
3.2 [viii] Bylaws of Company.
10.1 [I] Preferred Stock Purchase Agreement dated September 29, 1983,
together with amendments thereto dated February 28, 1984 and
10.2 [I] Form of Indemnification Agreement between Company and its
officers, directors and certain other key employees.
10.3 [I] Amendment to form of Indemnification Agreement.
10.4 [iv] 1983 Incentive Stock Option Plan, as amended August 13, 1991.
10.5 [vi] 1988 Employee Stock Purchase Plan, as amended October 1992.
10.6 [v] Amended and Restated 1992 Stock Option Plan."""

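# The lazy .*? stops at the next numbered item (or at the end of the string);
# re.S lets "." cross line breaks so multi-line items stay together.
paragraphs = re.findall(r"\d{1,2}\.\d+.*?(?=\d{1,2}\.\d+|$)", text, re.S)

for paragraph in paragraphs:
    print(paragraph)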

This produces:

2.1 [ii] Agreement and Plan of Reorganization, by and among the Company,
Force Acq. Corp. and Force Computers, Inc. as amended.

3.1 [viii] Articles of Incorporation of Company, as amended.

3.2 [viii] Bylaws of Company.

10.1 [I] Preferred Stock Purchase Agreement dated September 29, 1983,
together with amendments thereto dated February 28, 1984 and

10.2 [I] Form of Indemnification Agreement between Company and its
officers, directors and certain other key employees.

10.3 [I] Amendment to form of Indemnification Agreement.

10.4 [iv] 1983 Incentive Stock Option Plan, as amended August 13, 1991.

10.5 [vi] 1988 Employee Stock Purchase Plan, as amended October 1992.

10.6 [v] Amended and Restated 1992 Stock Option Plan.

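The key is the ? after .*, which makes the match lazy. This means the regex matches as much as it must to satisfy the pattern, but not as much as it could. If you leave the ? out, it matches the rest of the string.

(?=...) is a lookahead: it lets you leave that part of the regex out of the result, so everything up to the next paragraph is matched. I hope this helps.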
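To make the lazy versus greedy difference concrete, the sketch below runs both variants on a short made-up string. It also shows a line-anchored variant, which is an addition and not part of the answer: because the lookahead in the answer is not tied to the start of a line, a number inside an item (such as "Exhibit 3.01" in the question's second sample) would split that item.

```python
import re

sample = "2.1 first item\n3.1 second item\n3.2 third item"

# Lazy: each match stops just before the next number.
lazy = re.findall(r"\d{1,2}\.\d+.*?(?=\d{1,2}\.\d+|$)", sample, re.S)
# Greedy: .* runs to the end of the string, giving a single match.
greedy = re.findall(r"\d{1,2}\.\d+.*(?=\d{1,2}\.\d+|$)", sample, re.S)

print(len(lazy), len(greedy))  # 3 1

# An item whose continuation line mentions another number mid-line.
tricky = "3.07 Determination, see\nForm 8-K (Exhibit 3.01) filed.\n4.01 [intentionally omitted]"

unanchored = re.findall(r"\d{1,2}\.\d+.*?(?=\d{1,2}\.\d+|$)", tricky, re.S)
# With re.M, ^ only matches at line starts, so "3.01" mid-line no longer splits.
anchored = re.findall(r"^\d{1,2}\.\d+.*?(?=^\d{1,2}\.\d+|\Z)", tricky, re.S | re.M)

print(len(unanchored), len(anchored))  # 3 2
```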