Recursive regex pattern

Time: 2017-11-14 11:07:06

Tags: python regex

I'm struggling with a regular expression. Here is the text I'm working on:

* [[February 1]] – ''[[Brave New World]]'', a novel by [[Aldous Huxley]], is first published.
* [[February 2]]
** A general [[World Disarmament Conference]] begins in [[Geneva]]. The principal issue at the conference is the demand made by Germany for ''gleichberechtigung'' ("equality of status" i.e. abolishing Part V of the Treaty of Versailles, which had disarmed Germany) and the French demand for ''sécurité'' ("security" i.e. maintaining Part V).
** The [[League of Nations]] again recommends negotiations between the [[Republic of China (1912–49)|Republic of China]] and Japan.
** The [[Reconstruction Finance Corporation]] begins operations in Washington, D.C.
* [[February 4]]
** The [[1932 Winter Olympics]] open in [[Lake Placid, New York]].
** Japan occupies [[Harbin]], China.
* [[February 9]] – [[Junnosuke Inoue]], prominent Japanese businessman, banker and former governor of the Bank of Japan is assassinated by right-wing extremist group the League of Blood in the [[League of Blood Incident]].
* [[February 11]] – [[Pope Pius XI]] meets [[Benito Mussolini]] in [[Vatican City]].

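I would like a regex that matches every line starting with `*` that is followed by any number of lines starting with `**`. Ideally, each `**` line would end up in its own group.

Here is the result I want: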

> Match 1:
>> Group 1: "* [[February 2]]"

>> Group 2: "** A general [...] Part V)."

>> Group 3: "** The [[League of Nations]] [...] and Japan."

>> Group 4: "** The [[Reconstruction Finance Corporation]] begins operations in Washington, D.C."

> Match 2: 
>> Group 1: "* [[February 4]]"

>> Group 2: "** The [[1932 Winter Olympics]] open in [[Lake Placid, New York]]."

>> Group 3: "** Japan occupies [[Harbin]], China."

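(I used [...] to shorten the text.)

Here is the pattern I came up with: `/(*ANY)^\*{1} (.*)\n(?>(^\*{2}(.*?)\n)+)/gm`, and here is the regex101 link where I test my regex: https://regex101.com/r/ubtnMg/1

Here is an explanation of my pattern:

* `(*ANY)` matches any newline sequence, because I'm not sure which newline character the text uses.
* `^\*{1} (.*)\n` matches any line starting with `*` and captures the text of that line up to the newline.
* `(?>(^\*{2}(.*?)\n)+)` is the tricky part. It should match every line starting with `**` that follows the line matched by `^\*{1} (.*)\n`, capturing each line's text up to the end of the line in a group, until a new line starting with a single `*` is found.

What it actually gives me is this: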

> Match 1: "* [[February 2]]
** A general [[World Disarmament Conference]] begins in [[Geneva]]. The principal issue at the conference is the demand made by Germany for ''gleichberechtigung'' ("equality of status" i.e. abolishing Part V of the Treaty of Versailles, which had disarmed Germany) and the French demand for ''sécurité'' ("security" i.e. maintaining Part V).
** The [[League of Nations]] again recommends negotiations between the [[Republic of China (1912–49)|Republic of China]] and Japan.
** The [[Reconstruction Finance Corporation]] begins operations in Washington, D.C."
>> Group 1: "[[February 2]]"

>> Group 2: "** The [[Reconstruction Finance Corporation]] begins operations in Washington, D.C."

>> Group 3: "The [[Reconstruction Finance Corporation]] begins operations in Washington, D.C."

> Match 2: "* [[February 4]]
** The [[1932 Winter Olympics]] open in [[Lake Placid, New York]].
** Japan occupies [[Harbin]], China."
>> Group 1: "[[February 4]]"

>> Group 2: "** Japan occupies [[Harbin]], China"

>> Group 3: " Japan occupies [[Harbin]], China."
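
The output above is consistent with how a capturing group behaves under a quantifier: each repetition overwrites the previous capture, so only the last `**` line survives in the group. Below is a minimal Python sketch of that behaviour. It is an illustration, not the pattern from the question: the `(*ANY)` verb is left out because Python's `re` module does not accept it, the atomic group is replaced by a plain group, and the sample is shortened to one block.

```python
import re

# Shortened sample: one "*" line followed by two "**" lines.
sample = (
    "* [[February 4]]\n"
    "** The [[1932 Winter Olympics]] open in [[Lake Placid, New York]].\n"
    "** Japan occupies [[Harbin]], China.\n"
)

# Simplified version of the pattern (no (*ANY), no atomic group).
pattern = re.compile(r"^\* (.*)\n(^\*\*(.*?)\n)+", re.MULTILINE)
match = pattern.search(sample)

# Groups 2 and 3 sit inside the "+" repetition, so they only keep
# what the final repetition matched.
print(repr(match.group(1)))
print(repr(match.group(2)))
print(repr(match.group(3)))
```

The whole match still spans all three lines. Only the numbered groups inside the repetition are limited to the last iteration.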

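I hope I've been clear enough for you to help me. Don't hesitate to ask for more details.

1 answer:

Answer 0 (score: 0)

Thanks to Rawing's comment, I found this solution:

First, I use this pattern: `/(*ANY)^\*{1} (.*)\n(^\*{2}(.*?)\n)+/gm` to match each block of text, like this one: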

* [[February 2]]
** A general [[World Disarmament Conference]] begins in [[Geneva]]. The principal issue at the conference is the demand made by Germany for ''gleichberechtigung'' ("equality of status" i.e. abolishing Part V of the Treaty of Versailles, which had disarmed Germany) and the French demand for ''sécurité'' ("security" i.e. maintaining Part V).
** The [[League of Nations]] again recommends negotiations between the [[Republic of China (1912–49)|Republic of China]] and Japan.
** The [[Reconstruction Finance Corporation]] begins operations in Washington, D.C.

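Then, within each block, I use this pattern to get the line starting with `*`: `/^\*{1}(.*)/g`. I also use this pattern to get every line starting with `**`: `/^\*{2}(.*)$/gm`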
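
Since the question is tagged python, here is a rough sketch of the same two-step idea using Python's `re` module. Several things are assumptions and not part of the original answer: the helper name `parse_blocks` is hypothetical, the sample text is abbreviated, newline normalisation stands in for `(*ANY)` (which `re` does not support), and the patterns require a space after the asterisks so the captured text has no leading space.

```python
import re

# Abbreviated sample of the text from the question.
text = """\
* [[February 2]]
** A general [[World Disarmament Conference]] begins in [[Geneva]].
** The [[League of Nations]] again recommends negotiations.
** The [[Reconstruction Finance Corporation]] begins operations in Washington, D.C.
* [[February 4]]
** The [[1932 Winter Olympics]] open in [[Lake Placid, New York]].
** Japan occupies [[Harbin]], China.
* [[February 11]] – [[Pope Pius XI]] meets [[Benito Mussolini]] in [[Vatican City]].
"""

# Step 1: a "* " line followed by one or more "** " lines.
# Group 1 is the heading text, group 2 is the run of "** " lines.
BLOCK = re.compile(r"^\* (.*)\n((?:^\*\* .*(?:\n|\Z))+)", re.MULTILINE)

# Step 2: every "** " line inside one block.
SUBLINE = re.compile(r"^\*\* (.*)$", re.MULTILINE)


def parse_blocks(source):
    """Return a list of (heading, [sub-lines]) tuples."""
    # Normalise newlines first, in place of the (*ANY) verb.
    source = source.replace("\r\n", "\n").replace("\r", "\n")
    results = []
    for block in BLOCK.finditer(source):
        results.append((block.group(1), SUBLINE.findall(block.group(2))))
    return results


for heading, sublines in parse_blocks(text):
    print(heading)
    for line in sublines:
        print("   ", line)
```

Entries with no `**` lines, such as the February 11 line, are skipped because the block pattern requires at least one sub-line.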