Extracting a paragraph from an article | regular expressions

Date: 2016-12-09 13:23:28

Tags: regex python-3.x

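I scraped several articles about terrorist attacks. From each article I want to extract one specific paragraph.

Here is a sample from one of the articles: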

By   DAVID D. KIRKPATRICK    MARCH 18, 2015 
Scenes from Tunisian state television showed confusion outside an art museum and Parliament on Wednesday after gunmen attacked.
CAIRO — Gunmen in military uniforms killed 19 people on Wednesday in a
midday attack on a museum in downtown Tunis, dealing a new blow to the tourist industry 
that is vital to  Tunisia  as it struggles to consolidate the only transition to democracy 
after the Arab Spring revolts. 
Tunisian officials had initially said that the attackers took 10
hostages and killed nine people, including seven foreign visitors and two Tunisians.

What I want to extract for further analysis is, in this example, the text from "CAIRO - " up to the first full stop.

I came up with this regular expression:

([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s

With this regular expression I only extract the start of the paragraph, not the rest of it.

2 answers:

Answer 0: (score: 2)

Use a non-greedy quantifier:

(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+?\.\s)

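The ? after the + (or *) makes the quantifier non-greedy. This means it matches as few characters as possible, instead of the default behavior of matching as many as possible.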
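To see the difference, here is a minimal Python sketch that runs both patterns on an abridged copy of the sample article. The whitespace of the sample has been normalised, and the variable names are illustrative only, so treat this as a check under those assumptions rather than a definitive test.

```python
import re

# Abridged copy of the sample article from the question (whitespace normalised).
text = (
    "CAIRO — Gunmen in military uniforms killed 19 people on Wednesday in a\n"
    "midday attack on a museum in downtown Tunis, dealing a new blow to the tourist industry\n"
    "that is vital to Tunisia as it struggles to consolidate the only transition to democracy\n"
    "after the Arab Spring revolts. \n"
    "Tunisian officials had initially said that the attackers took 10\n"
    "hostages and killed nine people, including seven foreign visitors and two Tunisians.\n"
)

# Pattern from the question: [\s\S]+ is greedy and runs to the LAST ". ".
greedy = re.compile(r"([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s")
# Same pattern with +? so that it stops at the FIRST full stop followed by whitespace.
lazy = re.compile(r"([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+?\.\s")

greedy_text = greedy.search(text).group(0)
lazy_text = lazy.search(text).group(0)

print(repr(greedy_text[-20:]))  # tail of the greedy match
print(repr(lazy_text[-20:]))    # tail of the lazy match
```

On this input the greedy match runs through the second sentence, while the lazy match ends at the full stop after "revolts". Note that the lazy match stops at any full stop followed by whitespace, so an abbreviation inside the first sentence would cut it short.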

Answer 1: (score: 0)

EDIT1:

Try the following regex:

([A-Z]+\w+\s*—\s*.*?\.)
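One caveat that may be worth checking: in Python, . does not match a newline unless re.DOTALL is passed, and the first sentence of the sample spans several lines. The sketch below uses a shortened, line-wrapped stand-in for the sample (an assumption about how the scraped text is laid out) and compares the pattern with and without that flag.

```python
import re

# Shortened stand-in for the sample article; the first sentence wraps over two lines.
text = (
    "CAIRO — Gunmen in military uniforms killed 19 people on Wednesday in a\n"
    "midday attack on a museum in downtown Tunis. \n"
    "Tunisian officials had initially said that the attackers took 10 hostages.\n"
)

pattern = r"([A-Z]+\w+\s*—\s*.*?\.)"

# Without DOTALL, ".*?" cannot cross the line break before reaching a full stop.
no_flag = re.search(pattern, text)

# With DOTALL, "." also matches "\n", so the lazy ".*?" reaches the first full stop.
with_dotall = re.search(pattern, text, flags=re.DOTALL)

print(no_flag)
if with_dotall:
    print(with_dotall.group(1))
```

If the scraped paragraphs contain no line breaks inside a sentence, the flag should not be needed.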

It is a matter of grouping, although the pattern does match the text you want.

Try the following regex (the whole expression is wrapped in parentheses):

(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s)

Group 1 contains the required string/text.
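The grouping point can be illustrated with a short Python sketch. It assumes the question used re.findall (the question does not say which function was called), and it uses the non-greedy variant of the pattern on a shortened stand-in for the sample. When a pattern has exactly one capture group, re.findall returns only that group's text, which would explain seeing just the start of the paragraph.

```python
import re

# Shortened stand-in for the sample article from the question.
text = (
    "CAIRO — Gunmen in military uniforms killed 19 people on Wednesday in a\n"
    "midday attack on a museum in downtown Tunis. \n"
    "Tunisian officials had initially said that the attackers took 10 hostages.\n"
)

bare = r"([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+?\.\s"       # one capture group
wrapped = r"(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+?\.\s)"   # outer group added around everything

# With a single group, findall returns the group's text, not the whole match.
findall_bare = re.findall(bare, text)

# With the outer parentheses, group 1 is the whole sentence and group 2 the dateline.
match = re.search(wrapped, text)
sentence = match.group(1).strip()
dateline = match.group(2)

# group(0) always holds the whole match, so the extra parentheses are optional here.
whole = re.search(bare, text).group(0).strip()

print(findall_bare)
print(sentence)
```

Using re.search (or re.finditer) with group(0) avoids depending on how the groups are numbered.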
