Python - How do I extract the sentences that contain a citation marker?

Time: 2017-08-13 14:10:39

Tags: python regex text-segmentation citations

import re
text = "Trondheim is a small city with a university and 140000 inhabitants. Its central bus systems has 42 bus lines, serving 590 stations, with 1900 (departures per) day in average. T h a t gives approximately 60000 scheduled bus station passings per day, which is somehow represented in the route data base. The starting point is to automate the function (Garry Weber, 2005) of a route information agent."
print(re.findall(r"([^.]*?\(.+ [0-9]+\)[^.]*\.)", text))

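I am using the code above to extract the sentences that contain a citation. As you can see, the last sentence contains the citation (Garry Weber, 2005).

But I get this result: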

[' Its central bus systems has 42 bus lines, serving 590 stations, with 1900 (departures per) day in average. T h a t gives approximately 60000 scheduled bus station passings per day, which is somehow represented in the route data base. The starting point is to automate the function (Garry Weber, 2005) of a route information agent.']

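The result should be only the sentence that contains the citation, like this:
The starting point is to automate the function (Garry Weber, 2005) of a route information agent.

I guess the problem is caused by other text in parentheses: the second sentence contains (departures per), and the match starts there. Is there a way to fix my code?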

1 answer:

Answer 0: (score: 2)

My attempt:

\b[^.]+\([^()]+\b(\d{2}|\d{4})\s*\)[^.]*\.

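It captures exactly the sentence that contains the citation, and it is more specific about the year: [^()]+ cannot cross a parenthesis, so the match stays inside a single parenthesized group, and (\d{2}|\d{4}) requires that group to end in a two- or four-digit number. The pattern in the question fails because its greedy .+ runs from the first opening parenthesis, in (departures per), all the way to the last closing parenthesis that is preceded by a number.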
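A minimal sketch of how this pattern might be used from Python 3. The function name find_citation_sentences and the constant CITATION_SENTENCE are made up for illustration. One deliberate change from the pattern above: the year group is written as non-capturing, (?:...), because re.findall returns only the captured group when a pattern contains one, which here would be the year rather than the sentence.

```python
import re

# A "sentence" is approximated as a run of non-period characters ending in a period.
# The parenthesized part may not contain other parentheses and must end in a
# two- or four-digit number, which is assumed to be the year of the citation.
# (?:...) is non-capturing, so findall returns the whole match, not just the year.
CITATION_SENTENCE = re.compile(r"\b[^.]+\([^()]+\b(?:\d{4}|\d{2})\s*\)[^.]*\.")


def find_citation_sentences(text):
    """Return the sentences in `text` that contain a citation such as (Name, 2005)."""
    return CITATION_SENTENCE.findall(text)


text = (
    "Trondheim is a small city with a university and 140000 inhabitants. "
    "Its central bus systems has 42 bus lines, serving 590 stations, "
    "with 1900 (departures per) day in average. "
    "T h a t gives approximately 60000 scheduled bus station passings per day, "
    "which is somehow represented in the route data base. "
    "The starting point is to automate the function (Garry Weber, 2005) "
    "of a route information agent."
)

print(find_citation_sentences(text))
```

On the sample text this prints a list with a single element, the last sentence; (departures per) is skipped because it contains no number. Two limits are worth keeping in mind: any period ends a "sentence" here, so a citation like (Weber et al., 2005) or a decimal number would cut the match short, and a parenthetical that merely ends in a number, such as (see Table 12), would be treated as a citation.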