Splitting text into sentences when the period is followed by special markup such as '.{{'

Date: 2019-05-09 20:27:42

Tags: python regex

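I am parsing some information from Wikipedia, and the text in the dump includes special annotations for links and images of the form {{content}} and [[content]]. I want to split the text into sentences, but I run into a problem when the period is followed not by a space but by one of those symbols.

So, in general, the split has to happen wherever '. ', '.{{', or '.[[' occurs.

Example: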

import re

prueba = 'Anarchism does not offer a fixed body of doctrine from a single particular worldview.{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].'

sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', prueba)

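Here is the text again, pasted for readability:

Anarchism does not offer a fixed body of doctrine from a single particular worldview.{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].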

The output of this code is a list with a single item that contains the whole text:

['Anarchism does not offer a fixed body of doctrine from a single particular worldview.{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].']

But I need to get a list with the following three items:

['Anarchism does not offer a fixed body of doctrine from a single particular worldview.', '{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.', '[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].']

How can I fix my regex? I have tried different solutions but have not gotten the desired result.

Thanks.

1 answer:

Answer 0: (score: 0)

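Since it looks like you want to keep the delimiters, you probably want re.findall(). See the answer at https://stackoverflow.com/a/44244698/11199887, which is copied below and then adapted to your case. With re.findall(), you don't have to worry about the difference between '. ', '.{{', and '.[['.
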
import re

s = """You! Are you Tom? I am Danny."""
re.findall(r'.*?[.!?]', s)
# ['You!', ' Are you Tom?', ' I am Danny.']

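In the example above, you capture not only periods but also the question marks and exclamation marks that end sentences. There are probably not many sentences on Wikipedia that end with an exclamation mark or a question mark, but I didn't really spend time looking for examples.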
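One caveat with this pattern: it only returns text that ends in a terminator, so a trailing fragment without one is silently dropped, and every match after the first keeps its leading space. Below is a minimal sketch of a variant that keeps the tail and trims whitespace. The helper name `find_sentences` is my own and is not part of the linked answer.

```python
import re

def find_sentences(text):
    """Return sentence-like chunks, keeping a trailing chunk with no terminator."""
    # A run of non-terminator characters, then any terminators that close it.
    chunks = re.findall(r'[^.!?]+[.!?]*', text)
    # Trim surrounding whitespace and drop chunks that were only whitespace.
    return [c.strip() for c in chunks if c.strip()]

s = "You! Are you Tom? I am Danny"   # note: no final period
print(re.findall(r'.*?[.!?]', s))    # the unterminated last chunk is missing
print(find_sentences(s))             # the last chunk is kept
```

Whether the unterminated tail should be kept depends on your data; for well-formed article text it may never come up.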

For your case, it would look like this:

prueba = 'Anarchism does not offer a fixed body of doctrine from a single particular worldview.{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].'

sentences = re.findall(r'.*?[.!?]', prueba)

Or, if you really only want to split on periods:

sentences = re.findall(r'.*?[.]', prueba)

The output of print(sentences) is:

['Anarchism does not offer a fixed body of doctrine from a single particular worldview.',
 '{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.',
 '[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].']
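
As for why the original re.split() call returns a single item: its pattern only matches whitespace that comes directly after a period or question mark, and in the sample every space comes after '}}' or ']]'. If you would rather stay with re.split(), here is a sketch that should work on Python 3.7 or later, where re.split() can split on zero-width matches. It keeps the two lookbehinds from the question and splits either on whitespace or on the empty position right before '{{' or '[['. The names `SENTENCE_BOUNDARY` and `split_sentences` and the shortened sample string are mine, for illustration only.

```python
import re

SENTENCE_BOUNDARY = re.compile(
    r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)'  # lookbehinds from the question (abbreviations)
    r'(?<=[.?])'                      # a boundary must follow '.' or '?'
    r'(?:\s+|(?=\{\{|\[\[))'          # whitespace, or nothing if '{{' or '[[' follows
)

def split_sentences(text):
    """Split on sentence boundaries; requires Python 3.7+ for empty-match splits."""
    return SENTENCE_BOUNDARY.split(text)

# Shortened stand-in for the Wikipedia sample in the question.
sample = 'First sentence.{{sfn|A|1993}} Second sentence.[[B|2007]] [[Third]] one.'
print(split_sentences(sample))
```

When the boundary is whitespace it is consumed by the split, so the resulting sentences have no leading space, unlike the re.findall() results.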