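I'm parsing some information from Wikipedia. The text in the dump includes special annotations for links and images in the form {{content}} or [[content]]. I want to split the text into sentences, but a problem arises when the period is followed not by a space but by one of those symbols.
So, in general, the split must happen wherever '. ', '.{{' or '.[[' occurs.
Example: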
import re
prueba = 'Anarchism does not offer a fixed body of doctrine from a single particular worldview.{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].'
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', prueba)
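Here is the text again for readability:
Anarchism does not offer a fixed body of doctrine from a single particular worldview.{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].
The output of this code is a list with a single item containing the entire text: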
['Anarchism does not offer a fixed body of doctrine from a single particular worldview.{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].']
But I need to get a list with the following three items:
['Anarchism does not offer a fixed body of doctrine from a single particular worldview.', '{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.', '[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].']
How can I fix my regex? I have tried different solutions but have not gotten the desired result.
Thanks.
Answer 0 (score: 0)
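Since it seems you want to keep the delimiters, you probably want re.findall(). See the answer at https://stackoverflow.com/a/44244698/11199887, reproduced below and then adapted to your case. With re.findall(), you don't have to worry about handling '.{{' separately from '. ' and '.[['.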
import re
s = """You! Are you Tom? I am Danny."""
re.findall(r'.*?[.!?]', s)
# ['You!', ' Are you Tom?', ' I am Danny.']
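In the example above, you capture not only periods but also the question marks and exclamation marks that end sentences. There are probably not many sentences on Wikipedia that end with an exclamation mark or a question mark, but I have not really spent time looking for examples.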
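One caveat with this pattern: it only returns text that ends in a terminator, so a trailing fragment without final punctuation is silently dropped, and every match after the first keeps its leading space. Below is a minimal sketch that handles both, assuming single-line input. The helper name split_keep_tail is made up for illustration and is not part of the linked answer.

```python
import re

def split_keep_tail(text):
    """Return sentences ending in . ! or ?, plus any unterminated tail."""
    # First alternative: shortest run of characters ending in a terminator.
    # Second alternative: whatever is left over at the end of the string.
    parts = re.findall(r'.*?[.!?]|.+$', text)
    # Strip the separating whitespace and drop pieces that were only whitespace.
    return [p.strip() for p in parts if p.strip()]

print(split_keep_tail("You! Are you Tom? I am Danny"))
# ['You!', 'Are you Tom?', 'I am Danny']
```

Without the second alternative, the last piece ("I am Danny") would not appear in the result at all.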
For your case, it would look like this:
prueba = 'Anarchism does not offer a fixed body of doctrine from a single particular worldview.{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].'
sentences = re.findall(r'.*?[.!?]', prueba)
Or, if you really only want to split on periods:
sentences = re.findall(r'.*?[.]', prueba)
print(sentences)
The output is:
['Anarchism does not offer a fixed body of doctrine from a single particular worldview.',
'{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.',
'[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].']
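If you would rather keep re.split() and the lookbehinds from the question, one option is to let the pattern also split on an empty match right before {{ or [[. This is only a sketch: it assumes Python 3.7 or later (earlier versions do not split on empty matches), and the names SENTENCE_BOUNDARY and split_sentences are made up for illustration.

```python
import re

# Lookbehinds copied from the question: they are meant to skip abbreviations
# such as "e.g." or "Mr." and require a '.' or '?' right before the split point.
SENTENCE_BOUNDARY = re.compile(
    r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)'
    r'(?:\s|(?=\{\{|\[\[))'  # one whitespace char, or an empty match before {{ or [[
)

def split_sentences(text):
    """Split after '.' or '?' when followed by whitespace, '{{' or '[['."""
    return SENTENCE_BOUNDARY.split(text)

prueba = (
    'Anarchism does not offer a fixed body of doctrine from a single '
    'particular worldview.{{sfn|Marshall|1993|pp=14–17}} Many types and '
    'traditions of anarchism exist, not all of which are mutually '
    'exclusive.[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can '
    'differ fundamentally, supporting anything from extreme [[individualism]] '
    'to complete [[collectivism]].'
)

sentences = split_sentences(prueba)
print(len(sentences))  # 3
```

The whitespace branch consumes the space it splits on, while the lookahead branch consumes nothing, so the {{...}} and [[...]] markers stay at the start of the following sentence, as in the desired output. Note that a period inside a template or link would still trigger a split if it is followed by whitespace.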