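Conditionally remove the periods in abbreviations, but not at the end of a sentence, with Python regex

Time: 2019-07-11 02:36:16

Tags: python regex

I have some sentences that contain abbreviations. The goal is to remove the . when it appears inside an abbreviation (e.g. "U.S."), but not when it is a period marking the normal end of a sentence. Specifically, the following test docs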

docs = ['U.S.S.R. line-continued', 'u.s.s.r. line-continued', 'U.S.S.R. Title Case', 'end-of-sentence. New-sentence', 'end-of-sentence.']

should be converted to

['USSR line-continued', 'ussr line-continued', 'USSR Title Case', 'end-of-sentence. New-sentence', 'end-of-sentence.']

I am trying

[re.sub(r"((\w)\.){2,}", r"\1", doc) for doc in docs]

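to keep the character whenever the "character followed by a period" pattern occurs more than once. But that does not work.

The following does work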

[re.sub(r"(\w)\.(\w)\.(\w)?\.?(\w)?\.?", r"\1\2\3\4", doc) for doc in docs]

, but it does not generalize if I have five or more characters with dots.
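As a side note on why the first attempt, r"((\w)\.){2,}" replaced with r"\1", does not behave as hoped: in Python's re module a capturing group that sits under a quantifier keeps only its last repetition, so \1 holds just the final letter-and-dot pair. The sketch below illustrates that on a shortened copy of the test docs; the names m and first_attempt are only illustrative.

```python
import re

docs = ['U.S.S.R. line-continued', 'end-of-sentence. New-sentence']

# The whole abbreviation is matched, but each group only remembers
# the text from its final repetition.
m = re.search(r"((\w)\.){2,}", docs[0])
print(m.group(0))  # U.S.S.R.
print(m.group(1))  # R.
print(m.group(2))  # R

# So the substitution replaces "U.S.S.R." with "R." instead of "USSR".
first_attempt = [re.sub(r"((\w)\.){2,}", r"\1", doc) for doc in docs]
print(first_attempt)  # ['R. line-continued', 'end-of-sentence. New-sentence']
```

Because the earlier letters are never available as separate groups, a single backreference cannot rebuild the abbreviation; the dots have to be removed one pair at a time or through a replacement function.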

2 Answers:

Answer 0: (score: 2)

I have a simpler approach. Use this regex

import re
docs = ['U.S.S.R. line-continued', 'u.s.s.r. line-continued', 'U.S.S.R. Title Case', 'end-of-sentence. New-sentence', 'end-of-sentence.']
print ([re.sub(r"(?<!\w)([A-Za-z])\.", r"\1", doc) for doc in docs])

Output:

['USSR line-continued', 'ussr line-continued', 'USSR Title Case', 'end-of-sentence. New-sentence', 'end-of-sentence.']
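To spell out how this pattern appears to work: (?<!\w) is a negative lookbehind, so a letter-plus-dot is only rewritten when the letter is not preceded by another word character. That is why the "e." at the end of "end-of-sentence." is left alone, and why the number of letters in the abbreviation does not matter. Below is a minimal sketch; the helper name strip_abbreviation_dots and the extra sample strings are assumptions added for illustration, not part of the original answer.

```python
import re

# Same pattern as above: a single letter that does not follow a word
# character, then a literal dot. Only the letter is kept.
pattern = re.compile(r"(?<!\w)([A-Za-z])\.")

def strip_abbreviation_dots(text):
    """Hypothetical helper wrapping the substitution."""
    return pattern.sub(r"\1", text)

# Works for five or more dotted letters.
print(strip_abbreviation_dots("A.B.C.D.E. line-continued"))  # ABCDE line-continued

# A sentence ending in a longer word keeps its period.
print(strip_abbreviation_dots("end-of-sentence."))  # end-of-sentence.

# Caveat: a sentence ending in a one-letter word loses its period.
print(strip_abbreviation_dots("See plan B."))  # See plan B
```

The last call shows a trade-off of this approach: any standalone single letter followed by a dot is treated as part of an abbreviation, which may or may not be acceptable depending on the data.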

Answer 1: (score: 1)

My guess is that this expression, or a modified version of it, might work:

((?:\w\.){2,})

Test with re.findall

import re

regex = r"((?:\w\.){2,})"
test_str = "docs = ['U.S.','U.S.S.R. line-continued', 'u.s.s.r. line-continued', 'U.S.S.R. Title Case', 'end-of-sentence. New-sentence', 'end-of-sentence.','U.S.S.R.U.S.S.R.U.S.S.R.U.S.S.R. line-continued']"
print(re.findall(regex, test_str))

Output

['U.S.', 'U.S.S.R.', 'u.s.s.r.', 'U.S.S.R.', 'U.S.S.R.U.S.S.R.U.S.S.R.U.S.S.R.']

Test with re.finditer

import re

regex = r"((?:\w\.){2,})"
test_str = "docs = ['U.S.','U.S.S.R. line-continued', 'u.s.s.r. line-continued', 'U.S.S.R. Title Case', 'end-of-sentence. New-sentence', 'end-of-sentence.','U.S.S.R.U.S.S.R.U.S.S.R.U.S.S.R. line-continued'] "
matches = re.finditer(regex, test_str, re.MULTILINE)

# Report every match together with the span of each capture group
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(
        matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print("Group {groupNum} found at {start}-{end}: {group}".format(
            groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum),
            group=match.group(groupNum)))

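The expression is explained in the top-right panel of this demo, if you wish to explore, simplify, or modify it. In this link, you can watch how it matches against some sample inputs step by step, if you like.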
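The tests in this answer only locate the abbreviations; they do not remove the dots. One possible way to go from finding to rewriting, offered as a sketch rather than as part of the original answer, is to pass a function as the replacement argument of re.sub and strip the dots from each matched run. The helper name remove_abbreviation_dots and the variable cleaned are hypothetical.

```python
import re

# Same expression as in this answer: two or more "word character + dot" pairs.
regex = re.compile(r"((?:\w\.){2,})")

def remove_abbreviation_dots(text):
    """Hypothetical helper: drop the dots inside each matched abbreviation."""
    return regex.sub(lambda m: m.group(1).replace(".", ""), text)

docs = ['U.S.S.R. line-continued', 'u.s.s.r. line-continued',
        'U.S.S.R. Title Case', 'end-of-sentence. New-sentence',
        'end-of-sentence.']

cleaned = [remove_abbreviation_dots(doc) for doc in docs]
print(cleaned)
# ['USSR line-continued', 'ussr line-continued', 'USSR Title Case',
#  'end-of-sentence. New-sentence', 'end-of-sentence.']
```

Because the expression requires at least two letter-dot pairs, a lone letter before a period (as in "See plan B.") is left untouched. On the other hand, \w also matches digits and underscores, so something like "1.2.3." would be collapsed as well; narrowing \w to [A-Za-z] may be preferable depending on the input.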