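I have some sentences that contain abbreviations. The goal is to remove the period when it appears inside an abbreviation (e.g. "U.S."), but to keep it when the period marks the normal end of a sentence. Specifically, the following test input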
docs = ['U.S.S.R. line-continued', 'u.s.s.r. line-continued', 'U.S.S.R. Title Case', 'end-of-sentence. New-sentence', 'end-of-sentence.']
should be converted to
['USSR line-continued', 'ussr line-continued', 'USSR Title Case', 'end-of-sentence. New-sentence', 'end-of-sentence.']
I am trying
[re.sub(r"((\w)\.){2,}", r"\1", doc) for doc in docs]
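which is meant to keep the character whenever the "character followed by a dot" pattern occurs more than once. But this does not work.
This works: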
[re.sub(r"(\w)\.(\w)\.(\w)?\.?(\w)?\.?", r"\1\2\3\4", doc) for doc in docs]
but it does not generalize to five or more dotted characters.
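For reference, the first attempt fails because a quantified capture group retains only its last iteration, so \1 holds just the final letter-dot pair rather than the whole abbreviation. A minimal check of that behavior, using one input from the list above (the variable names are only for illustration):

```python
import re

doc = 'U.S.S.R. line-continued'

# The group ((\w)\.) is repeated four times here, but group 1
# only remembers the text of its final repetition.
match = re.search(r"((\w)\.){2,}", doc)
whole_match = match.group(0)     # the full abbreviation
last_iteration = match.group(1)  # only the last letter-dot pair

# Substituting \1 therefore discards every letter except the last one.
first_attempt = re.sub(r"((\w)\.){2,}", r"\1", doc)

print(whole_match)
print(last_iteration)
print(first_attempt)
```

So the abbreviation collapses to 'R.' instead of 'USSR', which is why a different approach is needed.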
Answer 0 (score: 2)
I have a simpler approach. Use this regex:
import re
docs = ['U.S.S.R. line-continued', 'u.s.s.r. line-continued', 'U.S.S.R. Title Case', 'end-of-sentence. New-sentence', 'end-of-sentence.']
print ([re.sub(r"(?<!\w)([A-Za-z])\.", r"\1", doc) for doc in docs])
Output:
['USSR line-continued', 'ussr line-continued', 'USSR Title Case', 'end-of-sentence. New-sentence', 'end-of-sentence.']
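Why this works, and where it may not: the lookbehind (?<!\w) only allows a match when the letter is not preceded by a word character. Inside an abbreviation each letter follows a dot or the start of the string, while the final letter of an ordinary word follows another letter. The sketch below illustrates both cases plus one caveat; strip_single_letter_dots is a hypothetical wrapper name introduced here, not something from the answer.

```python
import re

def strip_single_letter_dots(text):
    # Hypothetical wrapper around the regex from the answer above.
    return re.sub(r"(?<!\w)([A-Za-z])\.", r"\1", text)

# Every letter of the abbreviation follows '.' or the string start.
abbreviation = strip_single_letter_dots("U.S.S.R. line-continued")

# The 'e' before the final dot follows 'c', so the lookbehind blocks the match.
sentence_end = strip_single_letter_dots("end-of-sentence.")

# Caveat: a standalone one-letter word that ends a sentence also loses its dot.
one_letter_word = strip_single_letter_dots("Choose plan B. Then continue")

print(abbreviation)
print(sentence_end)
print(one_letter_word)
```

If inputs can contain one-letter words at the end of a sentence, requiring at least two consecutive letter-dot pairs is one possible way to avoid that case.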
Answer 1 (score: 1)
I would guess that this expression, or a modified version of it, might work:
Using re.findall:
import re
regex = r"((?:\w\.){2,})"
test_str = "docs = ['U.S.','U.S.S.R. line-continued', 'u.s.s.r. line-continued', 'U.S.S.R. Title Case', 'end-of-sentence. New-sentence', 'end-of-sentence.','U.S.S.R.U.S.S.R.U.S.S.R.U.S.S.R. line-continued']"
print(re.findall(regex, test_str))
Output:
['U.S.', 'U.S.S.R.', 'u.s.s.r.', 'U.S.S.R.', 'U.S.S.R.U.S.S.R.U.S.S.R.U.S.S.R.']
Using re.finditer:
import re
regex = r"((?:\w\.){2,})"
test_str = "docs = ['U.S.','U.S.S.R. line-continued', 'u.s.s.r. line-continued', 'U.S.S.R. Title Case', 'end-of-sentence. New-sentence', 'end-of-sentence.','U.S.S.R.U.S.S.R.U.S.S.R.U.S.S.R. line-continued']"
matches = re.finditer(regex, test_str, re.MULTILINE)
# Report each full match, then every capture group inside it.
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
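The expression is explained in the top-right panel of this demo, if you wish to explore, simplify, or modify it. In this link, you can watch step by step how it matches against some sample inputs, if you like.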
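The snippets in this answer only locate the abbreviations. One possible way to turn the same pattern into the substitution the question asks for is to pass a function to re.sub that removes the dots from each match. This is a sketch, not part of the original answer; remove_abbreviation_dots and ABBREVIATION are hypothetical names.

```python
import re

# Two or more consecutive "word character + dot" pairs, as in the answer's regex.
ABBREVIATION = re.compile(r"(?:\w\.){2,}")

def remove_abbreviation_dots(text):
    # For every matched abbreviation, return the same text without its dots.
    return ABBREVIATION.sub(lambda m: m.group(0).replace(".", ""), text)

docs = ['U.S.S.R. line-continued', 'u.s.s.r. line-continued', 'U.S.S.R. Title Case',
        'end-of-sentence. New-sentence', 'end-of-sentence.']
result = [remove_abbreviation_dots(doc) for doc in docs]
print(result)
```

Because a single letter-dot pair never matches, an ordinary sentence-ending period is left alone. Note that \w also matches digits and underscores, and that an abbreviation which itself ends a sentence would lose its final period, so the pattern may need tightening for real text.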