Question

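Good morning,

I found several threads about splitting a string with multiple delimiters, but none about splitting with one delimiter and multiple conditions.

I want to split the following string into sentences: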

desc = "Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. She speaks both English and Polish."

If I do this:

[t.split('. ') for t in desc]

I get:

['Dr', 'Anna Pytlik is an expert in conservative and aesthetic dentistry', 'She speaks both English and Polish.']

I don't want to split at the first dot, after 'Dr'. How can I add a list of substrings after which .split('. ') should not be applied?

Thank you!

Answer 1

You can use re.split with a negative lookbehind:

>>> import re
>>> desc = "Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. She speaks both English and Polish."
>>> re.split(r"(?<!Dr|Mr)\. ", desc)
['Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry',
 'She speaks both English and Polish.']

Just add more "exceptions", separated by |.
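As a small illustration of adding another exception, here is the same pattern with a third title of the same length. The title Ms and the sample sentence are assumptions made up for this sketch, not part of the question:

```python
import re

# Hypothetical sample text; "Ms" is an extra exception added for illustration.
text = "Mr. Smith met Ms. Jones. They spoke with Dr. Lee."

# All three alternatives are two characters long, so the lookbehind is fixed-width.
sentences = re.split(r"(?<!Dr|Mr|Ms)\. ", text)
print(sentences)
```

This only works as long as every alternative has the same number of characters.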

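Update: It looks like a negative lookbehind requires all alternatives to have the same length, so this does not work with both "Dr." and "Prof." One workaround might be to pad the patterns with ., e.g. (?<!..Dr|..Mr|Prof). You can easily write a helper method that pads each title with as many . as needed. However, this may break if the very first word of the text is Dr., because the .. will not match.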
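A minimal sketch of such a padding helper might look like the following. The name build_splitter, the title list and the sample texts are assumptions made for illustration:

```python
import re

def build_splitter(titles):
    """Compile a pattern that splits on '. ' unless a title precedes the dot.

    Hypothetical helper: shorter titles are left-padded with '.' wildcards
    so that every lookbehind alternative has the same width.
    """
    width = max(len(t) for t in titles)
    padded = ["." * (width - len(t)) + re.escape(t) for t in titles]
    return re.compile(r"(?<!" + "|".join(padded) + r")\. ")

splitter = build_splitter(["Dr", "Mr", "Prof"])  # pattern: (?<!..Dr|..Mr|Prof)\.

text = "He met Dr. Anna Pytlik. Later Prof. Miller joined. They talked."
parts = splitter.split(text)
print(parts)

# The caveat from the answer: with fewer than `width` characters before the
# dot, the padded alternative cannot match, so the text is split after "Dr".
edge = splitter.split("Dr. Anna is here.")
print(edge)
```

The second call shows the failure mode described above: a title at the very start of the text is not protected by the padded pattern.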

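Another workaround might be to first replace all titles with some placeholders, e.g. "Dr." -> "{DR}" and "Prof." -> "{PROF}", then split, and then swap the original titles back in. This way you do not even need regular expressions.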

pairs = (("Dr.", "{DR}"), ("Prof.", "{PROF}")) # and some more
def subst_titles(s, reverse=False):
    for x, y in pairs:
        s = s.replace(*(x, y) if not reverse else (y, x))
    return s

Example:

>>> text = "Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. Prof. Miller speaks both English and Polish."
>>> [subst_titles(s, True) for s in subst_titles(text).split(". ")]
['Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry', 'Prof. Miller speaks both English and Polish.']

Answer 2

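You can split and then re-join the pieces around Dr / Mr / ... This does not need a complicated regular expression and may be faster (you should benchmark it to choose the best option).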
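The answer gives no code, so the following is only one possible reading of "split and then re-join". The function name split_sentences and the TITLES tuple are assumptions made for this sketch:

```python
TITLES = ("Dr", "Mr", "Prof")  # hypothetical list of exceptions

def split_sentences(text, titles=TITLES, sep=". "):
    """Split on sep, then glue back pieces that were cut right after a title."""
    sentences = []
    pending = ""  # text collected so far for the current sentence
    for piece in text.split(sep):
        pending += piece
        last_word = piece.rsplit(" ", 1)[-1]
        if last_word in titles:
            # The split happened right after a title: restore the separator
            # and keep collecting into the same sentence.
            pending += sep
        else:
            sentences.append(pending)
            pending = ""
    if pending:
        # The text ended on a title; drop the separator that was added back.
        sentences.append(pending[: -len(sep)])
    return sentences

desc = ("Dr. Anna Pytlik is an expert in conservative and aesthetic dentistry. "
        "She speaks both English and Polish.")
result = split_sentences(desc)
print(result)
```

Because the last word of each piece is compared as a whole, titles of different lengths work without padding, and a title at the very start of the text is handled as well.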

Split a string with one delimiter but multiple conditions

2 answers: