Question

I want to replace all single quotes in a string with double quotes, except for occurrences such as "n't", "ll", "m" etc.

input="the stackoverflow don\'t said, \'hey what\'"
output="the stackoverflow don\'t said, \"hey what\""

Code 1: (by @ https://stackoverflow.com/users/918959/antti-haapala)

import re

def convert_regex(text):
    return re.sub(r"(?<!\w)'(?!\w)|(?<!\w)'(?=\w)|(?<=\w)'(?!\w)", '"', text)

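There are 3 cases: ' is neither preceded nor followed by an alphanumeric character; or it is not preceded, but is followed, by an alphanumeric character; or it is preceded, but not followed, by an alphanumeric character.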

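Problem: this does not work for words that end with an apostrophe, i.e. most possessive plurals, nor for informal abbreviations that start with an apostrophe.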
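The limitation can be reproduced with a short sketch (written for Python 3; the two extra sample strings are made up for illustration and are not from the original post):

```python
import re

def convert_regex(text):
    # Same pattern as Code 1: a quote is replaced unless it has
    # word characters on both sides.
    return re.sub(r"(?<!\w)'(?!\w)|(?<!\w)'(?=\w)|(?<=\w)'(?!\w)", '"', text)

# The contraction is kept and the real quotation is converted.
print(convert_regex("the stackoverflow don't said, 'hey what'"))
# -> the stackoverflow don't said, "hey what"

# A possessive plural ends with an apostrophe, so it is converted too.
print(convert_regex("the persons' books"))
# -> the persons" books

# An informal abbreviation that starts with an apostrophe is also converted.
print(convert_regex("rock 'n roll"))
# -> rock "n roll
```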

Code 2: (by @ https://stackoverflow.com/users/953482/kevin)

def convert_text_func(s):
    c = "_" #placeholder character. Must NOT appear in the string.
    assert c not in s
    protected = {word: word.replace("'", c) for word in ["don't", "it'll", "I'm"]}
    for k,v in protected.items():
        s = s.replace(k,v)
    s = s.replace("'", '"')
    for k,v in protected.items():
        s = s.replace(v,k)
    return s

There are too many words to specify, and how would I specify persons' etc.? Please help.

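Edit 1: I am using @anubhava's brilliant answer, but I am facing this problem. Sometimes words translated from another language make the approach fail. Code: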

text=re.sub(r"(?<!s)'(?!(?:t|ll|e?m|s|d|ve|re|clock)\b)", '"', text)

Problem:

In the text 'Kumbh melas', melas is a Hindi word rendered in English, not a plural possessive noun.

Input="Similar to the 'Kumbh melas', celebrated by the banks of the holy rivers of India,"
Output=Similar to the "Kumbh melas', celebrated by the banks of the holy rivers of India,
Expected Output=Similar to the "Kumbh melas", celebrated by the banks of the holy rivers of India,

I hope a condition can be added that fixes this somehow. Human intervention is the last resort.

Edit 2: A naive and lengthy way to fix it:

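import string
import enchant  # PyEnchant spell checker

def replace_translations(text):
    d = enchant.Dict("en_US")
    words = tokenize_words(text)  # tokenizer helper, not shown in the question
    punctuations = set(string.punctuation)
    # Stop one short of the end so that words[i+1] always exists.
    for i, word in enumerate(words[:-1]):
        # A non-dictionary word followed by a quote is assumed to be a
        # translated word, so that quote is treated as a closing quote.
        if word not in punctuations and not d.check(word) and words[i+1] == "'":
            text = text.replace(word + words[i+1], word + "\"")
    return text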

Am I missing any corner cases, or is there a better approach?

Answer 1

First attempt

You can also use this regex:

(?:(?<!\w)'((?:.|\n)+?'?)'(?!\w))


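This regex matches a whole sentence/word together with its opening and closing quotes, and keeps the quoted content in group 1, so you can replace the matched part with "\1".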

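(?<!\w) - negative lookbehind for a word character, to exclude words like "you'll" etc., while still allowing the regex to match quotations that follow characters such as \n, :, ; or . etc. Assuming that there is always whitespace before a quotation is risky,
' - a single quote sign,
((?:.|\n)+?'?) - group 1, wrapping a non-capturing group: one or more of any character or a newline (to match multi-line sentences) with lazy quantification (to avoid matching from the first to the last single quote sign), followed by an optional single quote sign in case there are two in a row,
'(?!\w) - a single quote not followed by a word character, to exclude text like "I'm", "you'll" etc., where the quote sits between letters.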
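As a minimal sketch of how the pieces above fit together, the regex can be applied with re.sub and the replacement "\1" (Python 3; the name quote_first_attempt and the second and third sample strings are illustrative only):

```python
import re

# The "first attempt" pattern, unchanged.
FIRST_ATTEMPT = re.compile(r"(?:(?<!\w)'((?:.|\n)+?'?)'(?!\w))")

def quote_first_attempt(text):
    # Group 1 holds the quoted content; wrap it in double quotes.
    return FIRST_ATTEMPT.sub(r'"\1"', text)

print(quote_first_attempt("the stackoverflow don't said, 'hey what'"))
# -> the stackoverflow don't said, "hey what"

# An apostrophe between letters inside the quotation is left alone.
print(quote_first_attempt("he said: 'I'm fine'"))
# -> he said: "I'm fine"

# The (?:.|\n) alternative lets a quotation span several lines.
print(repr(quote_first_attempt("'a\nb'")))
# -> '"a\nb"'
```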

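The s' case

However, there is still a problem with matching sentences in which an apostrophe follows a word ending in s, for example: 'the classes' hours'. I think a regex cannot tell whether a ' that follows s should be treated as the end of a quotation or as the apostrophe of a word ending in s'. I did work out a limited workaround for this problem, with the regex:

(?:(?<!\w)'((?:.|\n)+?'?)(?:(?<!s)'(?!\w)|(?<=s)'(?!([^']|\w'\w)+'(?!\w))))

It has an additional alternative for the s' cases: (?<!s)'(?!\w)|(?<=s)'(?!([^']|\w'\w)+'(?!\w)) where: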

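(?<!s)'(?!\w) - if there is no s before the ', match as in the regex above (first attempt),
(?<=s)'(?!([^']|\w'\w)+'(?!\w)) - if there is an s before the ', end the match on this ' only if the following text contains no other ' followed by a non-word character, either before the end of the text or before another ' (but only a ' preceded by a letter other than s, or one that opens the next quotation). The \w'\w part lets such a match include a ' that sits between letters, as in i'm etc.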

This regex should only match incorrectly when there are several s' cases in a row. Still, it is far from a perfect solution.
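A possible Python 3 rendering of this workaround is sketched below. The pattern is the one given above, split across two string literals only for readability; the name quote_s_case and the third sample sentence are illustrative assumptions.

```python
import re

# The s' workaround pattern, unchanged apart from being split in two.
S_CASE = re.compile(
    r"(?:(?<!\w)'((?:.|\n)+?'?)"
    r"(?:(?<!s)'(?!\w)|(?<=s)'(?!([^']|\w'\w)+'(?!\w))))"
)

def quote_s_case(text):
    # Group 1 is the quoted content; group 2 lives inside a lookahead
    # and is not used in the replacement.
    return S_CASE.sub(r'"\1"', text)

# The apostrophe after "classes" is kept because a later closing quote exists.
print(quote_s_case("'the classes' hours'"))
# -> "the classes' hours"

print(quote_s_case("the stackoverflow don't said, 'hey what'"))
# -> the stackoverflow don't said, "hey what"

print(quote_s_case("she read 'the kids' story' aloud"))
# -> she read "the kids' story" aloud
```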

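Flaws of \w

Also, with \w there is always a chance that ' occurs after a symbol, or after a character that is a letter but not in [a-zA-Z_0-9], such as a local-language character, and it would then be treated as the start of a quotation. This can be avoided by replacing (?<!\w) and (?!\w) with (?<!\p{L}) and (?!\p{L}), or with something like (?<=^|[,.?!)\s]) etc., a positive lookbehind for the characters that can occur in a sentence before a quotation. The list may get quite long, though.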
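In Python specifically, how much this matters depends on the flags. For str patterns in Python 3, \w is Unicode-aware by default and only falls back to [a-zA-Z0-9_] under re.ASCII. The standard re module does not accept \p{L} (the third-party regex module does). The sketch below uses a simplified opening-quote pattern and a made-up sample string to show the difference:

```python
import re

# Simplified pattern: a quote that is not preceded by a word character
# and is followed by one, i.e. a candidate opening quote.
pattern = r"(?<!\w)'(?=\w)"
text = "caf\u00e9's 'menu'"  # café's 'menu'

# Default in Python 3: é counts as \w, so the apostrophe in café's is kept.
unicode_result = re.sub(pattern, '"', text)
print(unicode_result)
# -> café's "menu'

# With re.ASCII, é is no longer \w and the apostrophe after it is
# mistaken for an opening quote, which is the flaw described above.
ascii_result = re.sub(pattern, '"', text, flags=re.ASCII)
print(ascii_result)
# -> café"s "menu'
```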

Answer 2

You can use:

import re

text = "I'm one of the persons' stackoverflow don't th'em said, 'hey what' I'll handle it."
print(re.sub(r"(?<!s)'(?!(?:t|ll|e?m)\b)", '"', text))

Output:

I'm one of the persons' stackoverflow don't th'em said, "hey what" I'll handle it.
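The question's Edit 1 reports that this kind of pattern leaves the closing quote of 'Kumbh melas' untouched, because (?<!s) cannot tell a closing quote from a possessive. One possible adjustment, offered only as a heuristic sketch and not as part of the answer above, is to drop the lookbehind and track whether a quotation is currently open. The function name convert_with_pairing is hypothetical, the suffix list is the one from Edit 1, and the code assumes Python 3 (nonlocal).

```python
import re

# Any apostrophe that is not followed by a known contraction ending
# (the suffix list from the question's Edit 1).
CANDIDATE = re.compile(r"'(?!(?:t|ll|e?m|s|d|ve|re|clock)\b)")

def convert_with_pairing(text):
    open_quote = False  # True while inside a quotation

    def repl(match):
        nonlocal open_quote
        after_s = match.start() > 0 and text[match.start() - 1] == "s"
        if after_s and not open_quote:
            # No quotation is open, so treat this as a plural possessive.
            return "'"
        open_quote = not open_quote
        return '"'

    return CANDIDATE.sub(repl, text)

print(convert_with_pairing(
    "Similar to the 'Kumbh melas', celebrated by the banks of the holy rivers of India,"))
# -> Similar to the "Kumbh melas", celebrated by the banks of the holy rivers of India,
```

This remains ambiguous for a possessive inside a quotation, such as 'the classes' hours', where the first s' would be taken as the closing quote.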


Answer 3

Try this: you can use the regex ((?<=\s)'([^']+)'(?=\s)) and replace with "\2"

import re
p = re.compile(r"((?<=\s)'([^']+)'(?=\s))")
test_str = "I'm one of the persons' stackoverflow don't th'em said, 'hey what' I'll handle it."
# Raw string, so that \2 is a backreference to group 2 and not an octal escape.
subst = r'"\2"'

result = re.sub(p, subst, test_str)
print(result)

Output:

I'm one of the persons' stackoverflow don't th'em said, "hey what" I'll handle it.


Answer 4

Here is a non-regex way

text="the stackoverflow don't said, 'hey what'"

out = []
for i, j in enumerate(text):
    if j == '\'':
        if text[i-1:i+2] == "n't" or text[i:i+3] == "'ll" or text[i:i+3] == "'m":
            out.append(j)
        else:
            out.append('"')
    else:
        out.append(j)

print(''.join(out))

which outputs

the stackoverflow don't said, "hey what"

Of course, you can improve the exclusion list so that you do not have to check every exclusion by hand...
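One way such an improvement might look is sketched below (Python 3). The suffix set is borrowed from the regex in the question's Edit 1; the names CONTRACTION_SUFFIXES and replace_quotes, and the second sample sentence, are illustrative. Like the loop above, it does not handle plural possessives such as persons'.

```python
# Contraction endings that may follow an apostrophe inside a word
# (taken from the regex in the question's Edit 1).
CONTRACTION_SUFFIXES = frozenset(
    ["t", "ll", "m", "em", "s", "d", "ve", "re", "clock"])

def replace_quotes(text, suffixes=CONTRACTION_SUFFIXES):
    out = []
    for i, ch in enumerate(text):
        if ch != "'":
            out.append(ch)
            continue
        # Collect the run of letters directly after the apostrophe.
        j = i + 1
        while j < len(text) and text[j].isalpha():
            j += 1
        tail = text[i + 1:j].lower()
        inside_word = i > 0 and text[i - 1].isalpha()
        # Keep the apostrophe only for letter + ' + known suffix.
        out.append("'" if inside_word and tail in suffixes else '"')
    return "".join(out)

print(replace_quotes("the stackoverflow don't said, 'hey what'"))
# -> the stackoverflow don't said, "hey what"
print(replace_quotes("I'll say 'it's fine'"))
# -> I'll say "it's fine"
```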

Replace single quotes with double, excluding some elements
