Question

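I have some text from the web, but people wrote it using short forms, such as uni for university and awsm for awesome. I can guess the list of these words, but how do I correct them with Python? I tried the following, but it didn't work.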

APPOSTOPHES= {"'s": "is", "'re":"are"}    
s= " i luv my iphone, you're awsm apple. DisplayisAwesome, Sooooo happppppy"
words = s.split()
rfrm=[APPOSTOPHES[word] if word in APPOSTOPHES else word for word in words]
rfrm= " ".join(rfrm)
print(rfrm)

i luv my iphone, you're awsm apple. DisplayisAwesome, Sooooo happppppy

But it prints the same sentence. It doesn't change anything.

Answer 1

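Your code has a few problems. The first is that your APPOSTOPHES[word] check never matches any of the candidate replacements: split() yields whole tokens such as you're, while the dictionary keys are fragments such as 're, so the membership test is always false.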
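To make that first problem concrete, here is a minimal sketch using the dictionary from the question and a shortened version of the sentence. The names key_hits and suffix_hits are only for illustration.

```python
APPOSTOPHES = {"'s": "is", "'re": "are"}
words = "you're awsm apple.".split()

# Whole tokens are never keys of the dictionary, so nothing is replaced.
key_hits = [w for w in words if w in APPOSTOPHES]
print(key_hits)     # []

# A substring test does find the contraction inside the token.
suffix_hits = [(w, k) for w in words for k in APPOSTOPHES if k in w]
print(suffix_hits)  # [("you're", "'re")]
```

The second test is the one the rewritten loop relies on.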

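I've broken the code out in a very explicit way and made a small correction to your APPOSTOPHES dictionary: note the leading space that is now in each value. The rest is described in the code comments: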

APPOSTOPHES= {"'s": " is", "'re":" are"}    
test_string = " i luv my iphone, you're awsm apple. DisplayisAwesome, Sooooo happppppy"

# split the words based on whitespace
sentence_list = test_string.split()

# make a place where we can build our new sentence
new_sentence = []

# look through each word 
for word in sentence_list:
    # look for each candidate
    for candidate_replacement in APPOSTOPHES:
        # if our candidate is there in the word
        if candidate_replacement in word:
            # replace it 
            word = word.replace(candidate_replacement, APPOSTOPHES[candidate_replacement])

    # and pop it onto a new list 
    new_sentence.append(word)

rfrm = " ".join(new_sentence)
print(rfrm)
# i luv my iphone, you are awsm apple. DisplayisAwesome, Sooooo happppppy

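Edit: As Alexis's comment points out, replacing words and replacing contractions will cause trouble if you try to apply the same pattern to everything. I took this approach because your variable name is close to the word "apostrophes", and that is what we are changing here. His suggestion to use the nltk tokenize methods is a good one; if you are going to build your approach on a library, definitely learn its preferred methods.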
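As a rough illustration of why whole words need different handling from contractions, the sketch below swaps slang only when it matches a complete word. It uses the standard re module as a stand-in for a real tokenizer (it is not the nltk approach Alexis suggested), and the SLANG table and expand_slang function are hypothetical names made up for this example.

```python
import re

# Hypothetical slang table; in practice you would supply your own list.
SLANG = {"luv": "love", "awsm": "awesome", "uni": "university"}

def expand_slang(text, table=SLANG):
    """Replace each run of letters that is a key in table (case-insensitive)."""
    def swap(match):
        word = match.group(0)
        return table.get(word.lower(), word)
    # [A-Za-z]+ matches whole runs of letters, so "uni" inside "united" is left alone.
    return re.sub(r"[A-Za-z]+", swap, text)

# Substring replacement corrupts words that merely contain the slang.
naive = "united uni".replace("uni", "university")
print(naive)  # universityted university

fixed = expand_slang("i luv my iphone, united uni is awsm")
print(fixed)  # i love my iphone, united university is awesome
```

Because this pattern splits on apostrophes, contractions such as you're still need the separate substring pass shown above.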

My answer was meant to get you past your immediate obstacle and to show you why you were getting the same sentence string back.

How to correct slang using Python or NLTK?
