I have a dataframe of roughly 200,000 rows, and each row holds about 30 tokens. I am trying to correct spelling mistakes and then lemmatize the tokens.
Some words are not in the dictionary, so if a word occurs frequently enough I keep it as-is; otherwise I replace it with the correction.
from spellchecker import SpellChecker

spell = SpellChecker()

def spelling_mistake_corrector(word):
    checkedWord = spell.correction(word)
    # keep the correction only if it is at least as frequent as the original word
    if freqDist[checkedWord] >= freqDist[word]:
        word = checkedWord
    return word
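Since spell.correction is by far the most expensive call and the same tokens recur many times across 200,000 rows, memoizing the corrector so each unique word is corrected only once is a common speed-up. A minimal sketch with functools.lru_cache, where a stub dictionary and a toy Counter stand in for the real SpellChecker and freqDist (both are assumptions for illustration):

```python
from functools import lru_cache
from collections import Counter

# toy stand-in for the corpus frequency distribution from the post
freqDist = Counter({"hello": 10, "helo": 1, "world": 8})

calls = 0  # counts how often the expensive correction actually runs

@lru_cache(maxsize=None)
def correct_cached(word):
    # In the real code this body would call spell.correction(word),
    # the slow step; a stub lookup table stands in for it here.
    global calls
    calls += 1
    stub = {"helo": "hello"}
    checked = stub.get(word, word)
    return checked if freqDist[checked] >= freqDist[word] else word

tokens = ["helo", "world", "helo", "helo"]
corrected = [correct_cached(w) for w in tokens]
# "helo" is corrected once and served from the cache afterwards
```

With a cache, the number of correction calls drops from one per token to one per unique word, which for ~30 tokens x 200,000 rows is usually a large reduction.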
def correctorForAll(text):
    text = [spelling_mistake_corrector(word) for word in text]
    return text
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmatize_words(text):
    text = [lemmatizer.lemmatize(word) for word in text]
    text = [word for word in text if len(word) > 2]  # filtering 1- and 2-letter words out
    return text
def apply_corrector_and_lemmatizer(text):
    return lemmatize_words(correctorForAll(text))

df['tokenized'] = df['tokenized'].apply(apply_corrector_and_lemmatizer)
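Because both correction and lemmatization depend only on the word itself, another approach is to collect the vocabulary of unique tokens once, transform each unique word a single time, and then map every row through that precomputed table. A sketch under the assumption that df['tokenized'] holds lists of strings; the transform function here is a hypothetical stand-in for the corrector-plus-lemmatizer pipeline:

```python
import pandas as pd

df = pd.DataFrame({"tokenized": [["cats", "ran"], ["cats", "dogs"]]})

def transform(word):
    # Stand-in for spelling_mistake_corrector + lemmatizer.lemmatize;
    # the real code would call those here instead of this toy table.
    return {"cats": "cat", "dogs": "dog", "ran": "run"}.get(word, word)

# 1. collect the unique vocabulary across all rows
vocab = {w for row in df["tokenized"] for w in row}
# 2. transform each unique word exactly once
lookup = {w: transform(w) for w in vocab}
# 3. map every row through the precomputed table, keeping the >2-letter filter
df["tokenized"] = df["tokenized"].apply(
    lambda row: [lookup[w] for w in row if len(lookup[w]) > 2]
)
```

This turns the cost from (rows x tokens) expensive calls into one call per unique word plus cheap dictionary lookups.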
The problem is: this code has been running on Colab for 3 hours. What can I do to reduce the runtime? Thanks!