I have a dataframe of roughly 200,000 rows, and each row holds about 30 tokens. I am trying to correct spelling mistakes and then lemmatize the tokens.
Some words are not in the dictionary, so if a word occurs frequently enough I keep it as-is; otherwise I replace it with the correction.
from spellchecker import SpellChecker

spell = SpellChecker()

def spelling_mistake_corrector(word):
    checkedWord = spell.correction(word)
    # keep the correction only if it is at least as frequent as the original word
    if freqDist[checkedWord] >= freqDist[word]:
        word = checkedWord
    return word
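Since spell.correction is by far the most expensive call and the same tokens recur many times across 200,000 rows, memoizing the corrector so each unique word is corrected only once is a common speed-up. A minimal sketch with functools.lru_cache, where a stub dictionary and a toy Counter stand in for the real SpellChecker and freqDist (both are assumptions for illustration):

```python
from functools import lru_cache
from collections import Counter

# toy stand-in for the corpus frequency distribution from the post
freqDist = Counter({"hello": 10, "helo": 1, "world": 8})

calls = 0  # counts how often the expensive correction actually runs

@lru_cache(maxsize=None)
def correct_cached(word):
    # In the real code this body would call spell.correction(word),
    # the slow step; a stub lookup table stands in for it here.
    global calls
    calls += 1
    stub = {"helo": "hello"}
    checked = stub.get(word, word)
    return checked if freqDist[checked] >= freqDist[word] else word

tokens = ["helo", "world", "helo", "helo"]
corrected = [correct_cached(w) for w in tokens]
# "helo" is corrected once and served from the cache afterwards
```

With a cache, the number of correction calls drops from one per token to one per unique word, which for ~30 tokens x 200,000 rows is usually a large reduction.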
def correctorForAll(text):
    text = [spelling_mistake_corrector(word) for word in text]
    return text
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmatize_words(text):
    text = [lemmatizer.lemmatize(word) for word in text]
    text = [word for word in text if len(word) > 2]  # filtering 1- and 2-letter words out
    return text
def apply_corrector_and_lemmatizer(text):
    return lemmatize_words(correctorForAll(text))

df['tokenized'] = df['tokenized'].apply(apply_corrector_and_lemmatizer)
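Because both correction and lemmatization depend only on the word itself, another approach is to collect the vocabulary of unique tokens once, transform each unique word a single time, and then map every row through that precomputed table. A sketch under the assumption that df['tokenized'] holds lists of strings; the transform function here is a hypothetical stand-in for the corrector-plus-lemmatizer pipeline:

```python
import pandas as pd

df = pd.DataFrame({"tokenized": [["cats", "ran"], ["cats", "dogs"]]})

def transform(word):
    # Stand-in for spelling_mistake_corrector + lemmatizer.lemmatize;
    # the real code would call those here instead of this toy table.
    return {"cats": "cat", "dogs": "dog", "ran": "run"}.get(word, word)

# 1. collect the unique vocabulary across all rows
vocab = {w for row in df["tokenized"] for w in row}
# 2. transform each unique word exactly once
lookup = {w: transform(w) for w in vocab}
# 3. map every row through the precomputed table, keeping the >2-letter filter
df["tokenized"] = df["tokenized"].apply(
    lambda row: [lookup[w] for w in row if len(lookup[w]) > 2]
)
```

This turns the cost from (rows x tokens) expensive calls into one call per unique word plus cheap dictionary lookups.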
The problem is: this code has been running on Colab for 3 hours. What can I do to reduce the runtime? Thanks!