Question

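I have a column in a pandas DataFrame where each cell contains a fairly long string of words. These strings come from a SQL database and contain a mix of words and non-English alphanumeric id phrases, separated by spaces. The strings can be as long as the SQL character max. This is also not a small DataFrame; I have a few million rows.

The question is: what is the fastest way to keep only the correct English words in each cell?

Below is my initial approach, which looks like it will take days to finish based on the speed tqdm reports (hence the progress_apply).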

import pandas as pd
from nltk.corpus import words
from tqdm import tqdm

def check_for_word(sentence):
    s = sentence.split(' ')
    for word in s:
        if word not in words.words():
            s.remove(word)
    return ' '.join(s)

tqdm.pandas(desc="Checking for Words in keywords")
df['keywords'] = df['keywords'].progress_apply(check_for_word)

Is there a noticeably faster way?

Thanks for your help!

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

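The answer below was very helpful and ran in under a second (a great improvement!). In the end I had to switch from nltk.corpus words to nltk.corpus wordnet, because words was not exhaustive enough for my purposes. The final result was: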

from nltk.corpus import wordnet
from tqdm import tqdm

def check_for_word(s):
    return ' '.join(w for w in str(s).split(' ') if len(wordnet.synsets(w)) > 0)

tqdm.pandas(desc="Checking for Words in Keywords")
df['keywords'] = df['keywords'].progress_apply(check_for_word)

It takes 43 seconds to run.

Answer 1

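words.words() returns a list, and checking whether a word is in a list takes O(n) time. To reduce the time complexity, you can build a set from this list, which provides constant-time lookups. The second optimization concerns the remove() method on a list, which also takes O(n) time. You can build a separate list instead to avoid that overhead. To learn more about the complexity of various operations, see https://www.ics.uci.edu/~pattis/ICS-33/lectures/complexitypython.txt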

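from nltk.corpus import words

# Build the set once; membership tests on a set take constant time on average
set_of_words = set(words.words())

def check_for_word(sentence):
    s = sentence.split(' ')
    # Build a new string instead of calling remove() on the list
    return ' '.join(word for word in s if word in set_of_words)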
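
To make the difference concrete without downloading any corpus, here is a minimal sketch that uses a tiny hand-written vocabulary as a stand-in for set(words.words()). The names make_word_filter, filter_with_remove, and SAMPLE_VOCABULARY are made up for illustration and are not part of nltk or pandas. It also shows a side effect of the in-place approach: calling remove() on a list while iterating over it skips the element that follows each removal, so two non-words in a row are not both dropped.

```python
def make_word_filter(vocabulary):
    """Return a function that keeps only the tokens found in `vocabulary`."""
    vocab = frozenset(vocabulary)  # hashed lookups: O(1) on average per token

    def keep_known_words(sentence):
        # str() guards against non-string cells such as NaN
        return ' '.join(w for w in str(sentence).split() if w in vocab)

    return keep_known_words


def filter_with_remove(sentence, vocabulary):
    """Mirror of the in-place approach: removes from the list being iterated."""
    tokens = sentence.split(' ')
    for word in tokens:
        if word not in vocabulary:
            # remove() shifts later items left, so the next token is skipped
            tokens.remove(word)
    return ' '.join(tokens)


# Hypothetical stand-in for set(words.words())
SAMPLE_VOCABULARY = {"red", "apple", "tree"}
keep_known_words = make_word_filter(SAMPLE_VOCABULARY)

print(keep_known_words("red x9 q7 apple"))                       # red apple
print(filter_with_remove("red x9 q7 apple", SAMPLE_VOCABULARY))  # red q7 apple
```

With the real corpus, the same idea would presumably be applied as df['keywords'].apply(make_word_filter(words.words())), so that the set is built exactly once rather than once per row or per word.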
