Question

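I would like to optimize the code below so that it can efficiently process 3,000 text documents, which will then be fed to a TFIDF Vectorizer and linkage() for clustering.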

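So far I have read the Excel file with pandas and saved the DataFrame column to a list variable. I then iterate over each text element in the list, split it into tokens, and filter the stop words out of those tokens. The filtered tokens are stored in another variable, and that variable is appended to a list. So in the end I have a list of processed text elements (built from the original list).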

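I think the optimization could happen in how the lists are created, how the stop words are filtered out, and how the data is saved to two different variables (documents_no_stopwords and processed_words).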

It would be great if someone could help me with this or suggest a direction to follow.

import pandas
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

# Assumed setup: the original post does not show how these names are defined.
tokenizer = RegexpTokenizer(r'\w+')
stop_words = stopwords.words('english')
documents_no_stopwords = []


def preprocessing(text):
    tokens = tokenizer.tokenize(text)

    processed_words = []
    for w in tokens:
        if w in stop_words:
            continue
        else:
            # keep only the tokens that are not stop words
            processed_words.append(w)
    # add the filtered document to the list of processed documents
    documents_no_stopwords.append(' '.join(processed_words))


temp = 0
df = pandas.read_excel('File.xlsx')

for text in df['text'].tolist():
    temp = temp + 1
    preprocessing(text)
    print(temp)

Answer 1

You should first build a set from the stop words and then filter the tokens with a list comprehension.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def preprocessing(txt):
    tokens = word_tokenize(txt)
    # print(tokens)
    stop_words = set(stopwords.words("english"))
    tokens = [i for i in tokens if i not in stop_words]

    return " ".join(tokens)

string = "Hey this is Sam. How are you?"
print(preprocessing(string))

Output:

'Hey Sam . How ?'

Instead of a for loop, use df.apply as shown below:

df['text'] = df['text'].apply(preprocessing)
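One further tweak worth considering: the function above rebuilds the stop word set on every call, so with 3,000 documents the set is constructed 3,000 times. The sketch below builds it once at module level instead. To stay runnable without NLTK data, it uses a tiny hypothetical stop word list (STOP_WORDS) and a simple regex tokenizer (TOKEN_RE) as stand-ins for stopwords.words("english") and word_tokenize. These names are assumptions for illustration, not part of the original answer.

```python
import re

# Hypothetical, deliberately tiny stop word set; in the answer's setup this
# would be set(stopwords.words("english")), built once at import time.
STOP_WORDS = frozenset({"this", "is", "are", "you", "the", "a", "and"})

# Simple stand-in for word_tokenize: runs of word characters,
# or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def preprocess(text, stop_words=STOP_WORDS):
    """Return text with stop words removed and tokens joined by spaces."""
    tokens = TOKEN_RE.findall(text)
    return " ".join(t for t in tokens if t not in stop_words)


documents = ["Hey this is Sam. How are you?", "This is the second document."]
cleaned = [preprocess(doc) for doc in documents]
print(cleaned)
```

With pandas, the same function could be passed to df['text'].apply(preprocess). Note that the membership test is case-sensitive, just as in the answer's code, which is why capitalized words such as "How" survive the filter.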

Why a set is preferred over a list

~~stopwords.words() contains duplicate entries. If you compare len(stopwords.words()) and len(set(stopwords.words())), the set is a few hundred entries shorter. That is why a set is preferred here.~~

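A membership test against a list scans its elements one by one, while a set uses a hash lookup that takes roughly constant time on average. Here is the performance difference between using a list and a set: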

x = stopwords.words('english')
y = set(stopwords.words('english'))

%timeit new = [i for i in tokens if i not in x]
# 10000 loops, best of 3: 120 µs per loop

%timeit old = [j for j in tokens if j not in y]
# 1000000 loops, best of 3: 1.16 µs per loop
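The %timeit lines above only work inside IPython. A rough way to reproduce the comparison in a plain script is the standard timeit module, as sketched below. The stop words and tokens here are made-up placeholder data (an assumption, not the NLTK list), and the absolute timings will differ from machine to machine.

```python
import timeit

# Hypothetical data: 200 made-up "stop words" and a 1,000-token document.
stop_list = ["word%d" % i for i in range(200)]
stop_set = set(stop_list)
tokens = ["word%d" % (i % 400) for i in range(1000)]


def filter_with(container):
    """Keep the tokens that are not in the given container."""
    return [t for t in tokens if t not in container]


# Run each variant 200 times and report the total elapsed time.
list_time = timeit.timeit(lambda: filter_with(stop_list), number=200)
set_time = timeit.timeit(lambda: filter_with(stop_set), number=200)
print("list: %.4f s, set: %.4f s" % (list_time, set_time))
```

Both variants return the same filtered tokens; only the cost of each membership test changes.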

In addition, a list comprehension is generally faster than an equivalent plain for loop.
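To check that claim on your own machine, a minimal sketch such as the following can be used. It times the explicit loop from the question against the comprehension from the answer on placeholder data (the stop word set and tokens are assumptions). The gap is usually much smaller than the list-versus-set difference, so it is worth measuring rather than assuming.

```python
import timeit

# Hypothetical stop words and a repeated, pre-tokenized sentence.
stop_set = {"this", "is", "are", "you"}
tokens = "Hey this is Sam . How are you ?".split() * 1000


def with_loop():
    """Explicit loop with append, as in the question."""
    kept = []
    for t in tokens:
        if t not in stop_set:
            kept.append(t)
    return kept


def with_comprehension():
    """List comprehension, as in the answer."""
    return [t for t in tokens if t not in stop_set]


loop_time = timeit.timeit(with_loop, number=100)
comp_time = timeit.timeit(with_comprehension, number=100)
print("loop: %.4f s, comprehension: %.4f s" % (loop_time, comp_time))
```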
