Question

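I am trying to remove stop words in both French and English. So far, I have only been able to remove the stop words of one language at a time. I have a text file containing 700 lines of mixed French and English text.

I am using Python for a clustering project on these 700 lines. However, the problem is with my clusters: they are full of French stop words, which greatly reduces their usefulness.

Here is my stop word code: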

stopwords = nltk.corpus.stopwords.words('english')

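As mentioned above, I also tried to include the 'french' stop words, but I could not add them in a single line of code or in the same variable.

Here is the code that processes my file, which contains the 700 lines of mixed French and English descriptions: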

Description2 = df['Description'].str.lower().apply(lambda x: ' '.join([word for word in str(x).split() if word not in (stopwords)]))

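I tried adding two stop word variables to the line of code above, but it only removed the stop words of the first variable.

Here is an example of a cluster I get because the French stop words are not removed: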

Cluster 5:
 la
 et
 dans
 les
 des
 est
 du
 le
 une
 en

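If I could remove the French stop words from my documents, I would get clusters that represent the actual words recurring in them.

Any help would be greatly appreciated. Thank you.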

Answer 1

Have you tried simply adding the French stop words to the English ones? For example, like this (using set() for efficiency, as mentioned in the nltk tutorial):

stopwords = set(nltk.corpus.stopwords.words('english')) | set(nltk.corpus.stopwords.words('french'))
# This way, you've got the english and french stop words in the stopwords variable

Description2 = df['Description'].str.lower().apply(lambda x: ' '.join([word for word in str(x).split() if word not in stopwords]))
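
To try the same idea without the NLTK corpus installed, here is a minimal, self-contained sketch. The two short word lists are hand-written stand-ins (an assumption for illustration, not the real NLTK lists), and `remove_stopwords` is a hypothetical helper name:

```python
# Hand-written stand-ins for nltk.corpus.stopwords.words('english') and
# nltk.corpus.stopwords.words('french'); the real lists are much longer.
ENGLISH_SAMPLE = ["the", "and", "in", "is", "a", "of"]
FRENCH_SAMPLE = ["la", "et", "dans", "les", "des", "est", "du", "le", "une", "en"]

# Union of both lists; a set gives fast membership tests.
STOPWORDS = set(ENGLISH_SAMPLE) | set(FRENCH_SAMPLE)


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase the text, split on whitespace, and drop any stop word."""
    return " ".join(w for w in str(text).lower().split() if w not in stopwords)


print(remove_stopwords("Le chat est dans la maison and the dog is in the garden"))
# prints: chat maison dog garden
```

With a DataFrame, the same helper could presumably be passed straight to `df['Description'].apply(remove_stopwords)`. Concatenating the two lists with `+` before building the set should give the same result as the `|` union.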

Answer 2

How about:

import nltk
import pandas as pd
from functools import reduce


df = pd.DataFrame(data={'Description': ['hello', 'dupa']})

def apply_filtering(val, df):
    # Build the stop word set once per language instead of once per row
    stop = set(nltk.corpus.stopwords.words(val))
    df['Description'] = df['Description'].str.lower()
    df['Description'] = df['Description'].apply(lambda x: str(x).split())
    df['Description'] = (df['Description']
                         .apply(lambda x: [word for word in x if word not in stop])
                         )
    # Join with a space so the remaining words are not glued together
    df['Description'] = df['Description'].apply(lambda x: ' '.join(x))
    return df


# Apply the filter once per language, feeding each result into the next call
elo = lambda l: reduce(lambda y,x: apply_filtering(x,y), l, df)
elo(['english', 'french'])
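
The reduce call above runs one filtering pass per language, each pass working on the output of the previous one. Here is a small sketch of that pattern on plain strings, without pandas or NLTK. The `SAMPLE_STOPWORDS` dictionary and the names `filter_language` and `filter_all` are hypothetical, and the word sets are tiny stand-ins for the real NLTK lists:

```python
from functools import reduce

# Hypothetical per-language stop word sets standing in for the NLTK lists.
SAMPLE_STOPWORDS = {
    "english": {"the", "and", "in", "is"},
    "french": {"la", "et", "dans", "les", "le", "est"},
}


def filter_language(words, language):
    """Drop the stop words of one language from a list of tokens."""
    stop = SAMPLE_STOPWORDS[language]
    return [w for w in words if w not in stop]


def filter_all(text, languages):
    """Run filter_language once per language, feeding each result into the next."""
    tokens = str(text).lower().split()
    return " ".join(reduce(filter_language, languages, tokens))


print(filter_all("La maison est in the garden", ["english", "french"]))
# prints: maison garden
```

Because each pass only removes words, the order of the languages does not change the result, and filtering once against the union of the sets should be equivalent.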

Removing French and English stop words
