Question

I have a pandas DataFrame in which one column contains conversation data. I preprocessed it as follows:

from gensim.utils import simple_preprocess

def preprocessing(text):
    return [word for word in simple_preprocess(str(text), min_len=2, deacc=True) if word not in stop_words]

dataset['preprocessed'] = dataset.apply(lambda row: preprocessing(row['msgText']), axis = 1)

To make it one-dimensional, I used (both):

processed_docs = data['preprocessed']

and:

processed_docs = data['preprocessed'].tolist()

It now looks like this:

>>> processed_docs[:2]
0    ['klinkt', 'alsof', 'zwaar', 'dingen', 'spelen...
1    ['waar', 'liefst', 'meedenk', 'betekenen', 'pe...

In both cases, I then used:

dictionary = gensim.corpora.Dictionary(processed_docs)

However, in both cases I get the error message:

TypeError: doc2bow expects an array of unicode tokens on input, not a single string

How can I modify my data so that this TypeError no longer occurs?

Since similar questions have been asked before, I looked at:

Gensim: TypeError: doc2bow expects an array of unicode tokens on input, not a single string

Based on the first answer there, I tried the following solution:

dictionary = gensim.corpora.Dictionary([processed_docs.split()])

and got the error(s):

AttributeError: 'Series'('List') object has no attribute 'split'

The second answer says that the input must be tokens, which is already the case for me.

Furthermore, based on (TypeError: doc2bow expects an array of unicode tokens on input, not a single string when using gensim.corpora.Dictionary()), I used the .tolist() approach shown above, which does not work either.

Answer 1

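The question was posted a long time ago, but for anyone still wondering: pandas stores the lists as strings here (this typically happens after the DataFrame is written to a file such as CSV and read back), hence the TypeError. One way to interpret such a string as a list is literal_eval:

from ast import literal_eval

Then apply literal_eval to each document before passing it to the dictionary.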
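
A minimal sketch of how literal_eval could be applied, assuming each row really is a string such as "['klinkt', 'alsof']". The helper name to_token_lists and the sample rows are hypothetical and only serve as illustration:

```python
from ast import literal_eval


def to_token_lists(docs):
    """Return every document as a list of tokens.

    Strings such as "['a', 'b']" are parsed with literal_eval;
    documents that are already sequences are copied into lists.
    """
    return [literal_eval(doc) if isinstance(doc, str) else list(doc) for doc in docs]


# Hypothetical rows mimicking the string-encoded lists from the question
raw_docs = ["['klinkt', 'alsof', 'zwaar']", "['waar', 'liefst', 'meedenk']"]
token_lists = to_token_lists(raw_docs)
print(type(token_lists[0]).__name__, token_lists[0])
```

The resulting token_lists is a list of lists of strings, which is the shape gensim.corpora.Dictionary expects for its documents. With a pandas Series, dataset['preprocessed'].apply(literal_eval) performs the same conversion column-wise.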

Answer 2

I think you need:

dictionary = gensim.corpora.Dictionary([processed_docs[:]])

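The slice [:] covers the whole collection. You can write [2:] to start at index 2 and run to the end, [:7] to start at 0 and stop just before index 7, or [2:7] for the range in between. You can also try [:len(processed_docs)].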
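
The slice forms mentioned above can be checked on a small stand-in list. The variable docs below is a hypothetical substitute for processed_docs, not the asker's data:

```python
# Hypothetical stand-in for processed_docs: ten one-token documents
docs = [[f"tok{i}"] for i in range(10)]

tail = docs[2:]           # index 2 through the end
head = docs[:7]           # indices 0 to 6; index 7 is excluded
middle = docs[2:7]        # indices 2 to 6
full = docs[:len(docs)]   # shallow copy of the whole list, same as docs[:]

print(len(tail), len(head), len(middle), len(full))  # 8 7 5 10
```

Note that wrapping a slice in an extra list, as in [processed_docs[:]], hands the whole collection to Dictionary as a single document, so each element is treated as one token.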

I hope this helps :)

How do I feed a Series/list made up of separate tokens into a Gensim Dictionary?
