How to resolve "ValueError: setting an array element with a sequence"

Time: 2019-09-23 17:50:25

Tags: python dataframe nlp tf-idf lda

Here is an example of my dataset:

import pandas as pd

d = {'TEXT': ['History: A 59  year  old female, was sent to R/O lung nodule. Findings:  Lungs and airway:  The study reveals a speculated nodule with pleural tagging at anterior basal segment of LLL, measured 1.9x1.4x2.0 cm in size. Pleural tagging is seen. Partial encasement of subsegmental bronchi is seen.  CA lung is considered.','History: A 59  year  old woman with history of lung cancer S/P left lower lobectomy with close to pleural margin and left adrenal nodule , was sent for evaluation before post  operative RT. Findings: Comparison is made to the prior study on 03/02/2009. Chest:   The study reveals evidence of left lower lobectomy with compensatory hyperinflation of the LUL.']}
df2 = pd.DataFrame(data=d)

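I want to apply Latent Dirichlet Allocation (LDA) to determine the topic of each sentence. I have already trained my model for this and want to test it on this data.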
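Since the goal is one topic per sentence, it may help to reshape the data so that each sentence sits in its own row. The following is a minimal sketch under stated assumptions: the `reports` frame is a shortened stand-in for the real data, `split_sentences` is a hypothetical naive regex splitter used in place of `nltk.sent_tokenize`, and the `sentence` and `report_id` column names are illustrative only.

```python
import re

import pandas as pd

# Shortened stand-in for the real reports (assumption for illustration).
reports = pd.DataFrame({"TEXT": [
    "Pleural tagging is seen. Partial encasement of subsegmental bronchi is seen. CA lung is considered.",
    "Findings: Comparison is made to the prior study. The study reveals evidence of left lower lobectomy.",
]})


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


reports["sent_token"] = reports["TEXT"].apply(split_sentences)

# explode() gives every list element its own row; the original index is
# repeated, so each sentence can still be traced back to its report.
sentences = reports[["sent_token"]].explode("sent_token")
sentences = sentences.rename(columns={"sent_token": "sentence"})
sentences["report_id"] = sentences.index
sentences = sentences.reset_index(drop=True)
print(sentences)
```

With one sentence per row, reports that contain different numbers of sentences no longer produce cells of different lengths.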

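To get to the LDA step, I tokenize the text into sentences, because I want a topic classification for each sentence. After sentence tokenization I apply TF-IDF and then LDA. The error appears when I reach the LDA step. Here is my code.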

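import nltk
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Note: identity_tokenizer is not defined in this snippet
df2["sent_token"] = df2["TEXT"].apply(nltk.sent_tokenize)
vectoriser = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)
df2['tfidf1'] = vectoriser.fit_transform(df2['sent_token'])
lda = LatentDirichletAllocation(n_components=5)
df2['tfidf_lda'] = lda.fit_transform(df2['tfidf1'])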

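This is where I get the error "ValueError: setting an array element with a sequence". Looking at similar reports of this error, I found that it may occur because rows contain different numbers of sentences, which gives them different lengths. But that variation is inherent in my data, and I am not sure what the actual problem is. Please help!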
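A minimal sketch of one possible way around this, not a confirmed diagnosis: NumPy raises this message when nested sequences of unequal length are forced into a rectangular numeric array, which can happen when an object-typed DataFrame column is passed to scikit-learn. The sketch first reproduces the message with a ragged list, then runs TF-IDF and LDA on a flat list of sentences while keeping the TF-IDF matrix in its own variable rather than in a DataFrame column. The sample sentences, the variable names, the default tokenizer, and `random_state=0` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Part 1: rows of unequal length cannot form a rectangular float array.
ragged = [[0.1, 0.2, 0.3], [0.4]]
try:
    np.array(ragged, dtype=float)
    error_message = ""
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Part 2: one string per sentence (assumed sample data).
sentences = [
    "The study reveals a nodule in the left lower lobe.",
    "Pleural tagging is seen.",
    "Comparison is made to the prior study.",
    "The study reveals evidence of left lower lobectomy.",
]

vectoriser = TfidfVectorizer(stop_words="english")
# Sparse matrix of shape (n_sentences, n_terms), kept out of the DataFrame.
tfidf = vectoriser.fit_transform(sentences)

lda = LatentDirichletAllocation(n_components=5, random_state=0)
# Dense array of shape (n_sentences, 5): one topic distribution per sentence.
topic_weights = lda.fit_transform(tfidf)

# Index of the highest-weighted topic for each sentence.
dominant_topic = topic_weights.argmax(axis=1)
print(topic_weights.shape, dominant_topic)
```

Two side notes: `fit_transform` refits the model, so scoring new text against an already trained vectorizer and LDA model would use `transform` on both instead; and LDA is usually fitted on raw term counts (for example from `CountVectorizer`), with TF-IDF kept here only to mirror the question.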

0 answers:

No answers