Question

I am a Python beginner and have run into the following problem. My starting point is a DataFrame that looks like this:

    class       plaintext
    x           [agilent, dissolution, exchange, solve, product,...]
    y           [information, data, germany, laptop, berlin...]
    z           [login, system, desk, product, solve, usb, ...]           
    x           [motioncoat, actega, actega, germany, home,...]
    z           [agilent, dissolution, exchange, solve, product,...]

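What I want to do is build a dictionary, a corpus, and a bag-of-words (BoW) in gensim and use them for HDP and TF-IDF. The problem is that I am interested in the topics and the relevance of words per class, so this is what I do: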

df = df.groupby('class')['plaintext'].agg(list).reset_index()

Then I get something like this:

    class       plaintext
    x           [[certificate, quality, ...][motioncoat, actega, actega, germany, home,...]]
    y           [information, data, germany, laptop, berlin...]
    z           [[login, system, desk, product,...][agilent, dissolution, exchange, solve,...]]

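But for the dictionary I need a single flat list per row, like [agilent, dissolution, exchange, solve, product,...]. I tried different approaches, such as df['plaintext'] = [sum(x, []) for x in df['plaintext']] or loops, but it always crashes. Since I have a large amount of data, I suspect the lists are becoming too long. So far I have been using the following code, which works per plaintext row rather than per class.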

# Build the dictionary (token -> integer id) from the tokenized rows
from gensim.corpora import Dictionary
text_dictionary = Dictionary(df['plaintext'])
print(text_dictionary)

# Build the bag-of-words corpus: one list of (token_id, count) pairs per row
text_corpus = [text_dictionary.doc2bow(text) for text in df['plaintext']]
print(text_corpus[438])

Is there a way to get the corpus and the bag-of-words grouped by 'class'?

groupby and agg commands on a DataFrame crash because the aggregated list/str is too long
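Regarding the flattening step itself: `sum(x, [])` builds a new list at every addition, so its cost grows quadratically with the number of sublists. Whether that is the actual cause of the crash here is an assumption, but a linear-time alternative is `itertools.chain.from_iterable`. A minimal sketch with made-up sample data, flattening directly inside the `groupby` instead of first aggregating into nested lists:

```python
from itertools import chain

import pandas as pd

# Hypothetical sample data mirroring the question's table.
df = pd.DataFrame({
    "class": ["x", "y", "z", "x", "z"],
    "plaintext": [
        ["agilent", "dissolution", "exchange"],
        ["information", "data", "germany"],
        ["login", "system", "desk"],
        ["motioncoat", "actega", "germany"],
        ["agilent", "solve", "product"],
    ],
})

# Concatenate the token lists of each class into one flat list per class.
grouped = (
    df.groupby("class")["plaintext"]
      .agg(lambda lists: list(chain.from_iterable(lists)))
      .reset_index()
)

print(grouped)
```

Each row of `grouped['plaintext']` is then a flat token list, which is the shape that `Dictionary(...)` and `doc2bow(...)` expect.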

0 answers: