Question

I have a list of tuples, each holding a comment (a text string) and the date the comment was posted. For example:

comments[1]

(datetime.date(2016, 8, 29),
 'I played with these ATM before but they are just too expensive way to buy bitcoins.There were a few in the city I live but many of them already stop operation, most likely because no one actually uses them.')

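I have a function lda_description that returns a list of (topic, value) tuples, where topic is a number between 1 and 40 and the returned list has between 1 and 40 entries. For example: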

lda_description(comments[1][1])

[(10, 0.43287377217078077), (14, 0.43712141484779793), (21, 0.068338146314754045)]

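I want to map the lda_description results into a pandas DataFrame with 40 topic columns and a datetime index. Each cell should hold the sum of the lda_description values for that topic over all comments posted on that date.

I have a solution, but it does not seem efficient to me. Maybe someone can help :)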

# Create an empty DataFrame: one row per day, one column per topic
import datetime
import pandas as pd

df = pd.DataFrame(0.0, index=pd.date_range(datetime.datetime(2013,12,1), datetime.datetime(2016,11,21)),
                  columns=range(1,41))
df["count"] = 0

i = 0
for com in comments:

    # Log progress every 50000 comments
    if i % 50000 == 0:
        print(datetime.datetime.now(), i)
    i += 1

    topic_dist = lda_description(com[1])
    day = pd.Timestamp(com[0])  # the index holds Timestamps, not datetime.date

    for dist in topic_dist:

        # .at replaces set_value and .ix, which were removed in pandas 1.0
        df.at[day, dist[0]] += dist[1]
        df.at[day, 'count'] += 1

Answer 1

I suggest collecting the LDA values first, then building the DataFrame from the prepared data. For example:

# sample data
import numpy as np
import pandas as pd
n = 40
dates = pd.date_range("2013-12-01", "2016-11-21")
corpus = np.repeat("foo", len(dates))

# toy function, outputs (<topic number>, <topic-membership proba>) tuples
def lda_description(doc, n):
    return list(zip(np.arange(1, n + 1), np.random.random(size=n)))

# each element of data has the LDA topic-membership probability for n=40 topics
data = [[lda[1] for lda in lda_description(doc, n)] for doc in corpus]

Now just build the DataFrame:

df = pd.DataFrame(data, index=dates, columns=range(1,n+1))

df.head()
                   1         2         3  ...        39        40
2013-12-01  0.756845  0.741939  0.334812  ...  0.383386  0.687347
2013-12-02  0.013250  0.143308  0.025458  ...  0.413655  0.581954
2013-12-03  0.464378  0.889262  0.208653  ...  0.885814  0.685987
2013-12-04  0.816939  0.613601  0.958807  ...  0.761439  0.758965
2013-12-05  0.856021  0.191507  0.956722  ...  0.869742  0.543119
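
Note that the toy function above returns a value for every topic, whereas the lda_description in the question returns only some of them. A minimal sketch of padding such a sparse result to a fixed-length row, assuming topics are numbered 1 to n (to_dense_row is a hypothetical helper name, not part of the original code):

```python
import numpy as np

def to_dense_row(topic_dist, n=40):
    """Expand sparse (topic, value) pairs into a length-n vector.

    Assumes topics are numbered 1..n, as described in the question.
    """
    row = np.zeros(n)
    for topic, value in topic_dist:
        row[topic - 1] = value  # topic k goes to position k - 1
    return row

# Sample output copied from the question
sparse = [(10, 0.43287377217078077), (14, 0.43712141484779793), (21, 0.068338146314754045)]
row = to_dense_row(sparse)
```

Topics missing from the list are left at 0, so every document yields a row of the same length.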

If your count column is only meant to keep a sequential row index, create it with:

df = df.assign(count=range(len(dates)))  # assign returns a new DataFrame
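
If several comments share a date, as in the question, the values also need to be summed per date. One possible sketch collects everything in long format and lets groupby do the summing. The stub function and the three sample comments below are made up for illustration; the real lda_description would take the stub's place.

```python
import datetime
import pandas as pd

def lda_description_stub(text):
    # Hypothetical stand-in for the real model: returns sparse (topic, value) pairs
    return [(10, 0.4), (14, 0.5)] if "bitcoin" in text else [(1, 1.0)]

# Made-up sample data in the same (date, text) shape as the question
comments = [
    (datetime.date(2016, 8, 29), "bitcoin ATM"),
    (datetime.date(2016, 8, 29), "hello"),
    (datetime.date(2016, 8, 30), "bitcoin again"),
]

n = 40

# One long-format record per (date, topic, value)
records = [
    (date, topic, value)
    for date, text in comments
    for topic, value in lda_description_stub(text)
]
long_df = pd.DataFrame(records, columns=["date", "topic", "value"])
long_df["date"] = pd.to_datetime(long_df["date"])

# Sum per date and topic, spread topics into columns, pad to topics 1..n
summed = (
    long_df.groupby(["date", "topic"])["value"].sum()
    .unstack(fill_value=0.0)
    .reindex(columns=range(1, n + 1), fill_value=0.0)
)

# Number of (topic, value) entries per date, mirroring the counter in the question's loop
summed["count"] = long_df.groupby("date").size()
```

Only dates that have comments appear in the result. To get one row for every day in a range, reindex the rows against pd.date_range with fill_value=0.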
