I have a list of tuples, each holding a comment (a text string) and the date the comment was posted. For example:
comments[1]
(datetime.date(2016, 8, 29),
'I played with these ATM before but they are just too expensive way to buy bitcoins.There were a few in the city I live but many of them already stop operation, most likely because no one actually uses them.')
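I have a function lda_description that returns a list of (topic, value) tuples, where topic is a number between 1 and 40 and the returned list has between 1 and 40 elements. For example: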
lda_description(comments[1][1])
[(10, 0.43287377217078077), (14, 0.43712141484779793), (21, 0.068338146314754045)]
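The problem is that I want to map the lda_description results into a pandas DataFrame with 40 topic columns and a datetime index. Each cell should hold the sum of the lda_description values for that topic over all comments posted on that date.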
I have a solution, but it seems inefficient to me. Maybe someone can help :)
# Create an empty dataframe: one row per day, one float column per topic
import datetime
import pandas as pd

df = pd.DataFrame(0.0, index=pd.date_range(datetime.datetime(2013, 12, 1), datetime.datetime(2016, 11, 21)),
                  columns=range(1, 41))
df["count"] = 0
for i, com in enumerate(comments):
    if i % 50000 == 0:
        print(datetime.datetime.now(), i)  # progress log
    day = pd.Timestamp(com[0])  # match the DatetimeIndex
    topic_dist = lda_description(com[1])
    for topic, value in topic_dist:
        # df.at replaces df.set_value / df.ix, which newer pandas no longer provides
        df.at[day, topic] += value
    df.at[day, "count"] += 1
Answer 0 (score: 0)
I suggest collecting the LDA values first and then building the DataFrame from the prepared data. For example:
# sample data
import numpy as np
import pandas as pd
n = 40
dates = pd.date_range("2013-12-01", "2016-11-21")
corpus = np.repeat("foo", len(dates))
# toy function, outputs (<topic number>, <topic-membership proba>) tuples
def lda_description(doc, n):
    # topics are numbered 1..n to match the DataFrame columns below
    return list(zip(np.arange(1, n + 1), np.random.random(size=n)))
# each element of data has the LDA topic-membership probability for n=40 topics
data = [[lda[1] for lda in lda_description(doc, n)] for doc in corpus]
Now just build the DataFrame:
df = pd.DataFrame(data, index=dates, columns=range(1,n+1))
df.head()
                   1         2         3  ...        39        40
2013-12-01  0.756845  0.741939  0.334812  ...  0.383386  0.687347
2013-12-02  0.013250  0.143308  0.025458  ...  0.413655  0.581954
2013-12-03  0.464378  0.889262  0.208653  ...  0.885814  0.685987
2013-12-04  0.816939  0.613601  0.958807  ...  0.761439  0.758965
2013-12-05  0.856021  0.191507  0.956722  ...  0.869742  0.543119
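The toy function above returns a value for every topic and assumes one document per date. In the question, lda_description returns only some of the topics for each comment, and several comments can share a date. The following is a minimal sketch of one way to handle that case, not a definitive implementation: the helper name `topic_sums_by_date`, the stand-in `fake_lda` and the sample comments are hypothetical and exist only for illustration. It fills one dense row per comment and then lets `groupby` sum the rows that share a date.

```python
import datetime

import numpy as np
import pandas as pd


def topic_sums_by_date(comments, describe, n_topics=40):
    """Sum per-topic values over all comments that share a date.

    `describe` is any callable returning (topic, value) tuples with
    topics numbered 1..n_topics (an assumption taken from the question).
    """
    rows = np.zeros((len(comments), n_topics))
    for i, (_, text) in enumerate(comments):
        for topic, value in describe(text):
            rows[i, topic - 1] += value  # topic k goes to column position k - 1
    index = pd.to_datetime([date for date, _ in comments])
    df = pd.DataFrame(rows, index=index, columns=range(1, n_topics + 1))
    # Collapse comments posted on the same date into a single summed row.
    return df.groupby(level=0).sum()


# Hypothetical stand-in for the real lda_description, with fixed outputs.
def fake_lda(text):
    table = {
        "a": [(10, 0.5), (14, 0.25)],
        "b": [(10, 0.25)],
        "c": [(40, 1.0)],
    }
    return table[text]


comments = [
    (datetime.date(2016, 8, 29), "a"),
    (datetime.date(2016, 8, 29), "b"),
    (datetime.date(2016, 8, 30), "c"),
]
result = topic_sums_by_date(comments, fake_lda)
```

Dates without any comment do not appear in `result`. If a row for every day is needed, `result.reindex(pd.date_range(start, end), fill_value=0)` adds them. For a very large corpus the intermediate array holds one row per comment, so accumulating into a per-date dictionary may be preferable.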
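If your count column is only meant to hold a sequential row index, create it with the following (assign returns a new DataFrame, so keep the result):
df = df.assign(count=range(len(dates)))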
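The loop in the question increments count once per comment, so the column may instead be meant as the number of comments per date. If so, a sketch along these lines could work; the small `df` and the `comment_dates` list are made-up sample data, not taken from the question.

```python
import datetime

import pandas as pd

# Made-up sample: a small per-day frame and the dates of three comments.
df = pd.DataFrame(0.0, index=pd.date_range("2016-08-28", "2016-08-30"),
                  columns=range(1, 4))
comment_dates = [
    datetime.date(2016, 8, 29),
    datetime.date(2016, 8, 29),
    datetime.date(2016, 8, 30),
]

# One entry of 1 per comment, summed per date.
ones = pd.Series(1, index=pd.to_datetime(comment_dates))
counts = ones.groupby(level=0).sum()

# Align with the frame's index; days without comments get 0.
df = df.assign(count=counts.reindex(df.index, fill_value=0))
```

The `reindex` step matters because `assign` aligns on the index, and days missing from `counts` would otherwise become NaN.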