I'm working on some code that shows how a topic model's distribution changes over time. Right now, the DataFrame looks like this:
doc_id date topic_dist
1 2007-01-01 [.2,.5,.3]
2 2007-03-02 [.8,.1,.1]
...
My goal is to group the documents by date (by month, year, or quarter) and sum the arrays element-wise (all arrays have the same length), producing output like this:
month topic_sum
2007-01 [54.8, 98.3, 61.0]
So far, I have tried:
year_groups = df.groupby(df['date'].map(lambda x: x.year))
output = pd.DataFrame()
output['yearly_topic_dist'] = year_groups.apply(lambda x: sum(x['topic_dist']))
However, I can't figure out how to sum each position of the arrays separately and output another array.
Answer 0 (score: 1)
import pandas as pd
df = pd.DataFrame([[1, '2007-01-01', [.2, .5, .3]],
[2, '2007-01-02', [.8, .5, .3]]],
columns=['doc_id', 'date', 'topic_dist'])
df.date = pd.to_datetime(df.date)
df = df.set_index('date')
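def topic_adder(s):
    # Expand each list into its own columns, sum column-wise, return a list
    return s.apply(pd.Series).sum().tolist()
# pd.TimeGrouper was removed in pandas 1.0; pd.Grouper(freq='M') is its replacement
df.groupby(pd.Grouper(freq='M'))['topic_dist'].apply(topic_adder)
The result looks like: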
date
2007-01-31 [1.0, 1.0, 0.6]
Name: topic_dist, dtype: object
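As a possible variation on the same idea, here is a minimal sketch that expands the lists into one numeric column per topic first and then groups on calendar periods, so only periods that actually contain documents show up. The helper name `sum_topics_by_period` and the output name `topic_sum` are made up for illustration, and the sketch assumes `date` is a regular datetime column rather than the index.

```python
import pandas as pd

df = pd.DataFrame({
    "doc_id": [1, 2, 3],
    "date": pd.to_datetime(["2007-01-01", "2007-01-02", "2008-01-14"]),
    "topic_dist": [[.2, .5, .3], [.8, .5, .3], [.1, .2, .3]],
})

def sum_topics_by_period(frame, freq="M"):
    """Hypothetical helper: element-wise sum of topic_dist per period."""
    # One numeric column per topic (all lists have the same length).
    topics = pd.DataFrame(frame["topic_dist"].tolist(), index=frame.index)
    # Map each date to its calendar period, e.g. "M" (month) or "Q" (quarter).
    periods = frame["date"].dt.to_period(freq)
    wide = topics.groupby(periods).sum()
    # Collapse the per-topic columns back into one list per period.
    return pd.Series(wide.to_numpy().tolist(), index=wide.index, name="topic_sum")

monthly = sum_topics_by_period(df, "M")
quarterly = sum_topics_by_period(df, "Q")
```

Grouping on periods instead of a time grouper means empty months are simply absent from the result, which may or may not be what you want.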
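Answer 1 (score: 1)
I may be doing something wrong, but @piRSquared's solution seems to break on the example DataFrame below when you group by single months, although it does not break when grouping by 12 months. I suspect this has to do with the dates spanning more than one year.
Another option is to convert the topic_dist column to NumPy arrays and apply np.sum() to your time groups: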
from datetime import datetime
import numpy as np
import pandas as pd
df = pd.DataFrame([[1, '2007-01-01', [.2, .5, .3]],
[2, '2007-01-02', [.8, .5, .3]],
[3, '2008-01-14', [0.1, 0.2, 0.3]]],
columns=['doc_id', 'date', 'topic_dist'])
df.date = pd.to_datetime(df.date)
df = df.set_index('date')
df.topic_dist = df.topic_dist.apply(lambda x: np.array(x))
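To illustrate why the conversion to arrays matters, here is a small sketch using plain Python and NumPy only (the variable names are mine): `+` on lists concatenates, while `+` on NumPy arrays adds element-wise, which is what the summation relies on.

```python
import numpy as np

dists = [[.2, .5, .3], [.8, .5, .3]]

# Adding two Python lists concatenates them into one longer list.
concatenated = dists[0] + dists[1]

# Adding NumPy arrays works position by position instead.
arrays = [np.array(d) for d in dists]
elementwise = arrays[0] + arrays[1]

# Summing along axis 0 gives the same per-topic totals for any number of rows.
totals = np.sum(arrays, axis=0)
```

Note that the built-in `sum()` on a sequence of lists raises a `TypeError`, because it starts from the integer 0 and cannot add a list to it.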
You can group by single months:
# Group by single months (pd.Grouper replaces pd.TimeGrouper, which was removed in pandas 1.0)
df.groupby(pd.Grouper(freq='M'))['topic_dist'].apply(lambda x: np.sum(x))
date
2007-01-31 [1.0, 1.0, 0.6]
2007-02-28 0
2007-03-31 0
2007-04-30 0
2007-05-31 0
2007-06-30 0
2007-07-31 0
2007-08-31 0
2007-09-30 0
2007-10-31 0
2007-11-30 0
2007-12-31 0
2008-01-31 [0.1, 0.2, 0.3]
Name: topic_dist, dtype: object
Or group by 12 months:
df.groupby(pd.Grouper(freq='12M'))['topic_dist'].apply(lambda x: np.sum(x))
date
2007-01-31 [1.0, 1.0, 0.6]
2008-01-31 [0.1, 0.2, 0.3]
Name: topic_dist, dtype: object
Or by other time intervals:
df.groupby(pd.Grouper(freq='5M'))['topic_dist'].apply(lambda x: np.sum(x))
date
2007-01-31 [1.0, 1.0, 0.6]
2007-06-30 0
2007-11-30 0
2008-04-30 [0.1, 0.2, 0.3]
Name: topic_dist, dtype: object
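In the outputs above, empty bins show up as a scalar 0 mixed in with the arrays. If a uniform shape is preferred, one possible approach (a sketch under my own assumptions, not part of the original answer) is to keep one numeric column per topic, group on month periods, and then reindex over the full month range so that empty months become rows of zeros. Here `date` is assumed to be a regular column, and the names `monthly_full` and `all_months` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "doc_id": [1, 2, 3],
    "date": pd.to_datetime(["2007-01-01", "2007-01-02", "2008-01-14"]),
    "topic_dist": [[.2, .5, .3], [.8, .5, .3], [.1, .2, .3]],
})

# One numeric column per topic, summed per calendar month.
topics = pd.DataFrame(df["topic_dist"].tolist(), index=df.index)
monthly = topics.groupby(df["date"].dt.to_period("M")).sum()

# Every month between the first and last observed month, inclusive.
all_months = pd.period_range(monthly.index.min(), monthly.index.max(), freq="M")

# Months without documents become all-zero rows instead of a scalar 0.
monthly_full = monthly.reindex(all_months, fill_value=0.0)
```

Each row of `monthly_full` can be turned back into a list with `.tolist()` if the single-column layout from the question is needed.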