I have a DataFrame containing dates and the downloads per user for each day:
dates downloadsperuser
2004-01-02 12.51118760757315
2004-01-03 6.990049751243781
2004-01-04 6.8099547511312215
2004-01-05 22.513349514563107
2004-01-06 22.348538011695908
2004-01-07 23.895180722891567
2004-01-08 21.765680473372782
2004-01-09 20.34256926952141
2004-01-10 9.455938697318008
...
2004-02-01 9.196078431372548
2004-02-02 21.558398220244715
2004-02-03 22.293007769145394
2004-02-04 22.324115044247787
2004-02-05 21.88482834994463
2004-02-06 20.236781609195404
2004-02-07 10.708823529411765
2004-02-08 10.835329341317365
2004-02-09 24.87350054525627
2004-02-10 24.167035398230087
2004-02-11 22.676117775354417
2004-02-12 23.384444444444444
2004-02-13 20.674285714285713
2004-02-14 10.74914089347079
2004-02-15 11.64873417721519
...
2004-03-01 23.36965811965812
2004-03-02 23.127545551982852
2004-03-03 23.60235798499464
2004-03-04 23.634015069967706
2004-03-05 20.468996617812852
2004-03-06 6.608208955223881
2004-03-07 5.570446735395189
2004-03-08 23.48093220338983
2004-03-09 25.734190782422292
2004-03-10 24.919652551574377
...
I want to compute the monthly average. So far I have tried:
df = pd.read_csv('downloadsperuser.csv', parse_dates=True)
df['dates']=pd.to_datetime(df['dates'])
df['month'] = pd.PeriodIndex(df.dates, freq='M')
df['month'].value_counts().sort_index()
which gives me the number of days in each month. But I don't know how to aggregate all the values in the downloadsperuser column per month.
Answer 0 (score: 2)
You can try:
# test input
set.seed(123)
x <- sample(20, 20)
d <- c(.2, .3, .5) # assume in increasing order
o <- order(x)
b <- findInterval(cumsum(d) * sum(x), cumsum(x[o]))
g <- rep(seq_along(d), diff(c(0, b)))[order(o)]
# check distribution of result
tapply(x, g, sum) / sum(x)
## 1 2 3
## 0.1714286 0.3285714 0.5000000
Answer 1 (score: 2)
First extract the month and year from the date column, then group by both to find the mean:
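# the column in the question is named 'dates', not 'date'
df['month'] = pd.to_datetime(df['dates']).dt.month
df['year'] = pd.to_datetime(df['dates']).dt.year
# select the value column so only downloadsperuser is averaged
df.groupby(['year','month'],as_index=False)['downloadsperuser'].mean()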
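As a self-contained illustration of the same idea, here is a minimal sketch that groups by a monthly period instead of separate year and month columns. It assumes the file is comma-separated with the two columns shown in the question; the inline `csv_text` is just a handful of rows copied from the question so that the snippet runs without the real `downloadsperuser.csv`.

```python
import io
import pandas as pd

# A few rows copied from the question; the comma separator is an assumption.
csv_text = """dates,downloadsperuser
2004-01-02,12.51118760757315
2004-01-03,6.990049751243781
2004-02-01,9.196078431372548
2004-02-02,21.558398220244715
2004-03-01,23.36965811965812
2004-03-02,23.127545551982852
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["dates"])

# Group by a monthly period so that year and month live in a single key,
# then compute the mean, the sum and the number of days for each month.
monthly = (
    df.groupby(df["dates"].dt.to_period("M"))["downloadsperuser"]
      .agg(["mean", "sum", "count"])
)
print(monthly)
```

With the real data, replacing `io.StringIO(csv_text)` with the file path should be enough. Use the `sum` column if a monthly total is wanted rather than the average.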