I have a very large time series dataset at one-minute resolution (spanning 3 months), in the following format:
datetime,val1,val2,val3,val4,val5,val6,val7,val8,val9,val10,val11,val12
1/06/2017 0:00,0,0,0,0,0,0,0,0,0,0.011,0,0.036
1/06/2017 0:01,0,0,0,0,0,0,0,0,0,0.011,0,0.036
...
1/06/2017 23:59,0,0,0,0,0,0,0,0,0,0.011,0,0.035
2/06/2017 0:00,0,0,0,0,0,0,0,0,0,0.014,0,0.036
2/06/2017 0:01,0,0,0,0,0,0,0,0,0,0.011,0,0.036
...
2/06/2017 23:59,0,0,0,0,0,0,0,0,0,0.011,0,0.035
....
31/08/2017 0:00,0,0.2,0,0,0,0.56,0,0,0,0.014,0,0.036
31/08/2017 0:01,0,0.23,0,0,0,0,0,0,0,0.011,0,0.032
...
31/08/2017 23:59,0,0,0,0,0,0,.55,0,0,0.011,0,0.034
What is the most efficient way to get the monthly average of each column using pandas? The expected output would be:
month,val1,val2,val3,val4,val5,val6,val7,val8,val9,val10,val11,val12
06/2017,0,0,0,0,0,0,0,0,0,0.011,0,0.036
07/2017,0,0,0,0,0,0,0,0,0,0.014,0,0.036
08/2017,0,0,0.21,0,0,0,0,0.52,0,0.011,0,0.036
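Currently I read the dataset one day at a time, accumulate the daily data, and then divide by the number of days in the month. This is inefficient and takes a lot of time.
Answer 0: (score: 1)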
First convert the column with to_datetime, then call DataFrame.resample with MS (month start) and take the mean, and finally change the format of the DatetimeIndex to MM/YYYY with DatetimeIndex.strftime. Alternatively, pass the converted datetime column through Series.dt.strftime to groupby and aggregate with mean. The resample version:
df['datetime'] = pd.to_datetime(df['datetime'], format='%d/%m/%Y %H:%M')
df = df.resample('MS', on='datetime').mean()
df.index = df.index.strftime('%m/%Y')
print (df)
         val1      val2  val3  val4  val5      val6      val7  val8  val9  \
06/2017   0.0  0.000000   0.0   0.0   0.0  0.000000  0.000000   0.0   0.0
07/2017   NaN       NaN   NaN   NaN   NaN       NaN       NaN   NaN   NaN
08/2017   0.0  0.143333   0.0   0.0   0.0  0.186667  0.183333   0.0   0.0

          val10  val11     val12
06/2017  0.0115    0.0  0.035667
07/2017     NaN    NaN       NaN
08/2017  0.0120    0.0  0.034000
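The groupby variant mentioned in this answer can be sketched as follows. This is a minimal, self-contained illustration rather than the answerer's exact code: the inline `csv_text` sample and the names `month` and `monthly` are assumptions made for the example, and only two of the twelve value columns are included to keep it short.

```python
import io
import pandas as pd

# Small stand-in for the real file, built from rows shown in the question
# (only val10 and val12 are kept; this subset is an assumption for brevity).
csv_text = """datetime,val10,val12
1/06/2017 0:00,0.011,0.036
1/06/2017 0:01,0.011,0.036
2/06/2017 0:00,0.014,0.036
31/08/2017 0:00,0.014,0.036
31/08/2017 0:01,0.011,0.032
"""

df = pd.read_csv(io.StringIO(csv_text))
df['datetime'] = pd.to_datetime(df['datetime'], format='%d/%m/%Y %H:%M')

# Build the MM/YYYY label for each row, then average the value columns per label.
month = df['datetime'].dt.strftime('%m/%Y').rename('month')
monthly = df.drop(columns='datetime').groupby(month).mean()
print(monthly)
```

Unlike resample, groupby only produces groups for labels that actually occur, so a month with no rows (July in the sample above) does not show up as a NaN row. One caveat: the MM/YYYY labels are strings and sort lexically, so the row order is only chronological while the data stays within a single year.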
Answer 1: (score: 1)
pandas read_csv and to_csv are all you need:
import pandas as pd

df = pd.read_csv('input.csv', parse_dates=['datetime'])
# `out` is the output file path or an open file handle
df.groupby(df.datetime.dt.strftime('%m/%Y')).mean().rename_axis('month').to_csv(out, float_format='%.06f')
With your input data (with the ... lines filtered out), this gives:
month,val1,val2,val3,val4,val5,val6,val7,val8,val9,val10,val11,val12
01/2017,0,0.000000,0,0,0,0.000000,0.000000,0,0,0.011000,0,0.035667
02/2017,0,0.000000,0,0,0,0.000000,0.000000,0,0,0.012000,0,0.035667
08/2017,0,0.143333,0,0,0,0.186667,0.183333,0,0,0.012000,0,0.034000
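The 01/2017 and 02/2017 rows in the output above do not correspond to any month in the input, which suggests that the day-first dates (1/06/2017, 2/06/2017) were read month-first. The sketch below shows one way to avoid that by parsing with an explicit format before grouping. The inline sample and the names `csv_text`, `labels`, `monthly` and `buf` are assumptions for illustration, and an in-memory buffer stands in for the real output file.

```python
import io
import pandas as pd

# Stand-in for input.csv, using rows from the question (two value columns only).
csv_text = """datetime,val10,val12
1/06/2017 0:00,0.011,0.036
2/06/2017 0:00,0.014,0.036
31/08/2017 0:00,0.014,0.036
"""

df = pd.read_csv(io.StringIO(csv_text))
# Parse with an explicit day-first format so 1/06/2017 is read as 1 June.
df['datetime'] = pd.to_datetime(df['datetime'], format='%d/%m/%Y %H:%M')

labels = df['datetime'].dt.strftime('%m/%Y')
monthly = (
    df.drop(columns='datetime')   # average the value columns only
      .groupby(labels)
      .mean()
      .rename_axis('month')
)

buf = io.StringIO()               # in-memory stand-in for the output file
monthly.to_csv(buf, float_format='%.6f')
print(buf.getvalue())
```

Passing `dayfirst=True` to read_csv together with `parse_dates` is another option. The explicit format is used here because it fails loudly on a malformed date instead of guessing.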