Grouping rows by time range in a Pandas DataFrame

Asked: 2015-08-19 07:43:04

Tags: python pandas

I have a large DataFrame indexed by timestamp, and I want to assign rows to groups based on time ranges.

For example, in the following data, I group together rows that fall within 1 millisecond of the first entry in the group.

                           groupid
1999-12-31 23:59:59.000107       1
1999-12-31 23:59:59.000385       1
1999-12-31 23:59:59.000404       1
1999-12-31 23:59:59.000704       1
1999-12-31 23:59:59.001281       2
1999-12-31 23:59:59.002211       2
1999-12-31 23:59:59.002367       3
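
The windowing rule above can be sketched as a small self-contained example (the `assign_groups` helper is illustrative, not part of the original code, and uses a half-open 1 ms window):

```python
import pandas as pd

def assign_groups(index, window=pd.Timedelta(milliseconds=1)):
    """Start a new group whenever a timestamp falls at or beyond the
    1 ms window opened by the current group's first entry."""
    groups, group_id, group_start = [], 0, None
    for ts in index:
        if group_start is None or ts - group_start >= window:
            group_id += 1
            group_start = ts
        groups.append(group_id)
    return groups

idx = pd.to_datetime([
    '1999-12-31 23:59:59.000107',
    '1999-12-31 23:59:59.000385',
    '1999-12-31 23:59:59.000404',
    '1999-12-31 23:59:59.000704',
    '1999-12-31 23:59:59.001281',
    '1999-12-31 23:59:59.002211',
    '1999-12-31 23:59:59.002367',
])
df = pd.DataFrame({'groupid': assign_groups(idx)}, index=idx)
# groupid column: 1, 1, 1, 1, 2, 2, 3 -- matching the table above
```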

I have working code that does this by iterating over the rows and using the current row to slice the DataFrame:

from datetime import datetime, timedelta

import numpy as np
import pandas as pd

dts = sorted(datetime(1999, 12, 31, 23, 59, 59, x)
             for x in np.random.randint(1, 999999, 1000))
df = pd.DataFrame({'groupid': None}, index=dts)

print(df.head(20))

groupid = 1
for dt, row in df.iterrows():
    # Re-read from df: rows yielded by iterrows are snapshots and do not
    # see the assignments made below.
    if df.loc[row.name].groupid:
        continue
    end = dt + timedelta(milliseconds=1)
    group = df.loc[dt:end]  # label slicing: every row within 1 ms of dt
    df.loc[group.index, 'groupid'] = groupid
    groupid += 1

print(df.head(20))

However, as with anything based on iterrows, this is slow on a large DataFrame. I have tried apply and groupby without success. Is itertuples (which I am about to try) the best performance gain I can hope for? Can anyone offer advice?

2 Answers:

Answer 0 (score: 3)

OK, I think the following is what you want. It constructs a timedelta Series from your index by subtracting the first value from every value. We then access the microseconds component, divide by 1000, and cast the Series dtype to int:

In [86]:

df['groupid'] = ((df.index.to_series() - df.index[0]).dt.microseconds / 1000).astype(np.int32)
df
Out[86]:
                            groupid
1999-12-31 23:59:59.000133        0
1999-12-31 23:59:59.000584        0
1999-12-31 23:59:59.003544        3
1999-12-31 23:59:59.009193        9
1999-12-31 23:59:59.010220       10
1999-12-31 23:59:59.010632       10
1999-12-31 23:59:59.010716       10
1999-12-31 23:59:59.011387       11
1999-12-31 23:59:59.011837       11
1999-12-31 23:59:59.013277       13
1999-12-31 23:59:59.013305       13
1999-12-31 23:59:59.014754       14
1999-12-31 23:59:59.016015       15
1999-12-31 23:59:59.016067       15
1999-12-31 23:59:59.017788       17
1999-12-31 23:59:59.018236       18
1999-12-31 23:59:59.021281       21
1999-12-31 23:59:59.021772       21
1999-12-31 23:59:59.021927       21
1999-12-31 23:59:59.022200       22
1999-12-31 23:59:59.023104       22
1999-12-31 23:59:59.023375       23
1999-12-31 23:59:59.023688       23
1999-12-31 23:59:59.023726       23
1999-12-31 23:59:59.025397       25
1999-12-31 23:59:59.026407       26
1999-12-31 23:59:59.026480       26
1999-12-31 23:59:59.027825       27
1999-12-31 23:59:59.028793       28
1999-12-31 23:59:59.030716       30
...                             ...
1999-12-31 23:59:59.975432      975
1999-12-31 23:59:59.976699      976
1999-12-31 23:59:59.977177      977
1999-12-31 23:59:59.979475      979
1999-12-31 23:59:59.980282      980
1999-12-31 23:59:59.980672      980
1999-12-31 23:59:59.983202      983
1999-12-31 23:59:59.984214      984
1999-12-31 23:59:59.984674      984
1999-12-31 23:59:59.984933      984
1999-12-31 23:59:59.985664      985
1999-12-31 23:59:59.985779      985
1999-12-31 23:59:59.988812      988
1999-12-31 23:59:59.989324      989
1999-12-31 23:59:59.990386      990
1999-12-31 23:59:59.990485      990
1999-12-31 23:59:59.990969      990
1999-12-31 23:59:59.991255      991
1999-12-31 23:59:59.991739      991
1999-12-31 23:59:59.993979      993
1999-12-31 23:59:59.994705      994
1999-12-31 23:59:59.994874      994
1999-12-31 23:59:59.995397      995
1999-12-31 23:59:59.995753      995
1999-12-31 23:59:59.995863      995
1999-12-31 23:59:59.996574      996
1999-12-31 23:59:59.998139      998
1999-12-31 23:59:59.998533      998
1999-12-31 23:59:59.998778      998
1999-12-31 23:59:59.999915      999
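
On a tiny index the one-liner can be checked directly (the four timestamps are taken from the output above):

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime([
    '1999-12-31 23:59:59.000133',
    '1999-12-31 23:59:59.000584',
    '1999-12-31 23:59:59.003544',
    '1999-12-31 23:59:59.009193',
])
df = pd.DataFrame(index=idx)
# Microseconds elapsed since the first timestamp, truncated to whole ms.
df['groupid'] = ((df.index.to_series() - df.index[0]).dt.microseconds / 1000).astype(np.int32)
# groupid: 0, 0, 3, 9
```

Note that `.dt.microseconds` is only the microseconds component of the timedelta, so this works as long as the whole index spans less than one second, as in this data.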

Thanks to @Jeff for pointing out a cleaner method:

In [96]:
df['groupid'] = (df.index - df.index[0]).astype('timedelta64[ms]')
df

Out[96]:
                            groupid
1999-12-31 23:59:59.000884        0
1999-12-31 23:59:59.001175        0
1999-12-31 23:59:59.001262        0
1999-12-31 23:59:59.001540        0
1999-12-31 23:59:59.001769        0
1999-12-31 23:59:59.002478        1
1999-12-31 23:59:59.005001        4
1999-12-31 23:59:59.005497        4
1999-12-31 23:59:59.006908        6
1999-12-31 23:59:59.008860        7
1999-12-31 23:59:59.009257        8
1999-12-31 23:59:59.010012        9
1999-12-31 23:59:59.011451       10
1999-12-31 23:59:59.013177       12
1999-12-31 23:59:59.014138       13
1999-12-31 23:59:59.015795       14
1999-12-31 23:59:59.015865       14
1999-12-31 23:59:59.016069       15
1999-12-31 23:59:59.016666       15
1999-12-31 23:59:59.016718       15
1999-12-31 23:59:59.019058       18
1999-12-31 23:59:59.019675       18
1999-12-31 23:59:59.020747       19
1999-12-31 23:59:59.021856       20
1999-12-31 23:59:59.022959       22
1999-12-31 23:59:59.023812       22
1999-12-31 23:59:59.023938       23
1999-12-31 23:59:59.024122       23
1999-12-31 23:59:59.025332       24
1999-12-31 23:59:59.025397       24
...                             ...
1999-12-31 23:59:59.959725      958
1999-12-31 23:59:59.959742      958
1999-12-31 23:59:59.959892      959
1999-12-31 23:59:59.960345      959
1999-12-31 23:59:59.960800      959
1999-12-31 23:59:59.961054      960
1999-12-31 23:59:59.962749      961
1999-12-31 23:59:59.965681      964
1999-12-31 23:59:59.966409      965
1999-12-31 23:59:59.966558      965
1999-12-31 23:59:59.967357      966
1999-12-31 23:59:59.967842      966
1999-12-31 23:59:59.970465      969
1999-12-31 23:59:59.974022      973
1999-12-31 23:59:59.974734      973
1999-12-31 23:59:59.975879      974
1999-12-31 23:59:59.978291      977
1999-12-31 23:59:59.980483      979
1999-12-31 23:59:59.980868      979
1999-12-31 23:59:59.981417      980
1999-12-31 23:59:59.984208      983
1999-12-31 23:59:59.984639      983
1999-12-31 23:59:59.985533      984
1999-12-31 23:59:59.986785      985
1999-12-31 23:59:59.987502      986
1999-12-31 23:59:59.987914      987
1999-12-31 23:59:59.988406      987
1999-12-31 23:59:59.989436      988
1999-12-31 23:59:59.994449      993
1999-12-31 23:59:59.996657      995
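
In recent pandas versions `astype('timedelta64[ms]')` no longer yields integer milliseconds; a version-independent way to compute the same group ids (a sketch, not from the original answer) is integer floor division by a 1 ms `Timedelta`:

```python
import pandas as pd

idx = pd.to_datetime([
    '1999-12-31 23:59:59.000884',
    '1999-12-31 23:59:59.002478',
    '1999-12-31 23:59:59.005001',
])
df = pd.DataFrame(index=idx)
# Whole milliseconds elapsed since the first timestamp.
df['groupid'] = (df.index - df.index[0]) // pd.Timedelta(milliseconds=1)
# groupid: 0, 1, 4
```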

Answer 1 (score: 1)

This is like a resample operation.

Create the data:

In [39]: pd.set_option('max_rows',12)

In [40]: np.random.seed(11111)

In [41]: dts = sorted([datetime(1999, 12, 31, 23, 59, 59, x) for
              x in np.random.randint(1, 999999, 1000)])

In [42]: df = pd.DataFrame({'groupid': np.random.randn(len(dts))}, dts)

So simply grouping gives you the groups directly. You can iterate over this, as it works like a generator.

In [43]: list(df.groupby(pd.Grouper(freq='ms')))[0:3]
Out[43]: 
[(Timestamp('1999-12-31 23:59:59', offset='L'),
                               groupid
  1999-12-31 23:59:59.000789 -1.369503
  1999-12-31 23:59:59.000814  0.776049),
 (Timestamp('1999-12-31 23:59:59.001000', offset='L'),
                               groupid
  1999-12-31 23:59:59.001041 -0.374915
  1999-12-31 23:59:59.001062 -1.470845),
 (Timestamp('1999-12-31 23:59:59.002000', offset='L'),
                               groupid
  1999-12-31 23:59:59.002355 -0.240954)]

Resampling may be simpler. You can use a custom function for how.

In [44]: df.resample('ms', how='sum')
Out[44]: 
                          groupid
1999-12-31 23:59:59.000 -0.593454
1999-12-31 23:59:59.001 -1.845759
1999-12-31 23:59:59.002 -0.240954
1999-12-31 23:59:59.003  1.291403
1999-12-31 23:59:59.004       NaN
1999-12-31 23:59:59.005  0.291484
...                           ...
1999-12-31 23:59:59.994       NaN
1999-12-31 23:59:59.995       NaN
1999-12-31 23:59:59.996       NaN
1999-12-31 23:59:59.997 -0.445052
1999-12-31 23:59:59.998       NaN
1999-12-31 23:59:59.999 -0.895305

[1000 rows x 1 columns]
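
The `how=` keyword has since been removed from `resample`; in current pandas the same aggregation is written by chaining the method (`min_count=1` keeps empty bins as NaN, matching the output above):

```python
from datetime import datetime

import numpy as np
import pandas as pd

np.random.seed(11111)
dts = sorted(datetime(1999, 12, 31, 23, 59, 59, x)
             for x in np.random.randint(1, 999999, 1000))
df = pd.DataFrame({'groupid': np.random.randn(len(dts))}, index=dts)

# Sum values within each 1 ms bin; bins with no rows become NaN.
binned = df.resample('ms').sum(min_count=1)
```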