I have a pandas dataframe with the following columns:
col1, col2, _time
The _time column is a datetime object giving the time at which each row occurred.
I would like to resample my dataframe into 10-minute periods, grouped on the two columns, and count the rows of each group that fall within each 10-minute period. The resulting dataframe should have the following columns:
col1 col2 since until count
where since is the start of each 10-minute period, until is the end of each 10-minute period, and count is the number of matching rows found in the initial dataframe, e.g.
col1 col2 since until count
1 1 08/12/2017 12:00 08/12/2017 12:10 10
1 2 08/12/2017 12:00 08/12/2017 12:10 5
1 1 08/12/2017 12:10 08/12/2017 12:20 3
Is this possible with the dataframe's resample method?
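For reference, a frame of this shape can be built as below; the values are invented purely for illustration and are not the actual data.
import pandas as pd

# Invented sample data with the same columns as the frame described above
df = pd.DataFrame({
    'col1': [1, 1, 1, 2],
    'col2': [1, 2, 1, 1],
    '_time': pd.to_datetime([
        '2017-12-08 12:01:00',
        '2017-12-08 12:03:00',
        '2017-12-08 12:12:00',
        '2017-12-08 12:15:00',
    ]),
})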
Answer 0 (score: 1)
I had been considering resample before, but to no avail. Fortunately, I found a solution using pd.Series.dt.floor! .dt.floor aligns the timestamps to 10-minute intervals, and pd.to_timedelta computes the until column from the since column. For example,
import pandas as pd
interval = '10min'  # 10-minute intervals, please
# Dummy data with 3-minute intervals
data = pd.DataFrame({
'col1': [0, 0, 1, 0, 0, 0, 1, 0, 1, 1],
'col2': [4, 4, 4, 3, 4, 4, 3, 3, 4, 4],
'_time': pd.date_range(start='2010-01-01 00:01:00', freq='3min', periods=10),
})
# Floor the timestamps to your desired interval
since = data['_time'].dt.floor(interval).rename('since')
# Get the size of each group - groups are in the index of `agg`
agg = data.groupby(['col1', 'col2', since]).size()
agg = agg.rename('count')
# Back to dataframe
agg = agg.reset_index()
# Simply add your interval to `since`
agg['until'] = agg['since'] + pd.to_timedelta(interval)
print(agg)
   col1  col2               since  count               until
0     0     3 2010-01-01 00:10:00      1 2010-01-01 00:20:00
1     0     3 2010-01-01 00:20:00      1 2010-01-01 00:30:00
2     0     4 2010-01-01 00:00:00      2 2010-01-01 00:10:00
3     0     4 2010-01-01 00:10:00      2 2010-01-01 00:20:00
4     1     3 2010-01-01 00:10:00      1 2010-01-01 00:20:00
5     1     4 2010-01-01 00:00:00      1 2010-01-01 00:10:00
6     1     4 2010-01-01 00:20:00      2 2010-01-01 00:30:00
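Incidentally, the same grouping can also be expressed without flooring the timestamps by hand, using pd.Grouper inside groupby. This is only a sketch of an alternative, reusing the data and interval variables from the snippet above:
# Alternative sketch: let pd.Grouper do the 10-minute binning inside groupby,
# reusing `data` and `interval` defined above.
agg2 = (data
        .groupby(['col1', 'col2', pd.Grouper(key='_time', freq=interval)])
        .size()
        .rename('count')
        .reset_index()
        .rename(columns={'_time': 'since'}))
agg2['until'] = agg2['since'] + pd.to_timedelta(interval)
print(agg2)
Depending on the pandas version, this form may also emit empty intermediate 10-minute bins with a count of 0, which can be filtered out if unwanted.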
Answer 1 (score: 0)
If you are still looking for an answer, this example might help.
import pandas as pd
import numpy as np
import datetime
# create some random data
df = pd.DataFrame(columns=["col1","col2","timestamp"])
df.col1 = np.random.randint(100, size = 10)
df.col2 = np.random.randint(100, size = 10)
df.timestamp = [datetime.datetime(2000,1,1) + \
datetime.timedelta(hours=int(i)) for i in np.random.randint(100, size = 10)]
# sort data by timestamp and reset index
df = df.sort_values(by="timestamp").reset_index(drop=True)
# create the bins from the first and last timestamps with a 6-hour frequency
bins = pd.date_range(start=df.timestamp.values[0], end=df.timestamp.values[-1], freq="6h")  # change to a reasonable freq (e.g. D, h, min, s)
# zip consecutive bin edges into (start, end) pairs
startend = list(zip(bins, bins.shift(1)))
# define a function that finds bin index
def time_in_range(x):
    """Return the index of the [start, end] bin that contains x."""
    for ind, (start, end) in enumerate(startend):
        if start <= x <= end:
            return ind
# Add the bin index in a column named 'index'
df['index'] = df.timestamp.apply(time_in_range)
# groupby the bin index to get sum and count per bin, then flatten the column MultiIndex
df = df.groupby('index')[["col1", "col2"]].agg(['sum', 'count'])
df.columns = ['col1_sum', 'col1_count', 'col2_sum', 'col2_count']
df = df.reset_index()
# Create the output frame df2, one row per bin
df2 = pd.DataFrame(startend, columns=["start", "end"]).reset_index()
# Join the two dataframes on the 'index' column
df3 = pd.merge(df2, df, how='outer', on='index').fillna(0)
# Final adjustments: rename the aggregated columns and drop the helpers
df3.columns = ["index", "start", "end", "col1", "delete", "col2", "count"]
df3.drop(['delete', 'index'], axis=1, inplace=True)
Output:
                  start                 end   col1   col2  count
0   2000-01-01 21:00:00 2000-01-02 03:00:00   89.0  136.0    2.0
1   2000-01-02 03:00:00 2000-01-02 09:00:00    0.0    0.0    0.0
2   2000-01-02 09:00:00 2000-01-02 15:00:00   69.0   27.0    1.0
3   2000-01-02 15:00:00 2000-01-02 21:00:00    0.0    0.0    0.0
4   2000-01-02 21:00:00 2000-01-03 03:00:00    0.0    0.0    0.0
5   2000-01-03 03:00:00 2000-01-03 09:00:00    0.0    0.0    0.0
6   2000-01-03 09:00:00 2000-01-03 15:00:00  108.0   57.0    2.0
7   2000-01-03 15:00:00 2000-01-03 21:00:00   35.0   85.0    2.0
8   2000-01-03 21:00:00 2000-01-04 03:00:00  102.0   92.0    2.0
9   2000-01-04 03:00:00 2000-01-04 09:00:00    0.0    0.0    0.0
10  2000-01-04 09:00:00 2000-01-04 15:00:00    0.0    0.0    0.0
11  2000-01-04 15:00:00 2000-01-04 21:00:00   91.0    3.0    1.0
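As a side note, the bin lookup done by time_in_range can also be expressed with pd.cut. This is only a sketch, reusing the bins and the pre-grouped df from above; the edge handling (right-closed bins) differs slightly from the loop version:
# Alternative sketch: assign bin indices with pd.cut instead of the apply loop.
# labels=False returns the integer bin index; include_lowest=True keeps the
# very first timestamp, which would otherwise fall outside the right-closed bins.
df['index'] = pd.cut(df.timestamp, bins=bins, labels=False, include_lowest=True)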