Pandas dataframe processing is very slow

Asked: 2018-03-19 10:42:18

Tags: python pandas

I am working with a dataset that contains a datetime column and a variable of interest. I want to group the data into 15-minute bins, so I wrote the code below: it computes lower and upper date bounds, builds a list of datetime objects at 15-minute intervals, then sums the variable of interest between each consecutive pair of datetimes and puts the sums into a new dataframe. But it runs extremely slowly (about five hours for 75,000 rows) and I can't figure out why. Can anyone point out what is wrong with the code?

If you want to test the code yourself, here is a small sample of the data.

import pandas as pd
from tqdm import tnrange  # progress-bar variant of range(); plain range() also works

def create_sales_with_intervals(df, tank_id_col='tank_id'):
    tank_id = df.iloc[0][tank_id_col]
    tank_dates = get_date_range(df)
    tank_sales = []

    for idx in tnrange(len(tank_dates) - 1):
        t1 = tank_dates[idx]
        t2 = tank_dates[idx + 1]

        sales = get_sales_between(df, t1, t2)

        row = {}
        row['start_date'] = t1
        row['end_date'] = t2
        row['total_sale'] = sales
        row['tank_id'] = tank_id
        tank_sales.append(row)

    return pd.DataFrame(tank_sales, columns=['tank_id', 'start_date', 'end_date', 'total_sale'])


def get_date_range(df_tank, date_col='date_time', freq='15MIN'):
    start_date = df_tank.iloc[0][date_col]
    end_date = df_tank.iloc[-1][date_col]

    lower_bound = find_interval(start_date, 'lower')
    upper_bound = find_interval(end_date, 'upper')

    start_date_rounded = round_time(start_date, lower_bound) # Rounds the minute portion down to the nearest interval boundary (0, 15, 30, 45)
    end_date_rounded = round_time(end_date, upper_bound) # Rounds the minute portion up to the nearest interval boundary (0, 15, 30, 45)

    tank_dates = pd.date_range(start_date_rounded, end_date_rounded, freq=freq)
    return tank_dates

def get_sales_between(df, t1, t2, date_col='date_time', sale_col='sold'):
    cond1 = df[df[date_col] > t1]
    cond2 = df[df[date_col] < t2]

    idx = cond1.index & cond2.index
    total_sale = df.loc[idx.values][sale_col].sum()
    return total_sale
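As an aside, `get_sales_between` builds two intermediate dataframes and intersects their indexes on every call. The same filter can be written as a single boolean mask, which does less work per interval (a sketch using the same column names; `get_sales_between_fast` is a hypothetical name):

```python
import pandas as pd

def get_sales_between_fast(df, t1, t2, date_col='date_time', sale_col='sold'):
    # One combined boolean mask instead of two filtered copies plus an index intersection
    mask = (df[date_col] > t1) & (df[date_col] < t2)
    return df.loc[mask, sale_col].sum()
```

This avoids the copies, but the loop over intervals still scans the whole dataframe once per interval, which is the real cause of the slowdown.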

1 Answer:

Answer 0 (score: 1)

If you have a DatetimeIndex, consider the following approach using the pd.DataFrame.resample() method:
# your sample dataframe
df = pd.DataFrame(
    {
        'date_time': {0: '2015-01-02 23:18:00',
                      1: '2015-01-03 01:00:00',
                      2: '2015-01-03 02:42:00',
                      3: '2015-01-03 04:24:00',
                      4: '2015-01-03 06:06:00',
                      5: '2015-01-03 07:48:00',
                      6: '2015-01-03 09:30:00',
                      7: '2015-01-03 11:12:00',
                      8: '2015-01-03 12:54:00',
                      9: '2015-01-03 14:36:00'},
         'sold': {0: 78.3,
                  1: 0.0,
                  2: 112.9,
                  3: 13.8,
                  4: 32.0,
                  5: 95.1,
                  6: 56.4,
                  7: 28.3,
                  8: 0.0,
                  9: 0.0},
         'tank_id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1}
    })


df
Out[3]: 
             date_time  sold  tank_id
0  2015-01-02 23:18:00  78.3        1
1  2015-01-03 01:00:00     0        1
2  2015-01-03 02:42:00 112.9        1
3  2015-01-03 04:24:00  13.8        1
4  2015-01-03 06:06:00    32        1
5  2015-01-03 07:48:00  95.1        1
6  2015-01-03 09:30:00  56.4        1
7  2015-01-03 11:12:00  28.3        1
8  2015-01-03 12:54:00     0        1
9  2015-01-03 14:36:00     0        1

# convert your timestamps to `pd.Timestamp` objects
df['date_time'] = pd.to_datetime(df['date_time'])

# give the dataframe a `DatetimeIndex`
df.set_index('date_time', inplace=True)

df
Out[6]: 
                     sold  tank_id
date_time                         
2015-01-02 23:18:00  78.3        1
2015-01-03 01:00:00     0        1
2015-01-03 02:42:00 112.9        1
2015-01-03 04:24:00  13.8        1
2015-01-03 06:06:00    32        1
2015-01-03 07:48:00  95.1        1
2015-01-03 09:30:00  56.4        1
2015-01-03 11:12:00  28.3        1
2015-01-03 12:54:00     0        1
2015-01-03 14:36:00     0        1

# resample the `sold` column in 15-minute chunks and then sum each chunk
df['sold'].resample('15T').sum()
Out[8]: 
date_time
2015-01-02 23:15:00    78.3
2015-01-02 23:30:00       0
2015-01-02 23:45:00       0
2015-01-03 00:00:00       0
2015-01-03 00:15:00       0
2015-01-03 00:30:00       0
2015-01-03 00:45:00       0
2015-01-03 01:00:00       0
2015-01-03 01:15:00       0
2015-01-03 01:30:00       0
2015-01-03 01:45:00       0
2015-01-03 02:00:00       0
2015-01-03 02:15:00       0
2015-01-03 02:30:00   112.9
2015-01-03 02:45:00       0
2015-01-03 03:00:00       0
2015-01-03 03:15:00       0
2015-01-03 03:30:00       0
2015-01-03 03:45:00       0
2015-01-03 04:00:00       0
2015-01-03 04:15:00    13.8
2015-01-03 04:30:00       0
2015-01-03 04:45:00       0
2015-01-03 05:00:00       0
2015-01-03 05:15:00       0
2015-01-03 05:30:00       0
2015-01-03 05:45:00       0
2015-01-03 06:00:00      32
2015-01-03 06:15:00       0
2015-01-03 06:30:00       0
                        ...
2015-01-03 07:15:00       0
2015-01-03 07:30:00       0
2015-01-03 07:45:00    95.1
2015-01-03 08:00:00       0
2015-01-03 08:15:00       0
2015-01-03 08:30:00       0
2015-01-03 08:45:00       0
2015-01-03 09:00:00       0
2015-01-03 09:15:00       0
2015-01-03 09:30:00    56.4
2015-01-03 09:45:00       0
2015-01-03 10:00:00       0
2015-01-03 10:15:00       0
2015-01-03 10:30:00       0
2015-01-03 10:45:00       0
2015-01-03 11:00:00    28.3
2015-01-03 11:15:00       0
2015-01-03 11:30:00       0
2015-01-03 11:45:00       0
2015-01-03 12:00:00       0
2015-01-03 12:15:00       0
2015-01-03 12:30:00       0
2015-01-03 12:45:00       0
2015-01-03 13:00:00       0
2015-01-03 13:15:00       0
2015-01-03 13:30:00       0
2015-01-03 13:45:00       0
2015-01-03 14:00:00       0
2015-01-03 14:15:00       0
2015-01-03 14:30:00       0
Freq: 15T, Name: sold, Length: 62, dtype: float64
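If the full dataset contains more than one `tank_id`, the same resampling can be done per tank in a single call by combining `groupby` with `resample` (a sketch on made-up sample rows; `'15min'` is the same frequency as the `'15T'` alias used above):

```python
import pandas as pd

df = pd.DataFrame({
    'date_time': pd.to_datetime(['2015-01-03 01:00:00', '2015-01-03 01:10:00',
                                 '2015-01-03 01:20:00', '2015-01-03 01:05:00']),
    'sold': [10.0, 5.0, 2.0, 7.0],
    'tank_id': [1, 1, 1, 2],
})

# sum sales in 15-minute bins, separately for each tank;
# reset_index() turns the (tank_id, date_time) MultiIndex back into flat columns
out = (df.set_index('date_time')
         .groupby('tank_id')['sold']
         .resample('15min')
         .sum()
         .reset_index())
```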

You can find more information in the pandas documentation here.
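As a side note, the same 15-minute binning can also be expressed without setting an index, via `pd.Grouper` keyed on the column directly (equivalent result, just a different spelling; sample rows are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'date_time': pd.to_datetime(['2015-01-02 23:18:00', '2015-01-02 23:20:00']),
    'sold': [78.3, 1.7],
})

# group rows into 15-minute bins keyed on the date_time column; '15min' == '15T'
binned = df.groupby(pd.Grouper(key='date_time', freq='15min'))['sold'].sum()
```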