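Filtering a dask dataframe using a dictionary

Time: 2020-07-29 19:06:26

Tags: python filter dask

I am trying to filter a dask dataframe so that it only contains specific points in time defined by a dictionary, where the keys are ISO regions and the values are lists of timestamps.

Here is a shortened version of the dictionary as an example.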

iso_region_dict = {'MISO-E':[Timestamp('2016-05-17 22:15:00'),Timestamp('2016-10-21 13:45:00'),Timestamp('2016-12-26 02:45:00')], 'CAISO':[Timestamp('2016-08-24 10:15:00'),Timestamp('2016-07-03 14:30:00'),Timestamp('2016-04-22 12:45:00')]}

My dask dataframe looks like this (timeseries_ddf):

   building_id                 time  electricity_cooling_kwh  electricity_heating_kwh  total_site_electricity_kwh  iso_zone
0            2  2016-01-01 00:15:00                      0.0                      0.0                    4.082225    MISO-E
1            2  2016-05-17 22:15:00                      0.0                      0.0                    5.627103    MISO-E
2            2  2016-10-21 13:45:00                      0.0                      0.0                   21.547435    MISO-E
3            2  2016-12-26 02:45:00                      0.0                      0.0                    4.082225    MISO-E
4            2  2016-10-21 14:00:00                      0.0                      0.0                   21.547435    MISO-E

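The full dataframe has thousands of building IDs, and the time column is in datetime format, covering 2016-1-1 through 2016-12-31 at 15-minute intervals for each building_id. I want to filter this dataframe so that, for each building_id, it keeps only the rows whose time value is one of the timestamps listed in iso_region_dict for that row's iso_zone. It is a very large dataframe, which is why I am using dask.

Desired output (timeseries_discharge_ddf):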

   building_id                 time  electricity_cooling_kwh  electricity_heating_kwh  total_site_electricity_kwh  iso_zone
0            2  2016-05-17 22:15:00                      0.0                      0.0                    5.627103    MISO-E
1            2  2016-10-21 13:45:00                      0.0                      0.0                   21.547435    MISO-E
2            2  2016-12-26 02:45:00                      0.0                      0.0                    4.082225    MISO-E

I have already done something similar with a single list of timestamps:

timeseries_discharge_ddf = timeseries_ddf.map_partitions(lambda x: x[x.time.isin(discharge_timestamps)])

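What I am trying to do now is apply the same kind of filter, but with the discharge_timestamps list changing depending on the row's iso_zone.

1 Answer:

Answer 0: (score: 0)

I think it is easier to use a merge or join here.

Data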

import pandas as pd
import dask.dataframe as dd

diz_df = {'building_id': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2},
 'time': {0: '2016-01-01 00:15:00',
  1: '2016-05-17 22:15:00',
  2: '2016-10-21 13:45:00',
  3: '2016-12-26 02:45:00',
  4: '2016-10-21 14:00:00'},
 'electricity_cooling_kwh': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
 'electricity_heating_kwh': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
 'total_site_electricity_kwh': {0: 4.082225,
  1: 5.627103,
  2: 21.547435,
  3: 4.082225,
  4: 21.547435},
 'iso_zone': {0: 'MISO-E', 1: 'MISO-E', 2: 'MISO-E', 3: 'MISO-E', 4: 'MISO-E'}}

diz_filter = {'iso_zone': {0: 'MISO-E',
  1: 'MISO-E',
  2: 'MISO-E',
  3: 'CAISO',
  4: 'CAISO',
  5: 'CAISO'},
 'time': {0: '2016-05-17 22:15:00',
  1: '2016-10-21 13:45:00',
  2: '2016-12-26 02:45:00',
  3: '2016-08-24 10:15:00',
  4: '2016-07-03 14:30:00',
  5: '2016-04-22 12:45:00'}}

df = pd.DataFrame(diz_df)
df_filter = pd.DataFrame(diz_filter)
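# Convert the string columns to datetime64. pd.to_datetime is used here
# because newer pandas releases may reject the unit-less "M8" dtype in astype.
df["time"] = pd.to_datetime(df["time"])
df_filter["time"] = pd.to_datetime(df_filter["time"])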

Using pandas

df_out = pd.merge(df, df_filter, on=["time", "iso_zone"])
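The filter table in the Data section was typed by hand. As a minimal sketch, assuming the dictionary has the shape shown in the question (zone name mapped to a list of timestamps), the same table could be derived from iso_region_dict directly. The names filter_from_dict, sample and matched are made up for this illustration, and the small sample frame only stands in for the real data.

```python
import pandas as pd

# Dictionary in the shape used in the question: zone -> list of timestamps.
iso_region_dict = {
    "MISO-E": [pd.Timestamp("2016-05-17 22:15:00"),
               pd.Timestamp("2016-10-21 13:45:00"),
               pd.Timestamp("2016-12-26 02:45:00")],
    "CAISO": [pd.Timestamp("2016-08-24 10:15:00"),
              pd.Timestamp("2016-07-03 14:30:00"),
              pd.Timestamp("2016-04-22 12:45:00")],
}

# One row per (zone, timestamp) pair (hypothetical helper frame).
filter_from_dict = pd.DataFrame(
    [(zone, ts) for zone, stamps in iso_region_dict.items() for ts in stamps],
    columns=["iso_zone", "time"],
)
# A repeated timestamp within a zone would duplicate rows in the merge,
# so drop duplicates defensively.
filter_from_dict = filter_from_dict.drop_duplicates()
# Use the same datetime resolution on both sides of the merge.
filter_from_dict["time"] = filter_from_dict["time"].astype("datetime64[ns]")

# Tiny stand-in for the real data (illustration only).
sample = pd.DataFrame({
    "building_id": [2, 2, 2],
    "time": pd.to_datetime(["2016-01-01 00:15:00",
                            "2016-05-17 22:15:00",
                            "2016-10-21 14:00:00"]).astype("datetime64[ns]"),
    "iso_zone": ["MISO-E", "MISO-E", "MISO-E"],
})

# Inner merge keeps only rows whose (time, iso_zone) pair is in the filter.
matched = pd.merge(sample, filter_from_dict, on=["time", "iso_zone"])
```

Because the merge is an inner join on both time and iso_zone, a timestamp listed only for CAISO does not keep a MISO-E row.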

Using dask

df = dd.from_pandas(df, npartitions=2)
# It doesn't matter if the second dataframe is pandas or dask
# df_filter = dd.from_pandas(df_filter, npartitions=2)

df_out = dd.merge(df, df_filter, on=["time", "iso_zone"])
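If you would rather stay close to the map_partitions filter from the question, one possible alternative to the merge is a partition-level function that builds a boolean mask zone by zone. This is only a sketch: keep_zone_timestamps and the sample data are hypothetical, and it assumes the time column is already datetime and that the dictionary is small enough to pass to every partition.

```python
import pandas as pd

def keep_zone_timestamps(pdf, zone_to_times):
    """Keep rows whose time is listed for the row's own iso_zone.

    pdf is a plain pandas DataFrame (one partition);
    zone_to_times maps zone name -> list of timestamps.
    """
    keep = pd.Series(False, index=pdf.index, dtype=bool)
    for zone, stamps in zone_to_times.items():
        # A row survives only if both its zone and its timestamp match.
        keep = keep | (pdf["iso_zone"].eq(zone) & pdf["time"].isin(stamps))
    return pdf[keep]

# Tiny stand-in partition (illustration only).
sample = pd.DataFrame({
    "building_id": [2, 2, 3],
    "time": pd.to_datetime(["2016-05-17 22:15:00",
                            "2016-08-24 10:15:00",
                            "2016-08-24 10:15:00"]),
    "iso_zone": ["MISO-E", "MISO-E", "CAISO"],
})
zone_to_times = {
    "MISO-E": [pd.Timestamp("2016-05-17 22:15:00")],
    "CAISO": [pd.Timestamp("2016-08-24 10:15:00")],
}

# Row 1 is dropped: its timestamp is listed for CAISO, not for MISO-E.
result = keep_zone_timestamps(sample, zone_to_times)
```

With dask, the same function could be applied per partition as timeseries_ddf.map_partitions(keep_zone_timestamps, iso_region_dict), since map_partitions forwards extra positional arguments to the function. The merge is likely the simpler choice when the filter table is large. The mask approach avoids a join when the dictionary has only a few zones.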