Dask dataframe: groupby and apply a custom function

Time: 2020-08-03 05:21:06

Tags: python dask

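I have this function that I am trying to apply to a Dask dataframe for a cooling calculation that assumes a given storage capacity and rate limit. It takes a building's cooling usage in 15-minute time steps and returns how much of it a store with that capacity and rate can accommodate.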

    def cooling_kwh_by_case(row, storage_capacity, storage_rate):
        if ((row['daily_cooling_kwh'] <= storage_capacity/row['cop']) & (row['max_cooling_kw'] <= storage_rate/row['cop'])):
            return row['daily_cooling_kwh']
        elif ((row['daily_cooling_kwh'] <= storage_capacity/row['cop']) & (row['max_cooling_kw'] > storage_rate/row['cop'])):
            daily_groupby = net_load_w_times.groupby('index')['electricity_cooling_kwh'].apply(lambda x: sum(min(x,storage_rate/(4*row['cop']))))
            return daily_groupby.loc[(row.building_date)]
        else:
            n_largest = 1
            daily_groupby = net_load_w_times.groupby('index')['electricity_cooling_kwh'].apply(lambda x: x.nlargest(n_largest).sum())
            while ((daily_groupby.loc[(row.building_date)]) <= (storage_capacity/row['cop'])) & (n_largest < net_load_w_times.groupby('index')['electricity_cooling_kwh'].count()):
                n_largest += 1
                daily_groupby = net_load_w_times.groupby('index')['electricity_cooling_kwh'].apply(lambda x: x.nlargest(n_largest).sum())
            return min(storage_capacity/row['cop'],net_load_w_times.groupby('index')['electricity_cooling_kwh'].apply(lambda x: x.nlargest(n_largest-1).sum()).loc[(row.building_date)])

When I apply it, this is the error message I get.

        <ipython-input-22-88e243d194c6> in cooling_kwh_by_case()
         16         n_largest = 1
         17         daily_groupby = net_load_w_times.groupby('index')['electricity_cooling_kwh'].apply(lambda x: x.nlargest(n_largest).sum())
    ---> 18         while ((daily_groupby.loc[(row.building_date)]) <= (storage_capacity/row['cop'])) & (n_largest < net_load_w_times.groupby('index')['electricity_cooling_kwh'].count()):
         19             n_largest += 1
         20             daily_groupby = net_load_w_times.groupby('index')['electricity_cooling_kwh'].apply(lambda x: x.nlargest(n_largest).sum())

ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.

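I think the problem is in how I compute the value needed in the else branch, which is the case where the cooling kWh is greater than the storage_capacity parameter. To compute that value, I apply a function that checks whether the sum of the day's largest 15-minute cooling kWh values exceeds storage_capacity. I then return the sum of those largest values.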
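The calculation described above can be sketched in plain pandas, one day at a time. This is only an illustration under assumptions, not the code from the question: `largest_sum_within_capacity` and the frame `toy` are hypothetical names, the column names are borrowed from `net_load_w_times`, and the COP scaling is left out for brevity. The sketch sorts a day's values in descending order, takes the running total, and keeps the last total that does not exceed the capacity.

```python
import pandas as pd


def largest_sum_within_capacity(values: pd.Series, capacity: float) -> float:
    """Add up the largest values, biggest first, while the total stays <= capacity."""
    running_total = values.sort_values(ascending=False).cumsum()
    within = running_total[running_total <= capacity]
    # Assumption: if even the single largest value exceeds the capacity, return 0.0.
    return float(within.iloc[-1]) if len(within) > 0 else 0.0


# Hypothetical toy data that reuses the question's column names.
toy = pd.DataFrame(
    {
        "index": ["a", "a", "a", "a", "b", "b", "c", "c"],
        "electricity_cooling_kwh": [4.0, 3.0, 2.0, 1.0, 9.0, 1.0, 2.0, 1.5],
    }
)

# One value per day, computed in a single groupby pass.
daily_limited = toy.groupby("index")["electricity_cooling_kwh"].apply(
    lambda day: largest_sum_within_capacity(day, capacity=8.0)
)
print(daily_limited)
```

Computing this once per day and looking the value up afterwards avoids re-running the groupby inside a row-wise function.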

The dataframe I am trying to group inside the function is called net_load_w_times:

                          time  electricity_cooling_kwh  \
building_id                                                
2           2016-07-05 19:00:00                 0.050000   
2           2016-07-05 22:00:00                 3.200000   
2           2016-07-05 16:00:00                 5.779318   
2           2016-07-05 20:00:00                 1.888300   
2           2016-07-05 18:00:00                 7.490000  

             electricity_heating_kwh  total_site_electricity_kwh iso_zone  \
building_id                                                                 
2                           0.000000                   19.529506   MISO-E   
2                           0.045235                    6.310719   MISO-E   
2                           0.000000                   22.514705   MISO-E   
2                           0.018624                   13.474863   MISO-E   
2                           0.005464                   18.192927   MISO-E   

                    index        date  
building_id                            
2            2|2016-10-24  2016-10-24  
2            2|2016-03-05  2016-03-05  
2            2|2016-08-14  2016-08-14  
2            2|2016-03-05  2016-03-05  
2            2|2016-03-05  2016-03-05  

 

Desired output:

Given cooling_kwh_by_case(row, 8, 5), it should output:

7.717618, because that is the largest total of cooling kWh values that stays within the limit of 8.

1 answer:

Answer 0: (score: 0)

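Dask dataframes are lazy, so they do not work in Python control flow such as if-else statements or for loops. I suggest looking for a solution in the pandas API instead, for example the where method.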
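As a rough sketch of what a `where`-based version could look like, the three branches of the question's function can be written as a column-wise selection. The columns `rate_limited_kwh` and `capacity_limited_kwh` are hypothetical: they stand for the per-day results of the two non-trivial branches, assumed to have been computed beforehand (for example with a groupby) and joined onto the daily frame. The function name and the sample numbers are made up for illustration. The example runs on pandas; Dask's dataframe API also provides `where`, so the same expression may carry over, but that is not tested here.

```python
import pandas as pd


def cooling_kwh_by_case_vectorized(
    df: pd.DataFrame, storage_capacity: float, storage_rate: float
) -> pd.Series:
    """Choose one of three candidate columns per row without Python if/else."""
    fits_capacity = df["daily_cooling_kwh"] <= storage_capacity / df["cop"]
    fits_rate = df["max_cooling_kw"] <= storage_rate / df["cop"]

    # Rows within capacity use the rate-limited value, the rest the capacity-limited one.
    limited = df["rate_limited_kwh"].where(fits_capacity, df["capacity_limited_kwh"])
    # Rows within both limits keep the full daily cooling energy.
    return df["daily_cooling_kwh"].where(fits_capacity & fits_rate, limited)


# Hypothetical daily frame; the last two columns are assumed to be precomputed.
daily = pd.DataFrame(
    {
        "daily_cooling_kwh": [5.0, 6.0, 12.0, 5.0],
        "max_cooling_kw": [2.0, 7.0, 3.0, 2.0],
        "cop": [1.0, 1.0, 1.0, 2.0],
        "rate_limited_kwh": [4.0, 4.5, 9.0, 4.0],
        "capacity_limited_kwh": [3.0, 3.5, 7.5, 3.5],
    }
)

result = cooling_kwh_by_case_vectorized(daily, storage_capacity=8, storage_rate=5)
print(result.tolist())
```

Because every step is an element-wise comparison or selection, no row ever needs to be inspected inside an `if` or `while`, which is the pattern a lazy dataframe cannot evaluate.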