Pandas resample and interpolate functions are too slow

Time: 2018-08-30 08:18:37

Tags: python pandas performance

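I have a use case where I resample a small DataFrame created from a list of 10 JSON objects. The DataFrame has 10 rows and 50 columns, with about 20% of the fields missing. After resampling, I interpolate the DataFrame column by column, so that a user-defined interpolation method can be chosen for each column. The code for this is as follows: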

import pandas as pd

df = pd.DataFrame(packets_dict)
# resample() needs a DatetimeIndex, so parse the timestamp strings first.
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')
df = df.resample('60S').first()
for column in columns_rule:
    if column in df.columns:
        # replace value by null if it is out of given min and max values.
        if 'max_value' in columns_rule[column].keys():
            df[column] = df[column].where(df[column] < columns_rule[column]['max_value'])
        if 'min_value' in columns_rule[column].keys():
            df[column] = df[column].where(df[column] > columns_rule[column]['min_value'])
        df[column] = df[column].interpolate(method='linear', limit=3)

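I need to run this code on high-rate streaming data, but it takes too long to execute. I profiled the code by running it 990 times, which gave the following results.

resample takes 137.347 seconds over 990 calls, or 138.79 ms per call.

where takes 75.272 seconds over 87120 calls, or 0.864 ms per call.

interpolate takes 21.928 seconds over 43560 calls, or 0.503 ms per call.

The rest of the code is fast and takes very little time. The total time these functions need for 990 iterations is 234.5 seconds, almost 4 minutes, which is far more than our requirement allows. I need to optimize this code to cut the time by a factor of 20 to 30. Is there any way to optimize these pandas functions, or am I doing something wrong in the way I use them?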
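As a quick sanity check on the figures above, the arithmetic can be reproduced in a few lines. This is only a sketch that reuses the numbers from the profile; the variable names are made up for illustration. It shows that each iteration makes 88 where calls and 44 interpolate calls, which is what the per-column loop would be expected to produce.

```python
# Figures copied from the profile above: (total seconds, number of calls).
profile = {
    "resample": (137.347, 990),
    "where": (75.272, 87120),
    "interpolate": (21.928, 43560),
}
iterations = 990

# Total time across the three functions, and the average cost of one iteration.
total_s = sum(seconds for seconds, _ in profile.values())
per_iter_ms = total_s / iterations * 1000

# How many times each function is called within a single iteration.
calls_per_iter = {name: calls // iterations for name, (_, calls) in profile.items()}

# Per-iteration budget implied by a 30x and a 20x speed-up.
target_ms = (per_iter_ms / 30, per_iter_ms / 20)

print(f"total: {total_s:.3f} s, per iteration: {per_iter_ms:.1f} ms")
print(f"calls per iteration: {calls_per_iter}")
print(f"target per iteration: {target_ms[0]:.1f} to {target_ms[1]:.1f} ms")
```

So the requested speed-up corresponds to a budget of roughly 8 to 12 ms per iteration, and resample alone currently exceeds that by more than a factor of ten.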

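I am using Pandas 0.23.0 with python3.

I have searched but could not find any solution. Please help me with your comments and suggestions.

If anyone wants to try it, here is some sample data. It is not the actual data, because that cannot be shared, but the data I am providing is representative in terms of floats, integers and strings, and also in the number of columns and rows. Also, for string columns I use ffill or bfill instead of the interpolate function.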
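The handling of string columns is not shown in the code above, so the following is only a guess at what it might look like. It is a minimal sketch on a hypothetical two-column frame: non-numeric columns are forward-filled in one call, and the limit of 3 is an assumption borrowed from the interpolate call, not something stated in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame: one numeric and one string column.
idx = pd.date_range("2018-08-01 22:05:00", periods=5, freq="min")
df = pd.DataFrame(
    {
        "field1": [12.4, np.nan, 12.4, np.nan, 12.4],
        "field2": ["aab", None, None, "aac", None],
    },
    index=idx,
)

# Treat every non-numeric column as a string column.
string_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]

# Forward-fill all string columns at once; limit=3 is an assumed value.
df[string_cols] = df[string_cols].ffill(limit=3)

print(df)
```

Swapping ffill for bfill gives the backward-filling variant. Doing this once for the whole list of string columns avoids another per-column loop.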

packets_dict = [
{'datetime':"2018-08-01 22:05:40",'field1':12.4,'field2':'aab','field3':1234,'field4':12.4,'field5':'aab','field6':1234,'field7':12.4,'field8':'aab','field9':1234,'field10':12.4,'field11':'aab','field12':1234,'field13':12.4,'field14':'aab','field15':1234,'field16':12.4,'field17':'aab','field18':1234,'field19':12.4,'field20':'aab','field21':1234,'field22':12.4,'field23':'aab','field24':1234,'field25':12.4,'field26':'aab','field27':1234,'field28':12.4,'field29':'aab','field30':1234,'field31':12.4,'field32':'aab','field33':1234,'field34':12.4,'field35':'aab','field36':1234,'field37':12.4,'field38':'aab','field39':1234,'field40':12.4,'field41':'aab','field42':1234,'field43':12.4,'field44':'aab','field45':1234,'field46':12.4,'field47':'aab','field48':1234,'field49':12.4},
{'datetime':"2018-08-01 22:06:41",'field10':12.4,'field11':'aab','field12':1234,'field13':12.4,'field14':'aab','field15':1234,'field16':12.4,'field17':'aab','field18':1234,'field19':12.4,'field20':'aab','field21':1234,'field22':12.4,'field23':'aab','field24':1234,'field25':12.4,'field26':'aab','field27':1234,'field28':12.4,'field29':'aab','field30':1234,'field31':12.4,'field32':'aab','field33':1234,'field34':12.4,'field35':'aab','field36':1234,'field37':12.4,'field38':'aab','field39':1234,'field40':12.4,'field41':'aab','field42':1234,'field43':12.4,'field44':'aab','field45':1234,'field46':12.4,'field47':'aab','field48':1234,'field49':12.4},
{'datetime':"2018-08-01 22:07:42",'field1':12.4,'field2':'aab','field3':1234,'field4':12.4,'field5':'aab','field6':1234,'field7':12.4,'field8':'aab','field9':1234,'field10':12.4,'field11':'aab','field12':1234,'field13':12.4,'field14':'aab','field15':1234,'field16':12.4,'field17':'aab','field18':1234,'field19':12.4,'field20':'aab','field21':1234,'field22':12.4,'field23':'aab','field24':1234,'field25':12.4,'field26':'aab','field27':1234,'field28':12.4,'field29':'aab','field30':1234,'field31':12.4,'field32':'aab','field33':1234,'field34':12.4,'field35':'aab','field36':1234,'field37':12.4,'field38':'aab','field39':1234,'field40':12.4,'field41':'aab','field42':1234,'field43':12.4,'field44':'aab','field45':1234,'field46':12.4,'field47':'aab','field48':1234,'field49':12.4},
{'datetime':"2018-08-01 22:08:44",'field1':12.4,'field2':'aab','field3':1234,'field4':12.4,'field5':'aab','field6':1234,'field7':12.4,'field8':'aab','field21':1234,'field22':12.4,'field23':'aab','field24':1234,'field25':12.4,'field26':'aab','field27':1234,'field28':12.4,'field29':'aab','field30':1234,'field31':12.4,'field32':'aab','field33':1234,'field34':12.4,'field35':'aab','field36':1234,'field37':12.4,'field38':'aab','field39':1234,'field40':12.4,'field41':'aab','field42':1234,'field43':12.4,'field44':'aab','field45':1234,'field46':12.4,'field47':'aab','field48':1234,'field49':12.4},
{'datetime':"2018-08-01 22:09:46",'field1':12.4,'field2':'aab','field3':1234,'field4':12.4,'field5':'aab','field6':1234,'field7':12.4,'field8':'aab','field9':1234,'field10':12.4,'field11':'aab','field12':1234,'field13':12.4,'field14':'aab','field15':1234,'field16':12.4,'field17':'aab','field18':1234,'field25':12.4,'field26':'aab','field27':1234,'field28':12.4,'field29':'aab','field30':1234,'field31':12.4,'field32':'aab','field33':1234,'field34':12.4,'field35':'aab','field36':1234,'field37':12.4,'field38':'aab','field39':1234,'field40':12.4,'field41':'aab','field42':1234,'field43':12.4,'field44':'aab','field45':1234,'field46':12.4,'field47':'aab','field48':1234,'field49':12.4},
{'datetime':"2018-08-01 22:10:49",'field1':12.4,'field2':'aab','field3':1234,'field4':12.4,'field5':'aab','field6':1234,'field7':12.4,'field8':'aab','field9':1234,'field10':12.4,'field11':'aab','field12':1234,'field13':12.4,'field14':'aab','field15':1234,'field16':12.4,'field17':'aab','field18':1234,'field19':12.4,'field20':'aab','field21':1234,'field22':12.4,'field23':'aab','field24':1234,'field25':12.4,'field26':'aab','field27':1234,'field28':12.4,'field41':'aab','field42':1234,'field43':12.4,'field44':'aab','field45':1234,'field46':12.4,'field47':'aab','field48':1234,'field49':12.4},
{'datetime':"2018-08-01 22:11:50",'field1':12.4,'field2':'aab','field3':1234,'field4':12.4,'field5':'aab','field6':1234,'field7':12.4,'field8':'aab','field9':1234,'field10':12.4,'field11':'aab','field12':1234,'field13':12.4,'field14':'aab','field15':1234,'field16':12.4,'field17':'aab','field18':1234,'field19':12.4,'field20':'aab','field21':1234,'field22':12.4,'field23':'aab','field24':1234,'field25':12.4,'field26':'aab','field27':1234,'field28':12.4,'field29':'aab','field30':1234,'field31':12.4,'field32':'aab','field33':1234,'field34':12.4,'field35':'aab','field36':1234,'field37':12.4,'field48':1234,'field49':12.4},
{'datetime':"2018-08-01 22:12:54",'field1':12.4,'field2':'aab','field3':1234,'field4':12.4,'field5':'aab','field6':1234,'field7':12.4,'field8':'aab','field9':1234,'field10':12.4,'field11':'aab','field12':1234,'field13':12.4,'field14':'aab','field15':1234,'field16':12.4,'field17':'aab','field18':1234,'field19':12.4,'field20':'aab','field21':1234,'field22':12.4,'field23':'aab','field24':1234,'field25':12.4,'field26':'aab','field27':1234,'field28':12.4,'field38':'aab','field39':1234,'field40':12.4,'field41':'aab','field42':1234,'field43':12.4,'field44':'aab','field45':1234,'field46':12.4,'field47':'aab','field48':1234,'field49':12.4},
{'datetime':"2018-08-01 22:15:55",'field1':12.4,'field2':'aab','field3':1234,'field4':12.4,'field5':'aab','field6':1234,'field7':12.4,'field8':'aab','field9':1234,'field10':12.4,'field11':'aab','field12':1234,'field13':12.4,'field14':'aab','field15':1234,'field16':12.4,'field17':'aab','field18':1234,'field19':12.4,'field20':'aab','field21':1234,'field22':12.4,'field23':'aab','field24':1234,'field25':12.4,'field26':'aab','field27':1234,'field28':12.4,'field29':'aab','field30':1234,'field31':12.4,'field32':'aab','field33':1234,'field34':12.4,'field35':'aab','field36':1234,'field37':12.4,'field38':'aab','field39':1234,'field40':12.4,'field41':'aab','field42':1234,'field43':12.4,'field44':'aab','field45':1234,'field46':12.4,'field47':'aab','field48':1234,'field49':12.4},
{'datetime':"2018-08-01 22:16:01",'field1':12.4,'field2':'aab','field3':1234,'field4':12.4,'field5':'aab','field6':1234,'field7':12.4,'field8':'aab','field28':12.4,'field29':'aab','field30':1234,'field31':12.4,'field32':'aab','field33':1234,'field34':12.4,'field35':'aab','field36':1234,'field37':12.4,'field38':'aab','field39':1234,'field40':12.4,'field41':'aab','field42':1234,'field43':12.4,'field44':'aab','field45':1234,'field46':12.4,'field47':'aab','field48':1234,'field49':12.4}
]

columns_rule = {
'field1':{
    'max_value':999,
    'min_value':0
},
'field3':{
    'max_value':999,
    'min_value':0
},
'field4':{
    'max_value':999,
    'min_value':0
},
'field6':{
    'max_value':999,
    'min_value':0
},
'field7':{
    'max_value':999,
    'min_value':0
},
'field9':{
    'max_value':999,
    'min_value':0
},
'field10':{
    'max_value':999,
    'min_value':0
},
'field12':{
    'max_value':999,
    'min_value':0
},
'field13':{
    'max_value':999,
    'min_value':0
},
'field15':{
    'max_value':999,
    'min_value':0
},
'field16':{
    'max_value':999,
    'min_value':0
},
'field18':{
    'max_value':999,
    'min_value':0
},
'field19':{
    'max_value':999,
    'min_value':0
},
'field21':{
    'max_value':999,
    'min_value':0
},
'field22':{
    'max_value':999,
    'min_value':0
},
'field24':{
    'max_value':999,
    'min_value':0
},
'field25':{
    'max_value':999,
    'min_value':0
},
'field26':'aab',
'field27':{
    'max_value':999,
    'min_value':0
},
'field28':{
    'max_value':999,
    'min_value':0
},
'field30':{
    'max_value':999,
    'min_value':0
},
'field31':{
    'max_value':999,
    'min_value':0
},
'field33':{
    'max_value':999,
    'min_value':0
},
'field34':{
    'max_value':999,
    'min_value':0
},
'field36':{
    'max_value':999,
    'min_value':0
},
'field37':{
    'max_value':999,
    'min_value':0
},
'field39':{
    'max_value':999,
    'min_value':0
},
'field40':{
    'max_value':999,
    'min_value':0
},
'field42':{
    'max_value':999,
    'min_value':0
},
'field43':{
    'max_value':999,
    'min_value':0
},
'field45':{
    'max_value':999,
    'min_value':0
},
'field46':{
    'max_value':999,
    'min_value':0
},
'field48':{
    'max_value':999,
    'min_value':0
},
'field49':{
    'max_value':999,
    'min_value':0
}
}

1 Answer:

Answer 0: (score: 1)

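The idea is to use the loop only to extract the values from the dict, then process all matching columns of the DataFrame in one go, and to use groupby with Grouper instead of resample:

import numpy as np
import pandas as pd

# Assumes df['datetime'] already holds datetime values.
df = df.set_index('datetime').groupby(pd.Grouper(freq='60S')).first()

# Collect the limits and the names of the columns they apply to.
dmin, dmax = {}, {}
cmin, cmax = [], []
for column in columns_rule:
    if column in df.columns:
        if 'max_value' in columns_rule[column]:
            dmax[column] = columns_rule[column]['max_value']
            cmax.append(column)
        if 'min_value' in columns_rule[column]:
            dmin[column] = columns_rule[column]['min_value']
            cmin.append(column)

# Build boolean masks for all columns at once.
m1 = df[cmax].lt(pd.Series(dmax))
m2 = df[cmin].gt(pd.Series(dmin))
cols = np.union1d(cmin, cmax)

# Replace out-of-range values by NaN, then interpolate every column together.
df[cmax] = np.where(m1, df[cmax], np.nan)
df[cmin] = np.where(m2, df[cmin], np.nan)
df[cols] = df[cols].interpolate(method='linear', limit=3)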
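To check that masking all columns at once gives the same result as the per-column loop from the question, here is a small self-contained sketch. The data, the rules dict and the two helper functions (per_column and vectorized) are made up for illustration, and the vectorized variant uses DataFrame.where rather than np.where, so it is a close relative of the code above and not a copy of it.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature data on a one-minute grid.
idx = pd.date_range("2018-08-01 22:05:00", periods=6, freq="min")
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, 1500.0, 5.0, 6.0],
        "b": [10.0, 20.0, -5.0, 40.0, np.nan, 60.0],
        "c": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    },
    index=idx,
)
rules = {
    "a": {"max_value": 999, "min_value": 0},
    "b": {"min_value": 0},
    "z": {"max_value": 1},  # not a column of df, so it is skipped
}

def per_column(frame, rules):
    """Loop from the question: one where/interpolate call per column."""
    out = frame.copy()
    for col, rule in rules.items():
        if col in out.columns:
            if "max_value" in rule:
                out[col] = out[col].where(out[col] < rule["max_value"])
            if "min_value" in rule:
                out[col] = out[col].where(out[col] > rule["min_value"])
            out[col] = out[col].interpolate(method="linear", limit=3)
    return out

def vectorized(frame, rules):
    """Mask and interpolate all matching columns in a few calls."""
    out = frame.copy()
    present = {c: r for c, r in rules.items() if c in out.columns}
    upper = {c: r["max_value"] for c, r in present.items() if "max_value" in r}
    lower = {c: r["min_value"] for c, r in present.items() if "min_value" in r}
    if upper:
        cols = list(upper)
        out[cols] = out[cols].where(out[cols].lt(pd.Series(upper)))
    if lower:
        cols = list(lower)
        out[cols] = out[cols].where(out[cols].gt(pd.Series(lower)))
    cols = list(present)
    out[cols] = out[cols].interpolate(method="linear", limit=3)
    return out

loop_result = per_column(df, rules)
vec_result = vectorized(df, rules)
pd.testing.assert_frame_equal(loop_result, vec_result)
print(vec_result)
```

The out-of-range values 1500.0 and -5.0 are turned into NaN and then filled by linear interpolation, exactly as in the loop. Whether this reaches the 20x to 30x target depends on how much of the time is spent in resample, so it is worth timing both variants on real packets.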
