Optimizing pandas DataFrame subsetting performance

Time: 2018-03-28 09:53:18

Tags: pandas

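I need to subset DataFrame rows based on multiple conditions. Each condition is described by a group of columns. Say there are columns size_10ml, size_20ml and size_30ml, and each row has a 1 in exactly one of them and zeros in all the others. So, to select items (rows) by size and brand, I would pass [["size_10ml", "size_20ml"], ["brand_A", "brand_E"]] to the following function: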

from logging import error  # assumed source of error(); the original snippet does not show it

import pandas as pd


def any_of_intersect_columns(df, *column_lists):
    """ Choose rows by ANDing multiple conditions, i.e. choose rows having a nonzero value
    in at least one of the columns of every list.

    column_lists : Each argument is an iterable of column labels.
                   A row meets a condition if any of the labeled columns from the current list is true.
                   The rows selected by each condition (list) are then intersected.

    Returns
    -------
    df :  subset of df rows
    """

    by_row = df
    for columns in column_lists:
        # choose the columns of interest
        try:
            by_col = df[columns]
            # keep rows that evaluate to True in at least one of the chosen columns
            by_row = by_row.loc[by_col.any(axis=1), :]
        except KeyError:
            error("None of columns has labels {}".format(columns))
            by_row = pd.DataFrame()
    # return all rows if nothing fits the conditions
    return by_row if by_row.shape[0] else df

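The function is called several times, with different "levels" of conditions, to select a single item, and there are many items, all drawn from one table. I need to optimize this approach because it is the performance bottleneck.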
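One direction that might reduce the overhead is to build a single boolean mask across all column groups and index the frame only once, rather than slicing an intermediate DataFrame per group. The following is a minimal sketch, not a benchmarked solution. The name select_rows_by_groups is hypothetical, and the code assumes the indicator columns are numeric with no missing values (a NaN would count as nonzero here, unlike in DataFrame.any). Unlike the function above, it lets a KeyError for a missing label propagate to the caller.

```python
import numpy as np
import pandas as pd


def select_rows_by_groups(df, *column_lists):
    """Hypothetical variant: AND together one 'any column is nonzero' mask per group."""
    # Start with every row selected.
    mask = np.ones(len(df), dtype=bool)
    for columns in column_lists:
        # Pull the group's columns out as a plain NumPy array (assumes numeric, no NaN).
        values = df[list(columns)].to_numpy()
        # A row passes this group if at least one of its columns is nonzero.
        mask &= (values != 0).any(axis=1)
    # Index the frame once; fall back to the full frame if no row matches,
    # mirroring the behavior of the original function.
    return df[mask] if mask.any() else df
```

Called as select_rows_by_groups(df, ["size_10ml", "size_20ml"], ["brand_A"]) on the sample frame from this question, it should keep the rows with index 0 and 2. Whether it is actually faster would need to be measured on the real table, for example with timeit.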

Sample data and output:

>>> df
   size_10ml  size_20ml  brand_A  brand_E  property_1
0          1          0        1        0           0
1          0          1        0        1           1
2          0          1        1        0           0
>>> any_of_intersect_columns(df, *[["size_10ml", "size_20ml"], ["brand_A"]]).index.tolist()
[0, 2]

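Finally, the table could be refactored to hold string attribute values in columns instead of 1s and 0s, but I think that would only slow things down.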
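For comparison, the refactored layout could look roughly like the sketch below, using pandas categorical columns instead of plain strings. Everything here is an assumption for illustration: the column names size and brand, the helper select_by_values, and the three sample rows, which mirror the indicator table above. No timing has been done, so it says nothing about whether this layout is slower or faster.

```python
import pandas as pd

# Hypothetical value-based layout: one categorical column per attribute
# instead of one 0/1 indicator column per attribute value.
items = pd.DataFrame({
    "size": pd.Categorical(["10ml", "20ml", "20ml"]),
    "brand": pd.Categorical(["A", "E", "A"]),
})


def select_by_values(df, **allowed):
    """Keep rows whose value in each named column is one of the allowed values."""
    # Start with every row selected, then AND in one isin() mask per column.
    mask = pd.Series(True, index=df.index)
    for column, values in allowed.items():
        mask &= df[column].isin(values)
    return df[mask]


# Same query as the sample above: size is 10ml or 20ml, and brand is A.
selected = select_by_values(items, size=["10ml", "20ml"], brand=["A"])
```

Each condition becomes a single isin() call on one column, so the number of masks depends on the number of attributes rather than on the number of indicator columns.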

0 answers:

No answers yet