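I need to subset DataFrame rows based on multiple conditions. Each condition is described by a group of columns. Say there are columns
size_10ml
size_20ml
size_30ml
and in each row exactly one of them contains 1
while the others contain 0.
So, to select items (rows) by size and brand, I would pass the lists ["size_10ml", "size_20ml"] and ["brand_A", "brand_E"]
as separate arguments to the following function: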
import logging

import pandas as pd


def any_of_intersect_columns(df, *column_lists):
    """ Choose rows ANDing multiple conditions. I.e. choose rows having a nonzero value in at least one of the columns
    in every set.
    column_lists : Each argument is an iterable: a list of column labels.
        A row meets a condition if any of the labeled columns from the current list is true.
        Then rows from each condition (list) are intersected.
    Return
    -----
    df : subset of df rows
    """
    by_row = df
    for columns in column_lists:
        # choose columns of interest
        try:
            by_col = df[columns]
            # keep rows evaluating True in at least one of the chosen columns
            by_row = by_row.loc[by_col.any(axis=1), :]
        except KeyError:
            logging.error("None of columns has labels {}".format(columns))
            by_row = pd.DataFrame()
            # stop here: an empty frame cannot be filtered by later conditions
            break
    # return all, if nothing fits conditions
    return by_row if by_row.shape[0] else df
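To pick a single item, this function is called several times with different "levels" of conditions, and there are many items, all drawn from one table. I need to optimize this approach because it is the performance bottleneck.
Example data and output: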
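For reference, the per-call work can be sketched as a single boolean mask instead of repeated `.loc` subsetting. This is only a minimal sketch under assumptions: the name `any_of_intersect_mask` is hypothetical, missing columns are not handled (a `KeyError` would propagate), and whether it is actually faster would have to be timed on the real table.

```python
import numpy as np
import pandas as pd


def any_of_intersect_mask(df, *column_lists):
    """Hypothetical variant: AND together one boolean mask per column list."""
    mask = np.ones(len(df), dtype=bool)
    for columns in column_lists:
        # True where at least one of the listed columns is nonzero
        mask &= df[list(columns)].to_numpy().any(axis=1)
    # Mirror the original fallback: return everything if nothing matches
    return df[mask] if mask.any() else df


df = pd.DataFrame({
    "size_10ml": [1, 0, 0],
    "size_20ml": [0, 1, 1],
    "brand_A": [1, 0, 1],
    "brand_E": [0, 1, 0],
    "property_1": [0, 1, 0],
})
result = any_of_intersect_mask(df, ["size_10ml", "size_20ml"], ["brand_A"])
print(result.index.tolist())  # [0, 2]
```

The frame is indexed only once, at the end, so no intermediate DataFrame is built per condition.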
>>> df
   size_10ml  size_20ml  brand_A  brand_E  property_1
0          1          0        1        0           0
1          0          1        0        1           1
2          0          1        1        0           0
>>> any_of_intersect_columns(df, ["size_10ml", "size_20ml"], ["brand_A"]).index.tolist()
[0, 2]
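Finally, the table could be refactored to hold string attribute values in columns instead of ones and zeros, but I think that would only slow things down.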
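To make that alternative concrete, here is a minimal sketch of what the string-valued layout might look like. Everything in it is an assumption for illustration: the column names `size` and `brand`, the categorical dtype, and the helper `select_by_values` are all hypothetical, and I have not measured whether it is slower or faster than the 0/1 layout.

```python
import pandas as pd

# Hypothetical refactored table: one categorical column per attribute
items = pd.DataFrame({
    "size": pd.Categorical(["10ml", "20ml", "20ml"]),
    "brand": pd.Categorical(["A", "E", "A"]),
    "property_1": [0, 1, 0],
})


def select_by_values(frame, **allowed):
    """Keep rows whose value in every named column is among the allowed ones."""
    mask = pd.Series(True, index=frame.index)
    for column, values in allowed.items():
        # isin plays the role of "any of these columns is 1"
        mask &= frame[column].isin(values)
    # Same fallback as the original function: return all if nothing matches
    return frame[mask] if mask.any() else frame


selected = select_by_values(items, size=["10ml", "20ml"], brand=["A"])
print(selected.index.tolist())  # [0, 2]
```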