Masking on multiple variables with a pandas MultiIndex

Time: 2018-03-07 20:01:07

Tags: python pandas indexing

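I'm having a hard time masking my dataframe the way I need to. The dataframe describes products, where a single product can come in several formats or languages. It looks like this: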

import pandas as pd
from numpy.random import choice

prods = [1234,1234,1234,1234,12344,12344,12344,12344,3462,3462,3462,3462,12314,12314,12314,12314,12857,12857,12857,12857]
formats = choice(['Hrd','Elc','Sft'],size=20)
language = choice(['Eng','Spa','Jpn','Chn','Port','Fnch','Rus'],size=20)
restricted = choice(range(5,9),size=20)
df = pd.DataFrame({'products': prods,'formats':formats,'language': language, 'restricted': restricted})
df['instances'] = df['products'].astype(str) + '-' + df['formats'] + '-' + df['language']
md = pd.MultiIndex.from_tuples(list(zip(df['products'],df['instances'])))
df = df.set_index(md)  # set_index returns a new frame, so assign the result

df
Out[1]:
                     formats language  products  restricted       instances
1234  1234-Sft-Port      Sft     Port      1234           5   1234-Sft-Port
      1234-Elc-Jpn       Elc      Jpn      1234           7    1234-Elc-Jpn
      1234-Hrd-Jpn       Hrd      Jpn      1234           7    1234-Hrd-Jpn
      1234-Hrd-Chn       Hrd      Chn      1234           5    1234-Hrd-Chn
12344 12344-Sft-Chn      Sft      Chn     12344           5   12344-Sft-Chn
      12344-Hrd-Spa      Hrd      Spa     12344           7   12344-Hrd-Spa
      12344-Elc-Jpn      Elc      Jpn     12344           6   12344-Elc-Jpn
      12344-Sft-Port     Sft     Port     12344           5  12344-Sft-Port
3462  3462-Hrd-Jpn       Hrd      Jpn      3462           5    3462-Hrd-Jpn
      3462-Hrd-Jpn       Hrd      Jpn      3462           7    3462-Hrd-Jpn
      3462-Sft-Port      Sft     Port      3462           6   3462-Sft-Port
      3462-Elc-Jpn       Elc      Jpn      3462           7    3462-Elc-Jpn
12314 12314-Sft-Rus      Sft      Rus     12314           5   12314-Sft-Rus
      12314-Elc-Spa      Elc      Spa     12314           5   12314-Elc-Spa
      12314-Hrd-Port     Hrd     Port     12314           7  12314-Hrd-Port
      12314-Elc-Port     Elc     Port     12314           7  12314-Elc-Port
12857 12857-Elc-Jpn      Elc      Jpn     12857           8   12857-Elc-Jpn
      12857-Elc-Spa      Elc      Spa     12857           5   12857-Elc-Spa
      12857-Hrd-Chn      Hrd      Chn     12857           5   12857-Hrd-Chn
      12857-Sft-Port     Sft     Port     12857           7  12857-Sft-Port

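How can I mask or index on multiple variables? I want to specify something like "select the products whose electronic format is in Spanish and where one of the formats is a hardcover in Russian". I can't simply mask the dataframe with, for example, df[(df['language'] == 'Spa') & (df['formats'] == 'Elc')], because that only keeps the matching rows and does not check whether the same product also has a hardcover format.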

I have used a complicated groupby lambda function, but it is very slow on large dataframes (mine has over 200,000 rows):

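# For each product: is there an electronic Spanish row and a Russian hardcover row?
mask = df.groupby('products')
mask.apply(lambda x:
    'Elc' in x[x['language']=='Spa']['formats'].values and
    'Hrd' in x[x['language']=='Rus']['formats'].values
    )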

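I have looked into df.query() and a few other methods and functions, but I can't find a way to work with the dataframe that groups by product the way I need. Is there a better approach?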

1 answer:

Answer 0: (score: 0)

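Since you did not use a deterministic random seed, I can't reproduce your exact results, but I can index on the union of two intersections using the 'or' operator |: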

import pandas as pd
from numpy.random import RandomState

rand = RandomState(4321)
prods = [1234,1234,1234,1234,12344,12344,12344,12344,3462,3462,3462,3462,12314,12314,12314,12314,12857,12857,12857,12857]
formats = rand.choice(['Hrd','Elc','Sft'],size=20)
language = rand.choice(['Eng','Spa','Jpn','Chn','Port','Fnch','Rus'],size=20)
restricted = rand.choice(range(5,9),size=20)
df = pd.DataFrame({'products': prods,'formats':formats,'language': language, 'restricted': restricted})
df['instances'] = df['products'].astype(str) + '-' + df['formats'] + '-' + df['language']
md = pd.MultiIndex.from_tuples(list(zip(df['products'],df['instances'])))
df2 = df.set_index(md)
df2
Out[1]:
                      products formats language  restricted       instances
1234  1234-Elc-Spa        1234     Elc      Spa           8    1234-Elc-Spa
      1234-Sft-Rus        1234     Sft      Rus           8    1234-Sft-Rus
      1234-Hrd-Spa        1234     Hrd      Spa           7    1234-Hrd-Spa
      1234-Sft-Spa        1234     Sft      Spa           7    1234-Sft-Spa
12344 12344-Hrd-Spa      12344     Hrd      Spa           8   12344-Hrd-Spa
      12344-Sft-Rus      12344     Sft      Rus           5   12344-Sft-Rus
      12344-Elc-Fnch     12344     Elc     Fnch           7  12344-Elc-Fnch
      12344-Elc-Spa      12344     Elc      Spa           6   12344-Elc-Spa
3462  3462-Elc-Fnch       3462     Elc     Fnch           8   3462-Elc-Fnch
      3462-Sft-Jpn        3462     Sft      Jpn           6    3462-Sft-Jpn
      3462-Hrd-Port       3462     Hrd     Port           6   3462-Hrd-Port
      3462-Sft-Eng        3462     Sft      Eng           8    3462-Sft-Eng
12314 12314-Elc-Spa      12314     Elc      Spa           7   12314-Elc-Spa
      12314-Hrd-Spa      12314     Hrd      Spa           7   12314-Hrd-Spa
      12314-Elc-Fnch     12314     Elc     Fnch           7  12314-Elc-Fnch
      12314-Hrd-Port     12314     Hrd     Port           5  12314-Hrd-Port
12857 12857-Hrd-Port     12857     Hrd     Port           7  12857-Hrd-Port
      12857-Sft-Rus      12857     Sft      Rus           5   12857-Sft-Rus
      12857-Elc-Rus      12857     Elc      Rus           6   12857-Elc-Rus
      12857-Elc-Jpn      12857     Elc      Jpn           8   12857-Elc-Jpn

There is surely a more elegant solution (or a shorter hack, such as concatenating the strings in products and formats and filtering the result), but this works:

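# Rows matching either condition: (Elc and Spa) or (Sft and Rus)
filter_df = df2[((df2.formats == 'Elc') & (df2.language == 'Spa')) | ((df2.formats == 'Sft') & (df2.language == 'Rus'))]
# Matching rows per product; count > 1 assumes at most one row per condition per product
filter_groups = filter_df.groupby(level=0)['products'].count()
filter_index = filter_groups[filter_groups > 1].index
# Keep every row of the products that passed the check
df3 = df2[df2.index.get_level_values(0).isin(filter_index)]
df3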

Out[3]:
                      products formats language  restricted       instances
1234  1234-Elc-Spa        1234     Elc      Spa           8    1234-Elc-Spa
      1234-Sft-Rus        1234     Sft      Rus           8    1234-Sft-Rus
      1234-Hrd-Spa        1234     Hrd      Spa           7    1234-Hrd-Spa
      1234-Sft-Spa        1234     Sft      Spa           7    1234-Sft-Spa
12344 12344-Hrd-Spa      12344     Hrd      Spa           8   12344-Hrd-Spa
      12344-Sft-Rus      12344     Sft      Rus           5   12344-Sft-Rus
      12344-Elc-Fnch     12344     Elc     Fnch           7  12344-Elc-Fnch
      12344-Elc-Spa      12344     Elc      Spa           6   12344-Elc-Spa
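
One caveat with counting matches: a product with two Elc/Spa rows and no Sft/Rus row would also reach a count of 2. A minimal sketch of an alternative that checks each condition separately per product, using groupby(...).transform('any'), might look like the following. The small frame and the names is_elc_spa, is_sft_rus and selected are illustrative assumptions, not part of the answer above.

```python
import pandas as pd

# Hand-made sample (hypothetical data, not the random frame from the answer).
df = pd.DataFrame({
    'products': [1, 1, 2, 2, 3, 3],
    'formats':  ['Elc', 'Sft', 'Elc', 'Elc', 'Sft', 'Hrd'],
    'language': ['Spa', 'Rus', 'Spa', 'Spa', 'Rus', 'Spa'],
})

# One boolean Series per row-level condition.
is_elc_spa = (df['formats'] == 'Elc') & (df['language'] == 'Spa')
is_sft_rus = (df['formats'] == 'Sft') & (df['language'] == 'Rus')

# "Does any row of this product match?" broadcast back onto every row.
has_elc_spa = is_elc_spa.groupby(df['products']).transform('any')
has_sft_rus = is_sft_rus.groupby(df['products']).transform('any')

# Keep all rows of the products that satisfy both conditions.
selected = df[has_elc_spa & has_sft_rus]
print(selected)
```

Here product 2 has two Elc/Spa rows but no Sft/Rus row, so it is dropped, whereas a count-based check would keep it. Because this avoids a Python-level lambda per group, it should also scale better than groupby.apply, though that has not been measured here.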

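You asked to select products; if you only want to narrow the result down to the matching product instances, you will need to filter these results once more.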
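
As a rough sketch of that second filtering step, the difference between "all rows of the qualifying products" and "only the matching instances" could be expressed like this. The sample frame and the names cond_a, cond_b, wanted, whole_products and instances are assumptions made for illustration.

```python
import pandas as pd

# Hand-made sample (hypothetical data).
df = pd.DataFrame({
    'products': [1234, 1234, 1234, 3462, 3462],
    'formats':  ['Elc', 'Sft', 'Hrd', 'Elc', 'Hrd'],
    'language': ['Spa', 'Rus', 'Spa', 'Spa', 'Port'],
})

# Row-level conditions.
cond_a = (df['formats'] == 'Elc') & (df['language'] == 'Spa')
cond_b = (df['formats'] == 'Sft') & (df['language'] == 'Rus')

# Products that have at least one row for each condition.
wanted = set(df.loc[cond_a, 'products']) & set(df.loc[cond_b, 'products'])
in_wanted = df['products'].isin(wanted)

# Every row of the qualifying products.
whole_products = df[in_wanted]

# Only the rows (instances) that themselves match one of the conditions.
instances = df[in_wanted & (cond_a | cond_b)]
print(instances)
```

Combining the product-level mask with the row-level mask in a single boolean expression avoids indexing an already filtered frame with a mask built on the original one.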