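I'm having trouble masking a dataframe the way I need to. My dataframe describes products, where a single product can come in various formats or languages. It looks like: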
import pandas as pd
from numpy.random import choice
prods = [1234,1234,1234,1234,12344,12344,12344,12344,3462,3462,3462,3462,12314,12314,12314,12314,12857,12857,12857,12857]
formats = choice(['Hrd','Elc','Sft'],size=20)
language = choice(['Eng','Spa','Jpn','Chn','Port','Fnch','Rus'],size=20)
restricted = choice(range(5,9),size=20)
df = pd.DataFrame({'products': prods,'formats':formats,'language': language, 'restricted': restricted})
df['instances'] = df['products'].astype(str) + '-' + df['formats'] + '-' + df['language']
md = pd.MultiIndex.from_tuples(list(zip(df['products'],df['instances'])))
df = df.set_index(md)  # set_index returns a new frame, so assign the result
df
Out[1]:
formats language products restricted instances
1234 1234-Sft-Port Sft Port 1234 5 1234-Sft-Port
1234-Elc-Jpn Elc Jpn 1234 7 1234-Elc-Jpn
1234-Hrd-Jpn Hrd Jpn 1234 7 1234-Hrd-Jpn
1234-Hrd-Chn Hrd Chn 1234 5 1234-Hrd-Chn
12344 12344-Sft-Chn Sft Chn 12344 5 12344-Sft-Chn
12344-Hrd-Spa Hrd Spa 12344 7 12344-Hrd-Spa
12344-Elc-Jpn Elc Jpn 12344 6 12344-Elc-Jpn
12344-Sft-Port Sft Port 12344 5 12344-Sft-Port
3462 3462-Hrd-Jpn Hrd Jpn 3462 5 3462-Hrd-Jpn
3462-Hrd-Jpn Hrd Jpn 3462 7 3462-Hrd-Jpn
3462-Sft-Port Sft Port 3462 6 3462-Sft-Port
3462-Elc-Jpn Elc Jpn 3462 7 3462-Elc-Jpn
12314 12314-Sft-Rus Sft Rus 12314 5 12314-Sft-Rus
12314-Elc-Spa Elc Spa 12314 5 12314-Elc-Spa
12314-Hrd-Port Hrd Port 12314 7 12314-Hrd-Port
12314-Elc-Port Elc Port 12314 7 12314-Elc-Port
12857 12857-Elc-Jpn Elc Jpn 12857 8 12857-Elc-Jpn
12857-Elc-Spa Elc Spa 12857 5 12857-Elc-Spa
12857-Hrd-Chn Hrd Chn 12857 5 12857-Hrd-Chn
12857-Sft-Port Sft Port 12857 7 12857-Sft-Port
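How can I mask or index on multiple variables? I would like to specify something like "select the products where the electronic format is in Spanish and where one of the formats is a Russian hardcover". I can't simply mask my dataframe, e.g. df[(df['language'] == 'Spa') & (df['formats'] == 'Elc')], because that does not check whether the same product also has another format that is a hardcover.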
I have used a complicated groupby lambda function, but it is very slow on large dataframes (mine has over 200,000 rows):
mask = df.groupby('products')
# True for products that have both an Elc/Spa row and a Hrd/Rus row
mask.apply(lambda x:
'Elc' in x[x['language']=='Spa']['formats'].values and
'Hrd' in x[x['language']=='Rus']['formats'].values
)
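I have looked into df.query() and a few other methods/functions, but I can't seem to find a way to work with my dataframe grouped by product the way I need. Is there a better way?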
Answer 0 (score: 0)
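Since you did not use a deterministic random seed, I can't reproduce your exact results, but I can index by using the "or" operator | to take the union of two intersections: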
import pandas as pd
from numpy.random import RandomState

rand = RandomState(4321)
prods = [1234,1234,1234,1234,12344,12344,12344,12344,3462,3462,3462,3462,12314,12314,12314,12314,12857,12857,12857,12857]
formats = rand.choice(['Hrd','Elc','Sft'],size=20)
language = rand.choice(['Eng','Spa','Jpn','Chn','Port','Fnch','Rus'],size=20)
restricted = rand.choice(range(5,9),size=20)
df = pd.DataFrame({'products': prods,'formats':formats,'language': language, 'restricted': restricted})
df['instances'] = df['products'].astype(str) + '-' + df['formats'] + '-' + df['language']
md = pd.MultiIndex.from_tuples(list(zip(df['products'],df['instances'])))
df2 = df.set_index(md)
df2
Out[1]:
products formats language restricted instances
1234 1234-Elc-Spa 1234 Elc Spa 8 1234-Elc-Spa
1234-Sft-Rus 1234 Sft Rus 8 1234-Sft-Rus
1234-Hrd-Spa 1234 Hrd Spa 7 1234-Hrd-Spa
1234-Sft-Spa 1234 Sft Spa 7 1234-Sft-Spa
12344 12344-Hrd-Spa 12344 Hrd Spa 8 12344-Hrd-Spa
12344-Sft-Rus 12344 Sft Rus 5 12344-Sft-Rus
12344-Elc-Fnch 12344 Elc Fnch 7 12344-Elc-Fnch
12344-Elc-Spa 12344 Elc Spa 6 12344-Elc-Spa
3462 3462-Elc-Fnch 3462 Elc Fnch 8 3462-Elc-Fnch
3462-Sft-Jpn 3462 Sft Jpn 6 3462-Sft-Jpn
3462-Hrd-Port 3462 Hrd Port 6 3462-Hrd-Port
3462-Sft-Eng 3462 Sft Eng 8 3462-Sft-Eng
12314 12314-Elc-Spa 12314 Elc Spa 7 12314-Elc-Spa
12314-Hrd-Spa 12314 Hrd Spa 7 12314-Hrd-Spa
12314-Elc-Fnch 12314 Elc Fnch 7 12314-Elc-Fnch
12314-Hrd-Port 12314 Hrd Port 5 12314-Hrd-Port
12857 12857-Hrd-Port 12857 Hrd Port 7 12857-Hrd-Port
12857-Sft-Rus 12857 Sft Rus 5 12857-Sft-Rus
12857-Elc-Rus 12857 Elc Rus 6 12857-Elc-Rus
12857-Elc-Jpn 12857 Elc Jpn 8 12857-Elc-Jpn
There is surely a more elegant solution (or a shorter hack, such as concatenating the strings in products and formats and filtering on the result), but this works:
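# Rows matching either condition: (Elc and Spa) or (Sft and Rus)
filter_df = df2[((df2.formats == 'Elc') & (df2.language == 'Spa')) | ((df2.formats == 'Sft') & (df2.language == 'Rus'))]
# Number of matching rows per product (index level 0)
filter_groups = filter_df.groupby(level=0)['products'].count()
# Keep the products that have more than one matching row
filter_index = filter_groups[filter_groups > 1].index
# All rows belonging to the selected products
df3 = df2[df2.index.get_level_values(0).isin(filter_index)]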
Out[3]:
products formats language restricted instances
1234 1234-Elc-Spa 1234 Elc Spa 8 1234-Elc-Spa
1234-Sft-Rus 1234 Sft Rus 8 1234-Sft-Rus
1234-Hrd-Spa 1234 Hrd Spa 7 1234-Hrd-Spa
1234-Sft-Spa 1234 Sft Spa 7 1234-Sft-Spa
12344 12344-Hrd-Spa 12344 Hrd Spa 8 12344-Hrd-Spa
12344-Sft-Rus 12344 Sft Rus 5 12344-Sft-Rus
12344-Elc-Fnch 12344 Elc Fnch 7 12344-Elc-Fnch
12344-Elc-Spa 12344 Elc Spa 6 12344-Elc-Spa
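One caveat with the count-based filter above: a product with two rows matching the same condition (say, two Elc/Spa rows) and no row matching the other condition would also pass the > 1 test. A minimal sketch that checks each condition separately per product is below. The small frame is made-up illustrative data, not the poster's, and the variable names are my own:

```python
import pandas as pd

# Hand-made illustrative data (an assumption, not the original frame)
df = pd.DataFrame({
    'products': [1, 1, 2, 2, 3, 3],
    'formats':  ['Elc', 'Sft', 'Elc', 'Elc', 'Sft', 'Hrd'],
    'language': ['Spa', 'Rus', 'Spa', 'Spa', 'Rus', 'Eng'],
})

# One row-level boolean mask per condition
is_elc_spa = (df['formats'] == 'Elc') & (df['language'] == 'Spa')
is_sft_rus = (df['formats'] == 'Sft') & (df['language'] == 'Rus')

# Per product: does any row satisfy the condition? transform('any')
# broadcasts the per-group answer back onto every row of that group.
has_elc_spa = is_elc_spa.groupby(df['products']).transform('any')
has_sft_rus = is_sft_rus.groupby(df['products']).transform('any')

# Keep all rows of products that satisfy both conditions
selected = df[has_elc_spa & has_sft_rus]
print(selected)
```

Here product 2 has two Elc/Spa rows but no Sft/Rus row, so it is excluded, whereas a plain count of matching rows would have let it through. This avoids a Python-level lambda per group, which is likely to help on a 200,000-row frame, though I have not benchmarked it.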
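You asked to select products; if you only want to narrow the result down to the matching product instances, you will need to filter these results again.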
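As a rough sketch of that second filtering step, assuming the same two conditions as above and a small made-up frame (the data and variable names here are illustrative only):

```python
import pandas as pd

# Hand-made illustrative data (an assumption, not the original frame)
df = pd.DataFrame({
    'products': [1, 1, 1, 2, 2],
    'formats':  ['Elc', 'Sft', 'Hrd', 'Elc', 'Hrd'],
    'language': ['Spa', 'Rus', 'Spa', 'Spa', 'Rus'],
})

is_elc_spa = (df['formats'] == 'Elc') & (df['language'] == 'Spa')
is_sft_rus = (df['formats'] == 'Sft') & (df['language'] == 'Rus')

# Products that have at least one row for each condition
wanted = set(df.loc[is_elc_spa, 'products']) & set(df.loc[is_sft_rus, 'products'])

# Step 1: every row of the selected products
product_rows = df[df['products'].isin(wanted)]

# Step 2: only the instances that themselves match one of the conditions
matches = is_elc_spa | is_sft_rus
instance_rows = product_rows[matches[product_rows.index]]
print(instance_rows)
```

The first step keeps all three rows of product 1; the second step drops its Hrd/Spa row, leaving only the Elc/Spa and Sft/Rus instances.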