I have a large DataFrame with 1000 columns; here is a shorter version:
import pandas as pd
largedf = pd.DataFrame({'arow': ['row1', 'row2', 'row3', 'row4'], 'bread': ['b', 'b', 'b', 'a'], 'fruit': ['c', 'b', 'b', 'a'],
                        'tea': ['b', 'a', 'b', 'a'], 'water': ['b', 'c', 'b', 'c']})
   arow bread fruit tea water
0  row1     b     c   b     b
1  row2     b     b   a     c
2  row3     b     b   b     b
3  row4     a     a   a     c
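The rows I want to keep are those in which exactly one category contains no 'b', where the categories are defined as lists (again, in reality there are more than two lists):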
food = ['bread', 'fruit']
drink = ['tea', 'water']
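row2 is the only row to keep in this case: row1 has no category without 'b', row3 is all 'b', and row4 has no 'b' at all.
The preferred output would have one column naming the single category without 'b', and one giving the percentage of 'b' in that row.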
Answer 0 (score: 2)
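Count the boolean positions of 'b' within each of the lists you provided:
largedf['drink'] = (largedf[drink] == 'b').sum(axis=1)
largedf['food'] = (largedf[food] == 'b').sum(axis=1)
Now filter on your condition. In this toy example, the product of the counts must equal zero and their sum must be greater than zero.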
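A minimal sketch of the filter just described, assuming the per-category counts are built as in this answer. The names `food_b`, `drink_b` and `kept` are illustrative and not taken from the answer, and the counts are held in separate Series rather than added to the frame:

```python
import pandas as pd

largedf = pd.DataFrame({
    'arow': ['row1', 'row2', 'row3', 'row4'],
    'bread': ['b', 'b', 'b', 'a'],
    'fruit': ['c', 'b', 'b', 'a'],
    'tea': ['b', 'a', 'b', 'a'],
    'water': ['b', 'c', 'b', 'c'],
})
food = ['bread', 'fruit']
drink = ['tea', 'water']

# Number of 'b' values per row within each category.
food_b = (largedf[food] == 'b').sum(axis=1)
drink_b = (largedf[drink] == 'b').sum(axis=1)

# Product == 0: at least one category has no 'b' in the row.
# Sum > 0: at least one category does contain a 'b'.
kept = largedf[((food_b * drink_b) == 0) & ((food_b + drink_b) > 0)]
print(kept)
```

With more than two lists, a zero product only says that at least one category lacks 'b', so the condition would probably need to count the b-free categories explicitly instead.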
Answer 1 (score: 2)
Here I propose a solution that tries to show that your DataFrame would benefit from a MultiIndex.
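import pandas as pd
largedf = pd.DataFrame({'arow': ['row1', 'row2', 'row3', 'row4'], 'bread': ['b', 'b', 'b', 'a'], 'fruit': ['c', 'b', 'b', 'a'],
                        'tea': ['b', 'a', 'b', 'a'], 'water': ['b', 'c', 'b', 'c']})
largedf.set_index('arow', inplace=True)
food = ['bread', 'fruit']
drink = ['tea', 'water']
# named categories rather than dict, to avoid shadowing the built-in
categories = {'food': food, 'drink': drink}
# build (category, column) pairs; dicts keep insertion order in Python 3.7+,
# so the pairs line up with the existing column order
l = []
for k, v in categories.items():
    for y in v:
        l.append((k, y))
largedf.columns = pd.MultiIndex.from_tuples(l)
print(largedf)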
      food       drink
     bread fruit   tea water
arow
row1     b     c     b     b
row2     b     b     a     c
row3     b     b     b     b
row4     a     a     a     c
idx = pd.IndexSlice
# 1 if the category contains at least one 'b' in that row, else 0
cond1 = (largedf.loc[:, idx['food']] == 'b').any(axis=1) * 1
cond2 = (largedf.loc[:, idx['drink']] == 'b').any(axis=1) * 1
# you want rows where (cond1 + cond2) == 1
# fraction of 'b' per row; 4 is the number of data columns in this example
largedf[('perc', 'perc')] = largedf.apply(lambda x: (x == 'b').sum() / 4., axis=1)
print(largedf.join(pd.DataFrame(((cond1 + cond2) == 1), columns=[('match', 'match')])))
      food       drink         perc  match
     bread fruit   tea water   perc  match
arow
row1     b     c     b     b 0.7500  False
row2     b     b     a     c 0.5000   True
row3     b     b     b     b 1.0000  False
row4     a     a     a     c 0.0000  False
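The `cond1 + cond2` test above is written for two categories, where "exactly one category has a 'b'" and "exactly one category has no 'b'" coincide. Below is a rough sketch of the second reading for any number of categories, using a plain dict rather than the MultiIndex. The names `categories`, `no_b`, `not_b` and `perc_b` are assumptions made for illustration, not part of the answer:

```python
import pandas as pd

largedf = pd.DataFrame({
    'arow': ['row1', 'row2', 'row3', 'row4'],
    'bread': ['b', 'b', 'b', 'a'],
    'fruit': ['c', 'b', 'b', 'a'],
    'tea': ['b', 'a', 'b', 'a'],
    'water': ['b', 'c', 'b', 'c'],
}).set_index('arow')

# Hypothetical mapping; extend it with as many categories as needed.
categories = {'food': ['bread', 'fruit'], 'drink': ['tea', 'water']}

# One boolean column per category: True where that category has no 'b' in the row.
no_b = pd.DataFrame({
    name: (largedf[cols] != 'b').all(axis=1)
    for name, cols in categories.items()
})

# Keep rows in which exactly one category is free of 'b'.
keep = no_b.sum(axis=1) == 1

all_cols = [c for cols in categories.values() for c in cols]
result = pd.DataFrame({
    # idxmax picks the column holding the single 1, i.e. the b-free category.
    'not_b': no_b[keep].astype(int).idxmax(axis=1),
    # Share of 'b' across all categorised columns, as a percentage.
    'perc_b': (largedf.loc[keep, all_cols] == 'b').mean(axis=1) * 100,
})
print(result)
```

Using the mean over the categorised columns avoids hard-coding the number of columns, which matters once the frame has 1000 of them.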
Answer 2 (score: 0)
Maybe something like this, but you need to add your own logic to the filter:
import pandas as pd

def fltr(df):
    # empty result frame, same index as df
    dfR = pd.DataFrame(index=df.index)
    # Insert your logic here
    for i, row in df.iterrows():
        if row['bread'] == 'b' and row['fruit'] == 'b':
            # just copy the row in this case
            for k, v in row.items():
                dfR.loc[i, k] = v
            # add single col. items
            #dfR.loc[i, 'bread'] = "b"
            #dfR.loc[i, 'fruit'] = "f"
            # etc
    return dfR

food = ['bread', 'fruit']
drink = ['tea', 'water']
largedf = pd.DataFrame({'arow': ['row1', 'row2', 'row3', 'row4'],
                        'bread': ['b', 'b', 'b', 'a'], 'fruit': ['c', 'b', 'b', 'a'],
                        'tea': ['b', 'a', 'b', 'a'], 'water': ['b', 'c', 'b', 'c']})
print(largedf)
resultDF = largedf.pipe(fltr)
print(resultDF)
   arow bread fruit tea water
0  row1     b     c   b     b
1  row2     b     b   a     c
2  row3     b     b   b     b
3  row4     a     a   a     c
   arow bread fruit  tea water
0   NaN   NaN   NaN  NaN   NaN
1  row2     b     b    a     c
2  row3     b     b    b     b
3   NaN   NaN   NaN  NaN   NaN
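For a frame with 1000 columns, the row-by-row loop could probably be replaced by a boolean mask. This is only a sketch of the same placeholder condition used in `fltr` (both food columns equal to 'b'); the names `mask`, `filtered` and `padded` are illustrative:

```python
import pandas as pd

largedf = pd.DataFrame({
    'arow': ['row1', 'row2', 'row3', 'row4'],
    'bread': ['b', 'b', 'b', 'a'],
    'fruit': ['c', 'b', 'b', 'a'],
    'tea': ['b', 'a', 'b', 'a'],
    'water': ['b', 'c', 'b', 'c'],
})

# Same condition as the loop in fltr: both food columns equal 'b'.
mask = (largedf['bread'] == 'b') & (largedf['fruit'] == 'b')

# Only the matching rows.
filtered = largedf[mask]

# Matching rows placed back on the original index, so the
# non-matching rows appear as NaN, as in resultDF above.
padded = filtered.reindex(largedf.index)
print(padded)
```

Swapping in a different condition only means changing how `mask` is built.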