Question

I have a large df with 1000 columns; here is a shorter version:

largedf = pd.DataFrame({'arow': ['row1', 'row2', 'row3', 'row4'], 'bread': ['b', 'b', 'b', 'a'], 'fruit': ['c', 'b', 'b', 'a'], 
                   'tea': ['b', 'a', 'b', 'a'], 'water': ['b', 'c', 'b', 'c']})
   arow     bread  fruit tea   water
0  row1     b      c     b     b
1  row2     b      b     a     c
2  row3     b      b     b     b
3  row4     a      a     a     c

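I want to keep only the rows in which b appears in one category but not in the other, where the categories are defined as lists (again, in reality there are more than 2 lists):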

food = ['bread', 'fruit']
drink = ['tea', 'water']

row2 is the only row to keep in this case: row1 has no category without b, row3 is all b, and row4 has no b at all.

The preferred output would have one column naming the single not-b category and one with the percentage of not-b values in that row:

Answer 1

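Count, for each of the lists you provided, the positions where the value is b:

largedf['drink'] = (largedf[drink] == 'b').sum(axis=1)
largedf['food'] = (largedf[food] == 'b').sum(axis=1)

Now filter according to your conditions. In this toy example, the product of the counts must equal zero and their sum must be greater than zero.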
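The filter step described in this answer could be sketched roughly as follows. This is a minimal, self-contained version that assumes the `largedf`, `food` and `drink` objects from the question; the `categories` dict and the separate `counts` frame are names introduced here for illustration only, so that the same idea extends to more than two lists.

```python
import pandas as pd

largedf = pd.DataFrame({'arow': ['row1', 'row2', 'row3', 'row4'],
                        'bread': ['b', 'b', 'b', 'a'], 'fruit': ['c', 'b', 'b', 'a'],
                        'tea': ['b', 'a', 'b', 'a'], 'water': ['b', 'c', 'b', 'c']})
food = ['bread', 'fruit']
drink = ['tea', 'water']

# Hypothetical mapping (not in the original answer): category name -> its columns
categories = {'food': food, 'drink': drink}

# One column of b-counts per category, kept apart from largedf itself
counts = pd.DataFrame({name: (largedf[cols] == 'b').sum(axis=1)
                       for name, cols in categories.items()})

# Product == 0: at least one category has no b.
# Sum > 0: at least one category does contain a b.
mask = (counts.prod(axis=1) == 0) & (counts.sum(axis=1) > 0)
result = largedf[mask]
print(result)
```

On the toy data only row2 passes both conditions. With more than two lists, the product/sum test keeps any row that has at least one b-free category and at least one category containing b; whether that matches the intended rule is an assumption.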

Answer 2

I propose a solution here that tries to show that your DataFrame would benefit from a MultiIndex.

import pandas as pd

largedf = pd.DataFrame({'arow': ['row1', 'row2', 'row3', 'row4'], 'bread': ['b', 'b', 'b', 'a'], 'fruit': ['c', 'b', 'b', 'a'],
                   'tea': ['b', 'a', 'b', 'a'], 'water': ['b', 'c', 'b', 'c']})

largedf.set_index('arow', inplace=True)

food = ['bread', 'fruit']
drink = ['tea', 'water']
# category name -> columns in that category
categories = {'food': food, 'drink': drink}

# Build (category, column) pairs; their order must match the existing column order
l = []
for k, v in categories.items():
    for y in v:
        l.append((k, y))

largedf.columns = pd.MultiIndex.from_tuples(l)
print(largedf)

      food       drink      
     bread fruit   tea water
arow                        
row1     b     c     b     b
row2     b     b     a     c
row3     b     b     b     b
row4     a     a     a     c

idx = pd.IndexSlice
# 1 if the row has at least one 'b' in the category, otherwise 0
cond1 = (largedf.loc[:, idx['food']] == 'b').any(axis=1) * 1
cond2 = (largedf.loc[:, idx['drink']] == 'b').any(axis=1) * 1

# you want rows where (cond1 + cond2) == 1
largedf[('perc', 'perc')] = largedf.apply(lambda x: (x == 'b').sum() / 4., axis=1)
print(largedf.join(pd.DataFrame(((cond1 + cond2) == 1), columns=[('match', 'match')])))

      food       drink         perc  match
     bread fruit   tea water   perc  match
arow                                      
row1     b     c     b     b 0.7500  False
row2     b     b     a     c 0.5000   True
row3     b     b     b     b 1.0000  False
row4     a     a     a     c 0.0000  False
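Since the real data has more than two lists, the `cond1 + cond2` step may be easier to maintain as a loop over the top level of the MultiIndex. The following is only a sketch under assumptions: `is_b`, `has_b`, `perc` and `match` are illustrative names, and "exactly one category contains a b" is taken as the generalization of `(cond1 + cond2) == 1`.

```python
import pandas as pd

largedf = pd.DataFrame({'arow': ['row1', 'row2', 'row3', 'row4'],
                        'bread': ['b', 'b', 'b', 'a'], 'fruit': ['c', 'b', 'b', 'a'],
                        'tea': ['b', 'a', 'b', 'a'], 'water': ['b', 'c', 'b', 'c']})
largedf = largedf.set_index('arow')

categories = {'food': ['bread', 'fruit'], 'drink': ['tea', 'water']}
largedf.columns = pd.MultiIndex.from_tuples(
    [(cat, col) for cat, cols in categories.items() for col in cols])

# Boolean frame: True wherever the cell equals 'b'
is_b = largedf == 'b'

# One boolean column per top-level category: does the row contain any b there?
top_level = is_b.columns.get_level_values(0).unique()
has_b = pd.DataFrame({cat: is_b[cat].any(axis=1) for cat in top_level})

# Share of b values per row, and the "exactly one category has a b" flag
perc = is_b.mean(axis=1)
match = has_b.sum(axis=1) == 1

print(pd.DataFrame({'perc': perc, 'match': match}))
```

Adding a new category then only means adding an entry to `categories`; the conditions themselves do not need to be rewritten.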

Answer 3

Maybe something like this, but you need to add your own logic to the filter:

def fltr(df):
    # empty result frame, same index as df
    dfR = pd.DataFrame(index=df.index)
    # Insert your logic here
    for i, row in df.iterrows():
        if row['bread'] == 'b' and row['fruit'] == 'b':
            # just copy the row in this case 
            for k, v in row.items():
                dfR.loc[i, k] = v
            # add single col. items
            #dfR.loc[i, 'bread'] = "b"
            #dfR.loc[i, 'fruit'] = "f"
    # etc

    return dfR

food = ['bread', 'fruit']
drink = ['tea', 'water']
largedf = pd.DataFrame({'arow': ['row1', 'row2', 'row3', 'row4'],
                'bread': ['b', 'b', 'b', 'a'], 'fruit': ['c', 'b', 'b', 'a'], 
               'tea': ['b', 'a', 'b', 'a'], 'water': ['b', 'c', 'b', 'c']})
print(largedf)
resultDF = largedf.pipe(fltr)
print(resultDF)


   arow bread fruit tea water
0  row1     b     c   b     b
1  row2     b     b   a     c
2  row3     b     b   b     b
3  row4     a     a   a     c

   arow bread fruit  tea water
0   NaN   NaN   NaN  NaN   NaN
1  row2     b     b    a     c
2  row3     b     b    b     b
3   NaN   NaN   NaN  NaN   NaN
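For a frame with 1000 columns, the row-by-row loop may become slow. The same condition (every column in a list equals b) can also be expressed as a boolean mask. The sketch below is one possible vectorized variant, not part of the original answer; `fltr_mask` and its `cols` parameter are hypothetical names.

```python
import pandas as pd

def fltr_mask(df, cols):
    """Return a boolean Series: True where every column in `cols` equals 'b'."""
    return (df[cols] == 'b').all(axis=1)

food = ['bread', 'fruit']
largedf = pd.DataFrame({'arow': ['row1', 'row2', 'row3', 'row4'],
                        'bread': ['b', 'b', 'b', 'a'], 'fruit': ['c', 'b', 'b', 'a'],
                        'tea': ['b', 'a', 'b', 'a'], 'water': ['b', 'c', 'b', 'c']})

mask = fltr_mask(largedf, food)

# Keep only the matching rows
filtered = largedf.loc[mask]

# Or keep the original shape, with NaN in the rows that did not match
padded = filtered.reindex(largedf.index)

print(filtered)
print(padded)
```

`filtered` drops the non-matching rows, while `padded` mirrors the NaN-filled layout printed above.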

Pandas: filter rows based on lists of columns

3 answers: