Set a pandas DataFrame winner column based on the majority value of three other columns

Time: 2017-05-15 23:58:43

Tags: python pandas

I have a pandas df like this:

id   Vote1     Vote2     Vote3
123  Positive  Negative  Positive
223  Positive  Negative  Neutral
323  Positive  Negative  Negative
423  Positive  Positive

I want to add another column named winner, which will be set to the majority vote. If there is a tie, the first vote is used, as shown for id = 223.

So the resulting df should be:

id   Vote1     Vote2     Vote3     winner
123  Positive  Negative  Positive  Positive
223  Positive  Negative  Neutral   Positive
323  Positive  Negative  Negative  Negative
423  Positive  Positive            Positive

This may be related to: Update Pandas Cells based on Column Values and Other Columns

3 Answers:

Answer 0: (score: 2)

You can do this row by row, as follows:

import pandas as pd
import numpy as np

# Create the dataframe
df = pd.DataFrame()
df['id']=[123,223,323,423]
df['Vote1']=['Positive']*4
df['Vote2']=['Negative']*3+['Positive']
df['Vote3']=['Positive','Neutral','Negative','']

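mostCommonVote = []
for row in df[['Vote1', 'Vote2', 'Vote3']].values:
    # Distinct votes in this row and how often each one occurs
    votes, values = np.unique(row, return_counts=True)
    if np.all(values <= 1):
        # Every vote is different (a tie): fall back to the first vote
        mostCommonVote.append(row[0])
    else:
        # Otherwise take the vote with the highest count
        mostCommonVote.append(votes[np.argmax(values)])

df['Winner'] = mostCommonVote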

Result:

 df:
    id     Vote1     Vote2     Vote3    Winner
0  123  Positive  Negative  Positive  Positive
1  223  Positive  Negative   Neutral  Positive
2  323  Positive  Negative  Negative  Negative
3  423  Positive  Positive            Positive

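It may not be the most elegant solution, but it is quite simple. It uses the numpy function unique, which can return the count of each unique string in a row.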
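To illustrate what unique returns for a single row, here is a minimal sketch. The row is the id = 323 example from the question; the variable names are chosen only for this illustration.

```python
import numpy as np

# One row of votes, taken from the id = 323 example
row = np.array(['Positive', 'Negative', 'Negative'])

# np.unique returns the sorted distinct values and, with
# return_counts=True, how many times each one occurs
votes, counts = np.unique(row, return_counts=True)

print(votes)   # ['Negative' 'Positive']
print(counts)  # [2 1]

# The most frequent value sits at the position of the largest count
winner = votes[np.argmax(counts)]
print(winner)  # Negative
```

Because unique sorts its output, argmax alone would break a three-way tie alphabetically rather than by first vote, which is why the loop checks for the all-different case separately.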

Answer 1: (score: 1)

Another pandas solution, without a loop:

df = df.set_index('id')
rep = {'Positive':1,'Negative':-1,'Neutral':0}
df1 = df.replace(rep)

df = df.assign(Winner=np.where(df1.sum(axis=1) > 0,'Positive',np.where(df1.sum(axis=1) < 0, 'Negative', df.iloc[:,0])))
print(df)

Output:

        Vote1     Vote2     Vote3    Winner
id                                         
123  Positive  Negative  Positive  Positive
223  Positive  Negative   Neutral  Positive
323  Positive  Negative  Negative  Negative
423  Positive  Positive       NaN  Positive

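Explanation

df.assign is a method that creates the column in a copy of the original dataframe, so you have to assign the result back to df. The column is named Winner, hence 'Winner='.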
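A small sketch of that copy behaviour, using a hypothetical two-row frame rather than the data from the question:

```python
import pandas as pd

# Hypothetical frame, only to show how assign behaves
df = pd.DataFrame({'Vote1': ['Positive', 'Negative']})

# assign returns a new DataFrame; the original is left unchanged
result = df.assign(Winner=df['Vote1'])

print(list(df.columns))      # ['Vote1']
print(list(result.columns))  # ['Vote1', 'Winner']
```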

Next, np.where is used as a nested if statement: np.where(cond, result, else)

np.where(df1.sum(axis=1) > 0,  # sum the numeric dataframe df1 by row
         'Positive',  # if true: more Positive than Negative votes
         np.where(df1.sum(axis=1) < 0,  # nested if, evaluated when the first condition is false
                  'Negative',  # the row sum is less than 0
                  df.iloc[:,0]  # sum == 0: take the first vote from that row
                  )
         )
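The same nesting can be tried on plain arrays. In this sketch the scores are the row sums that the four example rows would give under the Positive=1, Negative=-1, Neutral=0 mapping (the missing vote contributes nothing); the names scores and first_vote are assumptions made for the illustration.

```python
import numpy as np

# Row sums for ids 123, 223, 323, 423 under the mapping above
scores = np.array([1, 0, -1, 2])
# The first vote of each row, used as the tie-breaker
first_vote = np.array(['Positive', 'Positive', 'Positive', 'Positive'])

winner = np.where(scores > 0, 'Positive',
                  np.where(scores < 0, 'Negative', first_vote))
print(winner)  # ['Positive' 'Positive' 'Negative' 'Positive']
```

Note that this scoring compares Positive against Negative only, so it matches the majority rule for the example data but treats Neutral votes as abstentions.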

Answer 2: (score: 0)

I wrote a function and applied it to the df. This is usually a bit faster than a plain loop.

import pandas as pd
import numpy as np

def vote(row):
    # Count Positive and Negative votes in the row
    pos = np.sum(row.values == 'Positive')
    neg = np.sum(row.values == 'Negative')
    if pos > neg:
        return 'Positive'
    elif pos < neg:
        return 'Negative'
    else:
        # Tie: fall back to the first vote
        return row['Vote1']

# Create the dataframe
df = pd.DataFrame()
df['id']=[123,223,323,423]
df['Vote1']=['Positive']*4
df['Vote2']=['Negative']*3+['Positive']
df['Vote3']=['Positive','Neutral','Negative','']
df = df.set_index('id')
df['Winner'] = df.apply(vote,axis=1)

Result

Out[41]: 
        Vote1     Vote2     Vote3    Winner
id                                         
123  Positive  Negative  Positive  Positive
223  Positive  Negative   Neutral  Positive
323  Positive  Negative  Negative  Negative
423  Positive  Positive            Positive