Question

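I have a df like the one shown below.

p1_conf, p2_conf and p3_conf hold the confidence of models p1, p2 and p3 respectively.

I would like to know how to pick, for each row, the prediction with the highest confidence and store it in some new columns. So the result would be:

You can use the df below as the original df: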

df = pd.DataFrame({"id": [1,2,3,4,5],
                "Name": ["Dave","Max","Joe","Rose","Mark"],
                "model1":["Irish","German","USA","Japan","China"],
                "confidence1": [0.9,.99,.83,.45,.51],
                "prediction1": [True,False,True,False,False],
                "model2":["Oman","Nigeria","India","Russia","Brazil"],
                "confidence2": [0.1,.25,.26,.41,.01],
                "prediction2": [False,True,False,False,False],
                "model3":["Egypt","Cameron","Netherland","Canada","Mexcio"],
                "confidence3": [0.01,.23,.12,.34,.61],
                "prediction3": [True,False,True,True,False]})

The result should look like this:

df1 = pd.DataFrame({"id": [1,2,3,4,5],
                 "Name":["Dave","Max","Joe","Rose","Mark"],
                 "model_name":["1","2","1","3",None],
                 "predicted_gener":["Irish","Nigeria","USA","Canada",None],
                 "confidence":[0.9,0.25,.83,0.34,None],
                 "prediction":[True,True,True,True,None]})

Thanks for your help.

Answer 1

I updated my answer to match the new information you provided. Hope this helps.

import pandas as pd

df = pd.DataFrame({"id": [1,2,3,4,5],
                   "Name": ["Dave","Max","Joe","Rose","Mark"],
                   "model1":["Irish","German","USA","Japan","China"],
                   "confidence1": [0.9,.99,.83,.45,.51],
                   "prediction1": [True,False,True,False,False],
                   "model2":["Oman","Nigeria","India","Russia","Brazil"],
                   "confidence2": [0.1,.25,.26,.41,.01],
                   "prediction2": [False,True,False,False,False],
                   "model3":["Egypt","Cameron","Netherland","Canada","Mexcio"],
                   "confidence3": [0.01,.23,.12,.34,.61],
                   "prediction3": [True,False,True,True,False]})

tweet_id = []
name = []
Model = []
Breed = []
Confidence = []

for i in range(len(df['id'])):
    confidences = [df['confidence{0}'.format(model)][i] for model in range(1,4)]
    models = ['p{0}'.format(model) for model in range(1,4)]
    breeds = [df['model{0}'.format(model)][i] for model in range(1,4)]
    isDog = [df['prediction{0}'.format(model)][i] for model in range(1,4)]

    best_one = max(zip(confidences, models, breeds, isDog), key=lambda M: M[0])

    model = best_one[1]
    breed = best_one[2]
    confidence = best_one[0]

    if not (True in isDog):
        model = breed = confidence = 'NaN'

    tweet_id.append(df['id'][i])
    name.append(df['Name'][i])
    Model.append(model)
    Breed.append(breed)
    Confidence.append(confidence)

print(pd.DataFrame({
                'tweet_id': tweet_id,
                'name': name,
                'Model': Model,
                'Breed': Breed,
                'Confidence': Confidence
                }))

Output

   tweet_id  name Model   Breed Confidence
0         1  Dave    p1   Irish        0.9
1         2   Max    p1  German       0.99
2         3   Joe    p1     USA       0.83
3         4  Rose    p1   Japan       0.45
4         5  Mark   NaN     NaN        NaN
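
Note that the rows for Max and Rose in this output differ from the result requested in the question (Nigeria at 0.25 and Canada at 0.34). The loop takes the overall maximum confidence and only afterwards checks whether any prediction is True. A minimal sketch of a variant that first discards candidates whose prediction is False, assuming the column names of the sample df; the helper name `best_true_prediction` is hypothetical and the output column names follow Answer 2:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Name": ["Dave", "Max", "Joe", "Rose", "Mark"],
    "model1": ["Irish", "German", "USA", "Japan", "China"],
    "confidence1": [0.9, .99, .83, .45, .51],
    "prediction1": [True, False, True, False, False],
    "model2": ["Oman", "Nigeria", "India", "Russia", "Brazil"],
    "confidence2": [0.1, .25, .26, .41, .01],
    "prediction2": [False, True, False, False, False],
    "model3": ["Egypt", "Cameron", "Netherland", "Canada", "Mexcio"],
    "confidence3": [0.01, .23, .12, .34, .61],
    "prediction3": [True, False, True, True, False],
})


def best_true_prediction(row):
    # Keep only the candidates whose prediction flag is True.
    candidates = [
        (row[f"confidence{k}"], str(k), row[f"model{k}"])
        for k in range(1, 4)
        if row[f"prediction{k}"]
    ]
    if not candidates:
        # No model predicted True for this row: leave every field empty.
        return pd.Series({"model_name": None, "predicted_gender": None,
                          "confidence": None, "prediction": None})
    # Among the remaining candidates, pick the highest confidence.
    confidence, model_name, label = max(candidates, key=lambda c: c[0])
    return pd.Series({"model_name": model_name, "predicted_gender": label,
                      "confidence": confidence, "prediction": True})


result = pd.concat([df[["id", "Name"]],
                    df.apply(best_true_prediction, axis=1)], axis=1)
print(result)
```

A row-wise `apply` is easy to read but not fast; for large frames a vectorized approach is likely preferable.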

Answer 2

Here is one way to do it:

import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1,2,3,4,5],
                 "Name": ["Dave","Max","Joe","Rose","Mark"],
                 "model1":["Irish","German","USA","Japan","China"],
                 "confidence1": [0.9,.99,.83,.45,.51],
                 "prediction1": [True,False,True,False,False],
                 "model2":["Oman","Nigeria","India","Russia","Brazil"],
                 "confidence2": [0.1,.25,.26,.41,.01],
                 "prediction2": [False,True,False,False,False],
                 "model3":["Egypt","Cameron","Netherland","Canada","Mexcio"],
                 "confidence3": [0.01,.23,.12,.34,.61],
                 "prediction3": [True,False,True,True,False]})

df1 = df.copy()
cols = df1.filter(regex='model').columns

df1[cols] = df1[cols].apply(lambda x: x + "_" + x.index.str[-1], 1)

vals = df1.filter(regex='mod|conf|pred').values.reshape(-1,3,3)

lst = []
for i in vals:
    try:
        lst.append(max([j for j in i if True in j], key=lambda x: x[1]))
    except ValueError:  # max() of an empty list: no model predicted True
        lst.append([np.nan])

df1 = df1.join(pd.DataFrame(lst)).drop(df1.filter(regex='mod|conf|pred'), axis=1)
df1.columns = ['id', 'name', 'predicted_gender', 'confidence', 'prediction']

df1[['predicted_gender','model_name']]= df1['predicted_gender'].str.split('_',expand=True)

print(df1)

   id  name predicted_gender  confidence prediction model_name
0   1  Dave            Irish        0.90       True          1
1   2   Max          Nigeria        0.25       True          2
2   3   Joe              USA        0.83       True          1
3   4  Rose           Canada        0.34       True          3
4   5  Mark              NaN         NaN       None        NaN
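
One caveat with the loop above, as far as I can tell: `True in j` compares every element of the triple with True, so a confidence of exactly 1.0 would also count as a match (1.0 == True in Python) even when the prediction itself is False. A sketch of an alternative that masks on the prediction columns explicitly with NumPy, assuming the same sample df and exactly three models; the intermediate names (`masked`, `best`, `has_true`, `out`) are my own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Name": ["Dave", "Max", "Joe", "Rose", "Mark"],
    "model1": ["Irish", "German", "USA", "Japan", "China"],
    "confidence1": [0.9, .99, .83, .45, .51],
    "prediction1": [True, False, True, False, False],
    "model2": ["Oman", "Nigeria", "India", "Russia", "Brazil"],
    "confidence2": [0.1, .25, .26, .41, .01],
    "prediction2": [False, True, False, False, False],
    "model3": ["Egypt", "Cameron", "Netherland", "Canada", "Mexcio"],
    "confidence3": [0.01, .23, .12, .34, .61],
    "prediction3": [True, False, True, True, False],
})

models = df[["model1", "model2", "model3"]].to_numpy()
conf = df[["confidence1", "confidence2", "confidence3"]].to_numpy(dtype=float)
pred = df[["prediction1", "prediction2", "prediction3"]].to_numpy(dtype=bool)

# Hide confidences whose prediction is False so they can never win the argmax.
masked = np.where(pred, conf, -np.inf)
best = masked.argmax(axis=1)      # position (0..2) of the winning model per row
has_true = pred.any(axis=1)       # False where no model predicted True
rows = np.arange(len(df))

out = df[["id", "Name"]].copy()
# .where(has_true) blanks out rows in which no prediction was True.
out["model_name"] = pd.Series((best + 1).astype(str), index=df.index).where(has_true)
out["predicted_gender"] = pd.Series(models[rows, best], index=df.index).where(has_true)
out["confidence"] = pd.Series(conf[rows, best], index=df.index).where(has_true)
print(out)
```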

Answer 3

The code below adds a new column holding the highest score:

df['Confidence'] = df[['p1_conf','p2_conf','p3_conf']].max(axis=1)

You can then drop these 6 columns:

df = df.drop(columns=['p1','p1_conf','p2','p2_conf','p3','p3_conf'])
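
The row-wise max alone does not say which model produced it. A sketch that uses `DataFrame.idxmax` to recover the winning model and its label, written against the sample df from the question (columns `confidence1` to `confidence3`) rather than the `p1_conf` names; like the snippet above it ignores the prediction flags, and `best_label` is a column name I made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Name": ["Dave", "Max", "Joe", "Rose", "Mark"],
    "model1": ["Irish", "German", "USA", "Japan", "China"],
    "confidence1": [0.9, .99, .83, .45, .51],
    "prediction1": [True, False, True, False, False],
    "model2": ["Oman", "Nigeria", "India", "Russia", "Brazil"],
    "confidence2": [0.1, .25, .26, .41, .01],
    "prediction2": [False, True, False, False, False],
    "model3": ["Egypt", "Cameron", "Netherland", "Canada", "Mexcio"],
    "confidence3": [0.01, .23, .12, .34, .61],
    "prediction3": [True, False, True, True, False],
})

conf_cols = ["confidence1", "confidence2", "confidence3"]

# Highest confidence per row, and the name of the column that holds it.
df["Confidence"] = df[conf_cols].max(axis=1)
best_col = df[conf_cols].idxmax(axis=1)   # e.g. "confidence1"

# The trailing digit of the column name identifies the model.
df["model_name"] = best_col.str[-1]
df["best_label"] = [df.at[i, f"model{k}"] for i, k in df["model_name"].items()]

# Drop the per-model columns once the summary columns exist.
per_model = [f"{prefix}{k}" for k in range(1, 4)
             for prefix in ("model", "confidence", "prediction")]
slim = df.drop(columns=per_model)
print(slim)
```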
