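I have a df as shown below. p1_conf, p2_conf and p3_conf show the confidence of models p1, p2 and p3 respectively.
For each row, I want to select the prediction with the highest confidence and store it in some new columns. So the result would be:
You can use the df below as the original df: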
df = pd.DataFrame({"id": [1,2,3,4,5],
                   "Name": ["Dave","Max","Joe","Rose","Mark"],
                   "model1":["Irish","German","USA","Japan","China"],
                   "confidence1": [0.9,.99,.83,.45,.51],
                   "prediction1": [True,False,True,False,False],
                   "model2":["Oman","Nigeria","India","Russia","Brazil"],
                   "confidence2": [0.1,.25,.26,.41,.01],
                   "prediction2": [False,True,False,False,False],
                   "model3":["Egypt","Cameron","Netherland","Canada","Mexcio"],
                   "confidence3": [0.01,.23,.12,.34,.61],
                   "prediction3": [True,False,True,True,False]})
The result should look like this:
df1 = pd.DataFrame({"id": [1,2,3,4,5],
                    "Name":["Dave","Max","Joe","Rose","Mark"],
                    "model_name":["1","2","1","3",None],
                    "predicted_gener":["Irish","Nigeria","USA","Canada",None],
                    "confidence":[0.9,0.25,.83,0.34,None],
                    "prediction":[True,True,True,True,None]})
Thanks for your help.
Answer 0 (score: 1)
I updated my answer to match the new information you provided. Hope this helps.
import pandas as pd

df = pd.DataFrame({"id": [1,2,3,4,5],
                   "Name": ["Dave","Max","Joe","Rose","Mark"],
                   "model1":["Irish","German","USA","Japan","China"],
                   "confidence1": [0.9,.99,.83,.45,.51],
                   "prediction1": [True,False,True,False,False],
                   "model2":["Oman","Nigeria","India","Russia","Brazil"],
                   "confidence2": [0.1,.25,.26,.41,.01],
                   "prediction2": [False,True,False,False,False],
                   "model3":["Egypt","Cameron","Netherland","Canada","Mexcio"],
                   "confidence3": [0.01,.23,.12,.34,.61],
                   "prediction3": [True,False,True,True,False]})

tweet_id = []
name = []
Model = []
Breed = []
Confidence = []
for i in range(len(df['id'])):
    # Gather confidence, model tag, label and prediction flag of each model for row i
    confidences = [df['confidence{0}'.format(model)][i] for model in range(1,4)]
    models = ['p{0}'.format(model) for model in range(1,4)]
    breeds = [df['model{0}'.format(model)][i] for model in range(1,4)]
    isDog = [df['prediction{0}'.format(model)][i] for model in range(1,4)]
    # Take the entry with the highest confidence, whatever its prediction flag is
    best_one = max(zip(confidences, models, breeds, isDog), key=lambda M: M[0])
    model = best_one[1]
    breed = best_one[2]
    confidence = best_one[0]
    # Blank the row when none of the three predictions is True
    # (the string 'NaN' is only a placeholder, not a float NaN)
    if not any(isDog):
        model = breed = confidence = 'NaN'
    tweet_id.append(df['id'][i])
    name.append(df['Name'][i])
    Model.append(model)
    Breed.append(breed)
    Confidence.append(confidence)

print(pd.DataFrame({
    'tweet_id': tweet_id,
    'name': name,
    'Model': Model,
    'Breed': Breed,
    'Confidence': Confidence
}))
Output
   tweet_id  name Model   Breed Confidence
0         1  Dave    p1   Irish        0.9
1         2   Max    p1  German       0.99
2         3   Joe    p1     USA       0.83
3         4  Rose    p1   Japan       0.45
4         5  Mark   NaN     NaN        NaN
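Note that the loop above picks the highest confidence across all three models and only blanks a row when no prediction is True, so Max gets German (0.99, prediction False) rather than Nigeria as in the question's expected result. Here is a minimal sketch of one way to restrict the choice to models whose prediction is True. It assumes the sample df from the question; the helper name best_true_prediction and the output column names are my own choices for illustration, not something from the original posts.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Name": ["Dave", "Max", "Joe", "Rose", "Mark"],
    "model1": ["Irish", "German", "USA", "Japan", "China"],
    "confidence1": [0.9, .99, .83, .45, .51],
    "prediction1": [True, False, True, False, False],
    "model2": ["Oman", "Nigeria", "India", "Russia", "Brazil"],
    "confidence2": [0.1, .25, .26, .41, .01],
    "prediction2": [False, True, False, False, False],
    "model3": ["Egypt", "Cameron", "Netherland", "Canada", "Mexcio"],
    "confidence3": [0.01, .23, .12, .34, .61],
    "prediction3": [True, False, True, True, False],
})


def best_true_prediction(row):
    # Hypothetical helper: keep only the models whose prediction flag is True
    candidates = [
        (row[f"confidence{k}"], str(k), row[f"model{k}"])
        for k in range(1, 4)
        if row[f"prediction{k}"]
    ]
    if not candidates:
        # No model predicted True for this row: leave every new column empty
        return pd.Series({"model_name": None, "predicted_gender": None,
                          "confidence": None, "prediction": None})
    # Among the remaining candidates, take the one with the highest confidence
    confidence, model_name, label = max(candidates, key=lambda c: c[0])
    return pd.Series({"model_name": model_name, "predicted_gender": label,
                      "confidence": confidence, "prediction": True})


# Apply row by row and attach the new columns to id and Name
result = df[["id", "Name"]].join(df.apply(best_true_prediction, axis=1))
print(result)
```

This is still a row-wise apply, so it is easy to read but not fast; for a few thousand rows that should not matter.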
Answer 1 (score: 1)
Here is one way to do it:
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1,2,3,4,5],
                   "Name": ["Dave","Max","Joe","Rose","Mark"],
                   "model1":["Irish","German","USA","Japan","China"],
                   "confidence1": [0.9,.99,.83,.45,.51],
                   "prediction1": [True,False,True,False,False],
                   "model2":["Oman","Nigeria","India","Russia","Brazil"],
                   "confidence2": [0.1,.25,.26,.41,.01],
                   "prediction2": [False,True,False,False,False],
                   "model3":["Egypt","Cameron","Netherland","Canada","Mexcio"],
                   "confidence3": [0.01,.23,.12,.34,.61],
                   "prediction3": [True,False,True,True,False]})
df1 = df.copy()
# Tag each model label with its model number, e.g. "Irish" -> "Irish_1"
cols = df1.filter(regex='model').columns
df1[cols] = df1[cols].apply(lambda x: x + "_" + x.index.str[-1], axis=1)
# One (label, confidence, prediction) triple per model: shape (rows, 3, 3)
vals = df1.filter(regex='mod|conf|pred').values.reshape(-1,3,3)
lst = []
for i in vals:
    try:
        # Among the triples whose prediction is True, keep the highest confidence
        lst.append(max([j for j in i if j[2]], key=lambda x: x[1]))
    except ValueError:
        # max() of an empty list: no model predicted True for this row
        lst.append([np.nan])
df1 = df1.join(pd.DataFrame(lst)).drop(columns=df1.filter(regex='mod|conf|pred').columns)
df1.columns = ['id', 'name', 'predicted_gender', 'confidence', 'prediction']
# Split "Irish_1" back into the label and the model number
df1[['predicted_gender','model_name']] = df1['predicted_gender'].str.split('_',expand=True)
print(df1)
   id  name predicted_gender  confidence prediction model_name
0   1  Dave            Irish        0.90       True          1
1   2   Max          Nigeria        0.25       True          2
2   3   Joe              USA        0.83       True          1
3   4  Rose           Canada        0.34       True          3
4   5  Mark              NaN         NaN       None        NaN
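For larger frames the same selection can probably be done without a Python loop. The sketch below is a different technique from the answer above, not a rewrite of it: it masks the confidences of False predictions with -inf in NumPy and takes the argmax per row. It assumes the sample df from the question with exactly three models; the variable names are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Name": ["Dave", "Max", "Joe", "Rose", "Mark"],
    "model1": ["Irish", "German", "USA", "Japan", "China"],
    "confidence1": [0.9, .99, .83, .45, .51],
    "prediction1": [True, False, True, False, False],
    "model2": ["Oman", "Nigeria", "India", "Russia", "Brazil"],
    "confidence2": [0.1, .25, .26, .41, .01],
    "prediction2": [False, True, False, False, False],
    "model3": ["Egypt", "Cameron", "Netherland", "Canada", "Mexcio"],
    "confidence3": [0.01, .23, .12, .34, .61],
    "prediction3": [True, False, True, True, False],
})

ks = range(1, 4)  # assumed: three models, numbered 1..3
conf = df[[f"confidence{k}" for k in ks]].to_numpy(dtype=float)
pred = df[[f"prediction{k}" for k in ks]].to_numpy(dtype=bool)
labels = df[[f"model{k}" for k in ks]].to_numpy(dtype=object)

# Confidences of False predictions become -inf so argmax never prefers them
masked = np.where(pred, conf, -np.inf)
best = masked.argmax(axis=1)      # column position of the winning model per row
has_true = pred.any(axis=1)       # rows with at least one True prediction
rows = np.arange(len(df))

result = df[["id", "Name"]].copy()
# .where(has_true) blanks the rows where no model predicted True
result["model_name"] = pd.Series((best + 1).astype(str), index=df.index).where(has_true)
result["predicted_gender"] = pd.Series(labels[rows, best], index=df.index).where(has_true)
result["confidence"] = pd.Series(conf[rows, best], index=df.index).where(has_true)
result["prediction"] = pd.Series(has_true, index=df.index).where(has_true)
print(result)
```

The -inf mask is what keeps Max on Nigeria (0.25) instead of German (0.99); rows like Mark, where every prediction is False, end up with missing values in all new columns.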
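Answer 2 (score: 0)
The code below adds a new column holding the highest confidence in each row:
df['Confidence'] = df[['p1_conf','p2_conf','p3_conf']].max(axis=1)
You can then drop these 6 columns:
df = df.drop(columns=['p1','p1_conf','p2','p2_conf','p3','p3_conf'])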
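The one-liner above only keeps the highest confidence value, not which model it came from, and it does not look at the prediction flags. As a rough sketch, idxmax can recover the model number as well. Note that the sample df in the question names the columns confidence1, confidence2, confidence3, so I use those here; the best_model column and the slim variable are names I made up.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Name": ["Dave", "Max", "Joe", "Rose", "Mark"],
    "model1": ["Irish", "German", "USA", "Japan", "China"],
    "confidence1": [0.9, .99, .83, .45, .51],
    "prediction1": [True, False, True, False, False],
    "model2": ["Oman", "Nigeria", "India", "Russia", "Brazil"],
    "confidence2": [0.1, .25, .26, .41, .01],
    "prediction2": [False, True, False, False, False],
    "model3": ["Egypt", "Cameron", "Netherland", "Canada", "Mexcio"],
    "confidence3": [0.01, .23, .12, .34, .61],
    "prediction3": [True, False, True, True, False],
})

conf_cols = ["confidence1", "confidence2", "confidence3"]
# Highest confidence per row, ignoring the prediction flags
df["Confidence"] = df[conf_cols].max(axis=1)
# idxmax returns the winning column name; its last character is the model number
df["best_model"] = df[conf_cols].idxmax(axis=1).str[-1]

# Drop the nine per-model columns in one call (del cannot take several names at once)
per_model_cols = df.filter(regex=r"^(model|confidence|prediction)\d$").columns
slim = df.drop(columns=per_model_cols)
print(slim)
```

Because the prediction flags are ignored, this gives Max the German/0.99 entry, so it does not reproduce the expected result from the question on its own.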