I have a dataset with the following columns and rows:
Scored Probabilities for Class "1" Scored Probabilities for Class "2" Scored Probabilities for Class "3" Scored Labels
0.258471 0.009299 0.005433 1
0.154108 0.009577 0.527308 3
0.001949 0.634572 0.000953 2
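(In reality there are 17 "Classes", but I have simplified this post to 3.)
I want to add an extra column named "Scored Label Probability", which is the maximum of the first three columns (in reality, the maximum of all columns named Scored Probabilities for Class "X"). So the result should look like this: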
Scored Label Probability (new)
0.258471 0.009299 0.005433 1 0.258471
0.154108 0.009577 0.527308 3 0.527308
0.001949 0.634572 0.000953 2 0.634572
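This is my code (below). Unfortunately, the "Scored Labels" column (column 4 in the sample data) gets corrupted (its values are replaced by different integers). Any suggestions on how to fix it? Thanks.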
# The script MUST contain a function named azureml_main
# which is the entry point for this module.
import pandas as pd
import numpy as np
# The entry point function can contain up to two input arguments:
#   Param<dataframe1>: a pandas.DataFrame
#   Param<dataframe2>: a pandas.DataFrame
def azureml_main(df = None, df2 = None):
    # First add the empty column
    df['Scored Label Probability'] = 0.0
    for rowindex, row in df.iterrows():
        max_probability = 0.0
        column_value = 0.0
        column_name = ''
        for column_name, column_value in row.iteritems():
            if column_name.startswith('Scored Probabilities for Class'):
                if column_value > max_probability:
                    max_probability = column_value
                    # print (max_probability,max_prob_column_name)
        df.set_value(rowindex, 'Scored Label Probability', max_probability)
    # Return value must be of a sequence of pandas.DataFrame
    return df
Answer 0 (score: 3)
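You can use the DF.max method with axis=1 (taking the maximum across the columns of each row). Applied to the columns selected with DF.filter, it gives you the highest value among all columns whose name matches the given string (note that like= matches any column name containing the string, which here is the same set as the columns starting with it):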
df.filter(like='Scored Probabilities for Class').max(axis=1)
0 0.258471
1 0.527308
2 0.634572
dtype: float64
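To add that result as the new column without touching "Scored Labels", the Series can be assigned directly instead of looping over rows. The following is a minimal sketch using the sample data from the question; the helper name add_scored_label_probability is hypothetical and not part of the Azure ML template, and it assumes the probability columns are the only ones whose names contain the prefix.

```python
import pandas as pd

PREFIX = 'Scored Probabilities for Class'

def add_scored_label_probability(df):
    """Return a copy of df with the row-wise maximum of the probability columns.

    Hypothetical helper for illustration; the name is not from the original post.
    """
    out = df.copy()
    # Select only the probability columns, then take the maximum of each row.
    out['Scored Label Probability'] = out.filter(like=PREFIX).max(axis=1)
    return out

# Sample data copied from the question (3 of the 17 classes).
sample = pd.DataFrame({
    'Scored Probabilities for Class "1"': [0.258471, 0.154108, 0.001949],
    'Scored Probabilities for Class "2"': [0.009299, 0.009577, 0.634572],
    'Scored Probabilities for Class "3"': [0.005433, 0.527308, 0.000953],
    'Scored Labels': [1, 3, 2],
})

result = add_scored_label_probability(sample)
print(result[['Scored Labels', 'Scored Label Probability']])
```

Because the assignment is vectorized, no row is ever converted to a single-dtype Series the way iterrows does, so the integer "Scored Labels" column keeps its values and dtype. Inside azureml_main, the same two lines could replace the loop.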
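To do the same in R, you can use the pmax function, which returns the parallel (row-wise) maxima of the columns that start with the specified prefix.
In addition, with the dplyr package we can let select do the subsetting, with the help of string helpers such as starts_with, to get the equivalent of the filter operation above.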
library(dplyr)
df$max <- do.call(pmax, select(df, starts_with('Scored Probabilities for Class')))