Pandas DataFrame column gets corrupted when computing an added column

Time: 2017-01-13 18:51:56

Tags: python pandas

I have a dataset with the following columns and rows:

Scored Probabilities for Class "1"  Scored Probabilities for Class "2"  Scored Probabilities for Class "3"  Scored Labels
0.258471                0.009299                0.005433                1
0.154108                0.009577                0.527308                3
0.001949                0.634572                0.000953                2

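(In reality there are 17 classes, but I have simplified this post to 3.)

I would like to add an extra column named "Scored Label Probability", which is the maximum of the first three columns (in reality, the maximum of all the columns named 'Scored Probabilities for Class "X"'). So the result should look like this: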

                                        Scored Label Probability (new)
0.258471    0.009299    0.005433    1   0.258471
0.154108    0.009577    0.527308    3   0.527308
0.001949    0.634572    0.000953    2   0.634572

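Here is my code (below). Unfortunately, the "Scored Labels" column (column 4 in the sample data) gets corrupted (its values are replaced by different integers). Any suggestions on how to fix it? Thanks.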

# The script MUST contain a function named azureml_main
# which is the entry point for this module.

import pandas as pd
import numpy as np

# The entry point function can contain up to two input arguments:
#   Param<dataframe1>: a pandas.DataFrame
#   Param<dataframe2>: a pandas.DataFrame
def azureml_main(df = None, df2 = None):

    # First add the empty column
    df['Scored Label Probability'] = 0.0

    for rowindex, row in df.iterrows():
        max_probability =0.0
        column_value = 0.0
        column_name = ''
        for column_name, column_value in row.iteritems():
            if column_name.startswith('Scored Probabilities for Class'):
                if column_value>max_probability:
                    max_probability = column_value

        # print (max_probability,max_prob_column_name)
        df.set_value(rowindex,'Scored Label Probability',max_probability)

    # Return value must be of a sequence of pandas.DataFrame
    return df

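1 Answer:

Answer 0: (score: 3)

You can use the DF.max method along axis=1 (across the columns of each row), which gives you the highest value among all the columns matching the given string (selected using the DF.filter method). Note that filter(like=...) matches the string anywhere in the column name; here it happens to be the common prefix: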

df.filter(like='Scored Probabilities for Class').max(axis=1)

0    0.258471
1    0.527308
2    0.634572
dtype: float64
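To connect this back to the question, here is a minimal sketch of how the result could be assigned as the new column inside azureml_main. The sample DataFrame is rebuilt from the three rows in the question purely for illustration. Because no row-by-row write takes place, the "Scored Labels" column is never touched.

```python
import pandas as pd

def azureml_main(df=None, df2=None):
    # Select only the per-class probability columns (substring match on the name)
    prob_cols = df.filter(like='Scored Probabilities for Class')
    # Row-wise maximum across those columns, stored as the new column
    df['Scored Label Probability'] = prob_cols.max(axis=1)
    # Returned the same way as in the question's script
    return df

# Illustrative data copied from the question (3 classes instead of 17)
sample = pd.DataFrame({
    'Scored Probabilities for Class "1"': [0.258471, 0.154108, 0.001949],
    'Scored Probabilities for Class "2"': [0.009299, 0.009577, 0.634572],
    'Scored Probabilities for Class "3"': [0.005433, 0.527308, 0.000953],
    'Scored Labels': [1, 3, 2],
})

result = azureml_main(sample)
print(result[['Scored Labels', 'Scored Label Probability']])
```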

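To do the same in R, you can use the pmax function, which returns the parallel maxima of the columns that start with the specified prefix.

Additionally, with the dplyr package, we can let select do the subsetting, with the help of a string helper such as starts_with, to perform the equivalent of the filter step above.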

library(dplyr)
df$max <- do.call(pmax, select(df, starts_with('Scored Probabilities for Class')))
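For a pandas counterpart of starts_with that matches strictly on the prefix (rather than on a substring, as filter(like=...) does), one option is a boolean mask built from the column names. This is only a sketch: the variable names prefix and is_prob_col are illustrative, and the data is again copied from the question.

```python
import pandas as pd

# Illustrative data copied from the question
df = pd.DataFrame({
    'Scored Probabilities for Class "1"': [0.258471, 0.154108, 0.001949],
    'Scored Probabilities for Class "2"': [0.009299, 0.009577, 0.634572],
    'Scored Probabilities for Class "3"': [0.005433, 0.527308, 0.000953],
    'Scored Labels': [1, 3, 2],
})

prefix = 'Scored Probabilities for Class'
# Boolean array: True for columns whose name begins with the prefix
is_prob_col = df.columns.str.startswith(prefix)
# Row-wise maximum over the selected columns only
df['Scored Label Probability'] = df.loc[:, is_prob_col].max(axis=1)
print(df['Scored Label Probability'])
```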