Using .iterrows() together with Series.nlargest() to get the highest number in each row of a DataFrame

Time: 2018-08-02 05:14:08

Tags: python pandas dataframe iterator

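I am trying to create a function that uses df.iterrows() and Series.nlargest. I want to iterate over each row, find the largest number, and mark it with a 1. Here is the DataFrame: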

A   B    C
9   6    5
3   7    2

This is the output I am hoping to get:

A    B   C
1    0   0
0    1   0

This is the function I want to use here:

def get_top_n(df, top_n):
    """


    Parameters
    ----------
    df : DataFrame

    top_n : int
        The top number to get
    Returns
    -------
    top_numbers : DataFrame
    Returns the top number marked with a 1

    """
    # Implement Function
    for row in df.iterrows():
        top_numbers = row.nlargest(top_n).sum()

    return top_numbers

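I get the following error: AttributeError: 'tuple' object has no attribute 'nlargest'

Any help on how to rewrite my function in a neater way so that it actually works would be appreciated. Thanks in advance.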

2 answers:

Answer 0 (score: 6)

Add a variable i, because iterrows returns each row's index together with the row as a Series:

for i, row in df.iterrows():
    top_numbers = row.nlargest(top_n).sum()
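
To see why the original loop raised the AttributeError, here is a minimal sketch that inspects the first item yielded by iterrows. The DataFrame is rebuilt from the question's data, and the names first, index_label and row are illustrative only:

```python
import pandas as pd

# Rebuild the DataFrame from the question
df = pd.DataFrame({"A": [9, 3], "B": [6, 7], "C": [5, 2]})

# iterrows yields (index label, row as Series) pairs
first = next(df.iterrows())
index_label, row = first

print(type(first).__name__)            # tuple
print(index_label)                     # 0
print(row.nlargest(1).index.tolist())  # ['A']
```

Calling nlargest on first fails because it is a tuple. Calling it on the unpacked row works because row is a Series.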

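A general solution uses numpy.argsort to get each value's position in descending order, compares those positions with top_n to build a boolean array, and converts it to integers: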

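import pandas as pd

def get_top_n(df, top_n):
    # Check the type first so the comparison below cannot fail on a non-integer
    if not isinstance(top_n, int):
        raise ValueError("Value is not an integer")
    if top_n > len(df.columns):
        raise ValueError("Value is higher than the number of columns")

    # argsort applied twice turns the descending sort order into per-row ranks (0 = largest)
    ranks = (-df.values).argsort(axis=1).argsort(axis=1)
    arr = (ranks < top_n).astype(int)
    return pd.DataFrame(arr, index=df.index, columns=df.columns)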

df1 = get_top_n(df, 2)
print (df1)
   A  B  C
0  1  1  0
1  1  1  0

df1 = get_top_n(df, 1)
print (df1)
   A  B  C
0  1  0  0
1  0  1  0
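
One detail worth checking when ranking with argsort: a single argsort returns the positions that would sort the array, not the rank of each element. Applying argsort a second time converts that ordering into ranks, and it is the ranks that should be compared with top_n. A small sketch on a made-up row (not taken from the question's data):

```python
import numpy as np

# Hypothetical row where the sort order and the ranks differ
row = np.array([5, 9, 6])

order = (-row).argsort()   # positions that sort the row in descending order
ranks = order.argsort()    # rank of each element, 0 = largest

top_1 = (ranks < 1).astype(int)

print(order)   # [1 2 0]
print(ranks)   # [2 0 1]
print(top_1)   # [0 1 0]
```

Comparing order directly with 1 would mark the last element instead of the 9, which is why the ranks are used.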

Edit:

A solution with iterrows is possible, but it is not recommended because it is slow:

top_n = 2
for i, row in df.iterrows():
    top = row.nlargest(top_n).index
    df.loc[i] = 0
    df.loc[i, top] = 1

print (df)
   A  B  C
0  1  1  0
1  1  1  0

Answer 1 (score: 2)

For context, this DataFrame contains about 4 years of stock return data for the S&P 500.

def get_top_n(prev_returns, top_n):

    # generate dataframe populated with zeros for merging
    top_stocks = pd.DataFrame(0, columns = prev_returns.columns, index = prev_returns.index)

    # find top_n largest entries by row
    df = prev_returns.apply(lambda x: x.nlargest(top_n), axis=1)

    # merge dataframes
    top_stocks = top_stocks.merge(df, how = 'right').set_index(df.index)

    # return dataframe replacing non_zero answers with a 1
    return (top_stocks.notnull()) * 1
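
The same apply-based idea can probably be written without the merge step. The sketch below is an assumption about how one might simplify it, not part of the original answer; the function name get_top_n_apply and the sample data (taken from the question) are illustrative:

```python
import pandas as pd

def get_top_n_apply(prev_returns, top_n):
    # Keep only the top_n values in each row; all other cells become NaN
    top = prev_returns.apply(lambda row: row.nlargest(top_n), axis=1)
    # Restore any columns that never made the top_n, in the original order
    top = top.reindex(columns=prev_returns.columns)
    # 1 where a value survived, 0 elsewhere
    return top.notnull().astype(int)

returns = pd.DataFrame({"A": [9, 3], "B": [6, 7], "C": [5, 2]})
result = get_top_n_apply(returns, 1)
print(result)
```

The reindex call plays the role of the merge: it brings back columns that were never among the largest values, so the result has the same shape as the input.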