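I'm trying to write a function that uses df.iterrows() and Series.nlargest. I want to iterate over each row, find the largest number, and mark it with a 1. This is the DataFrame: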
A B C
9 6 5
3 7 2
This is the output I'm hoping to get:
A B C
1 0 0
0 1 0
This is the function I'm trying to use:
def get_top_n(df, top_n):
    """
    Parameters
    ----------
    df : DataFrame
    top_n : int
        The top number to get

    Returns
    -------
    top_numbers : DataFrame
        Returns the top number marked with a 1
    """
    # Implement Function
    for row in df.iterrows():
        top_numbers = row.nlargest(top_n).sum()

    return top_numbers
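I get the following error: AttributeError: 'tuple' object has no attribute 'nlargest'
Any help on how to rewrite my function more neatly, and get it to actually work, would be appreciated. Thanks in advance.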
Answer 0 (score: 6)
Add an i variable, because iterrows yields each row as an (index, Series) tuple:
for i, row in df.iterrows():
    top_numbers = row.nlargest(top_n).sum()
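To see why the unpacking matters, here is a minimal sketch using the DataFrame from the question. The names first_item and row_sums are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [9, 3], "B": [6, 7], "C": [5, 2]})

# Each item produced by iterrows() is an (index, Series) tuple,
# which is why calling .nlargest on it directly raises AttributeError.
first_item = next(df.iterrows())
print(type(first_item))

row_sums = {}
for i, row in df.iterrows():
    # After unpacking, row is a Series, so nlargest is available.
    row_sums[i] = int(row.nlargest(1).sum())

print(row_sums)
```

This only fixes the error. It still returns sums rather than the 0/1 marks asked for in the question.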
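A general solution uses numpy.argsort to find each value's position in descending order, then compares those positions against top_n to get a boolean array and converts it to integers: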
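import pandas as pd

def get_top_n(df, top_n):
    # Validate the type first so the comparison below is meaningful
    if not isinstance(top_n, int):
        raise ValueError("Value is not integer")
    if top_n > len(df.columns):
        raise ValueError("Value is higher than number of columns")
    # argsort of the negated values gives the descending column order;
    # a second argsort turns that order into each value's rank (0 = largest)
    ranks = (-df.values).argsort(axis=1).argsort(axis=1)
    arr = (ranks < top_n).astype(int)
    return pd.DataFrame(arr, index=df.index, columns=df.columns)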
df1 = get_top_n(df, 2)
print (df1)
A B C
0 1 1 0
1 1 1 0
df1 = get_top_n(df, 1)
print (df1)
A B C
0 1 0 0
1 0 1 0
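One detail worth checking with the argsort approach: a single argsort returns the column order that would sort the row, which is not the same thing as each column's rank. The two happen to coincide for the sample data, but not in general. A small sketch with a made-up row (the values 5, 9, 6 are my own example, not from the question):

```python
import numpy as np

values = np.array([[5, 9, 6]])

# argsort answers "which column holds the k-th largest value?"
order = (-values).argsort(axis=1)        # [[1, 2, 0]]

# argsort of that order answers "what is each column's rank?" (0 = largest)
ranks = order.argsort(axis=1)            # [[2, 0, 1]]

# Comparing the order against top_n marks the wrong column here,
# while comparing the ranks marks the column holding 9.
naive_mask = (order < 1).astype(int)     # [[0, 0, 1]]
rank_mask = (ranks < 1).astype(int)      # [[0, 1, 0]]

print(naive_mask)
print(rank_mask)
```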
Edit:
A solution with iterrows is possible, but not recommended because it is slow:
top_n = 2
for i, row in df.iterrows():
    top = row.nlargest(top_n).index
    df.loc[i] = 0
    df.loc[i, top] = 1
print (df)
A B C
0 1 1 0
1 1 1 0
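If you want to avoid the loop and leave the original DataFrame untouched, one option is DataFrame.rank. This is a sketch rather than a drop-in replacement: mark_top_n is a hypothetical name, and method="first" is my assumption for breaking ties by column order so that exactly top_n columns are marked per row.

```python
import pandas as pd

def mark_top_n(df, top_n):
    # Rank each row's values in descending order (1 = largest);
    # method="first" breaks ties by column position.
    ranks = df.rank(axis=1, ascending=False, method="first")
    # Mark the top_n ranks with 1 and everything else with 0.
    return (ranks <= top_n).astype(int)

df = pd.DataFrame({"A": [9, 3], "B": [6, 7], "C": [5, 2]})
top_1 = mark_top_n(df, 1)
top_2 = mark_top_n(df, 2)

print(top_1)
print(top_2)
```

Unlike the iterrows loop, this returns a new DataFrame instead of overwriting df.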
Answer 1 (score: 2)
For context, this DataFrame contains about 4 years of stock return data for the S&P 500.
def get_top_n(prev_returns, top_n):
    # generate dataframe populated with zeros for merging
    top_stocks = pd.DataFrame(0, columns=prev_returns.columns, index=prev_returns.index)
    # find top_n largest entries by row
    df = prev_returns.apply(lambda x: x.nlargest(top_n), axis=1)
    # merge dataframes
    top_stocks = top_stocks.merge(df, how='right').set_index(df.index)
    # return dataframe replacing non_zero answers with a 1
    return (top_stocks.notnull()) * 1
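As far as I can tell, the merge here mainly brings back columns that never appear in any row's top n, since the apply result only contains columns that were selected at least once. Assuming that reading is right, the same idea can be sketched with reindex instead of a merge. The function name get_top_n_reindex and the small returns frame are hypothetical stand-ins, not the S&P 500 data.

```python
import pandas as pd

def get_top_n_reindex(prev_returns, top_n):
    # Keep only the top_n largest entries in each row; other cells become NaN,
    # and columns that are never selected are missing from the result.
    top = prev_returns.apply(lambda row: row.nlargest(top_n), axis=1)
    # Restore the original column set and order; restored columns are all NaN.
    top = top.reindex(columns=prev_returns.columns)
    # Non-null cells are the selected ones, so mark them with 1.
    return top.notna().astype(int)

returns = pd.DataFrame({"A": [9, 3], "B": [6, 7], "C": [5, 2]})
marked = get_top_n_reindex(returns, 1)
print(marked)
```

This assumes the input has no missing values in the cells that nlargest selects.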