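Suppose I have a 750x750 matrix stored in a DataFrame, say df.
I want to find the columns containing the 4 highest values in each row, which I can do easily with:
a = df.values             # underlying NumPy array; the column labels are lost here
a.sort(axis=1)            # sort each row in place, ascending
sorted_table = a[:,-4::]  # last 4 columns hold the 4 largest values per row
b = a[:,::-1]             # reverse each row so values run largest to smallest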
[[ 98. 29. 15. 10.]
[ 93. 91. 75. 60.]
[ 48. 21. 17. 10.]
.
.
.
...]
However, the result I get is just a plain array, with no index and no column names.
If I want to know which column name each sorted value refers to, what should I do?
I want to display:
df=
c1 c512 c20 c57 c310
c2 c317 c133 c584 c80
c3 c499 c289 c703 c100
. . . . ... .
. . . . ... .
. . . . ... .
c750 c89 c31 c546 c107
where
c512 is referring to 98
c20 is referring to 29
c57 is referring to 15
and so on.
Answer 0 (score: 1)
I doubt this is the best answer, but I think it works. I hate using for
loops in pandas, but I couldn't think of a purely pandas way to do it.
import pandas as pd
import numpy as np
#array_size = 10
#--- Generate Data and create toy Dataframe ---
array_size = 750
np.random.seed(1)
data = np.random.randint(0, 1000000, array_size**2)
data = data.reshape((array_size, array_size))
df = pd.DataFrame(data, columns=['c'+str(i) for i in range(1, (array_size)+1)])
df.index = df.columns
#--- Transpose the dataframe to more familiarly sort by columns instead of rows ---
df = df.T
#--- Rank values in dataframe using max method where highest value is rank 1 ---
df = df.rank(method='max', ascending=False)
#--- Create empty dataframe to put data into ---
new_df = pd.DataFrame()
#--- For loop for each column to get top ranks less than 5, sort them, reset index, drop i column
for i in df.columns:
    s = df[i][df[i] < 5].sort_values().reset_index().drop(i, axis=1)
    new_df = pd.concat([new_df, s.T])
#--- The new_df index will say 'index', this reassigns the transposed column names to new_df's index
new_df.index = df.columns
print(new_df)
Output:
0 1 2 3
c1 c479 c545 c614 c220
c2 c249 c535 c231 c680
c3 c657 c603 c137 c740
c4 c674 c424 c426 c127
... ... ... ... ...
c747 c251 c536 c321 c296
c748 c55 c383 c437 c103
c749 c138 c495 c299 c295
c750 c178 c556 c491 c445
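For comparison, a loop-free variant is possible with NumPy's argsort. The following is only a minimal sketch and not the method used in the answer above: the helper name top_n_columns, the top1..topN column labels, and the toy numbers are made up for illustration, and it assumes every column is numeric (signed integers or floats).

```python
import numpy as np
import pandas as pd


def top_n_columns(df: pd.DataFrame, n: int = 4):
    """Return (names, values): per row, the n column labels holding the
    largest values and the values themselves, ordered largest to smallest."""
    values = df.to_numpy()
    # Sorting the negated array orders each row from largest to smallest;
    # kind="stable" keeps the leftmost column first when values tie.
    order = np.argsort(-values, axis=1, kind="stable")[:, :n]
    cols = [f"top{k}" for k in range(1, n + 1)]  # hypothetical labels
    names = pd.DataFrame(df.columns.to_numpy()[order], index=df.index, columns=cols)
    top_values = pd.DataFrame(
        np.take_along_axis(values, order, axis=1), index=df.index, columns=cols
    )
    return names, top_values


# Small toy frame (made-up numbers) standing in for the 750x750 case.
toy = pd.DataFrame(
    [[10, 98, 29, 15, 3], [93, 5, 91, 60, 75], [17, 48, 2, 21, 10]],
    index=["c1", "c2", "c3"],
    columns=["c1", "c2", "c3", "c4", "c5"],
)
names, top_values = top_n_columns(toy, n=4)

# Pair each column name with the value it refers to, for the first row.
for col, val in zip(names.loc["c1"], top_values.loc["c1"]):
    print(f"{col} is referring to {val}")
```

Because names and top_values share the same index and column layout, a name and its value can be looked up at the same position. Unlike the rank-based loop, this sketch always returns exactly n columns per row, even when values tie.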