I have a df like this (the data represents a matrix):
Arnston Berg Carlson
Arnston 0.00 1.00 2.00
Berg 1.00 0.00 3.00
Carlson 2.00 3.00 0.00
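I want to reshape it so that each row name is paired with each column name and the associated value appears as a new column, sorted from smallest to largest. I only need to keep one of each row/column combination, because the two are always identical (e.g. Arnston,Berg == 1.00 and Berg,Arnston == 1.00).
My desired output is: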
Arnston, Arnston 0.00
Berg, Berg 0.00
Carlson, Carlson 0.00
Arnston, Berg 1.00
Arnston, Carlson 2.00
Berg, Carlson 3.00
I hope this makes sense.
Answer 0 (score: 4)
The pandas melt function is great for this kind of thing.
In:
import pandas as pd

df = df.reset_index()  # Turn the (unnamed) index into a column called 'index'
df = pd.melt(df, id_vars=['index'])  # Reshape to long format: index, variable, value
df = df[df['index'] <= df['variable']].sort_values(by='value')  # Keep one of each symmetric pair, then sort
df['col'] = df['index'] + ',' + df['variable']  # Concatenate the two names
df = df[['col', 'value']]  # Drop the columns that are no longer needed
df = df.set_index('col')  # Use the concatenated names as the index
df
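For anyone who wants to try the steps above end to end, here is a minimal sketch on the sample data from the question. The function name `melt_upper`, the axis names `row` and `col`, and the extra tie-breaking sort keys are my own assumptions, not part of the answer. Note that the `<=` filter compares the labels as strings, so it keeps one copy of each pair regardless of the order of the rows.

```python
import pandas as pd

# Sample data from the question: a symmetric matrix with labelled rows/columns.
names = ["Arnston", "Berg", "Carlson"]
df = pd.DataFrame(
    [[0.0, 1.0, 2.0],
     [1.0, 0.0, 3.0],
     [2.0, 3.0, 0.0]],
    index=names,
    columns=names,
)


def melt_upper(frame):
    """Return one 'row, col' entry per symmetric pair, sorted by value.

    Hypothetical helper: the names 'row' and 'col' are illustrative only.
    """
    # Name the index so reset_index() yields a predictable column name.
    long = frame.rename_axis("row").reset_index().melt(id_vars="row", var_name="col")
    # Keep a single copy of each pair by comparing the labels lexicographically.
    long = long[long["row"] <= long["col"]]
    # Sort by value; the extra keys only make the order of ties deterministic.
    long = long.sort_values(by=["value", "row", "col"])
    labels = (long["row"] + ", " + long["col"]).to_numpy()
    return pd.Series(long["value"].to_numpy(), index=labels, name="value")


result = melt_upper(df)
print(result)
```

The result is a Series rather than a one-column DataFrame, which seemed simpler for a lookup by pair; `result.to_frame()` converts it if a DataFrame is preferred.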
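Answer 1 (score: 0)
I'm assuming your matrix is symmetric, so you can use nested loops to build a list of index labels and a list of values for the upper triangle of the matrix. Note that the inner loop should start from the current value of the outer loop, so that each pair is visited only once.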
vals = []
idx = []
for i in range(df.shape[0]):
    for j in range(i, df.shape[1]):  # start at i to stay in the upper triangle
        idx.append((df.index[i], df.columns[j]))
        vals.append(df.iat[i, j])
>>> pd.Series(vals, index=idx)
(Arnston, Arnston) 0
(Arnston, Berg) 1
(Arnston, Carlson) 2
(Berg, Berg) 0
(Berg, Carlson) 3
(Carlson, Carlson) 0
dtype: float64
Some timing comparisons:
dfc = df.copy()
# Nested loop.
%%timeit
vals = []
idx = []
for i in range(dfc.shape[0]):
    for j in range(i, dfc.shape[1]):
        idx.append((dfc.index[i], dfc.columns[j]))
        vals.append(dfc.iat[i, j])
pd.Series(vals, index=idx)
1000 loops, best of 3: 187 µs per loop
# Melt.
%%timeit
df = dfc.reset_index()
df = pd.melt(df, id_vars=['index'])
df = df[df['index'] <= df['variable']].sort_values(by='value')
df['col'] = df['index'] + ',' + df['variable']
df = df[['col','value']]
df = df.set_index('col')
100 loops, best of 3: 3.39 ms per loop
With a larger 100x100 symmetric matrix the timings are reversed, and melt clearly beats the nested loop:
df = pd.DataFrame(np.random.randn(100, 100))
for i in range(df.shape[0]):
    df.iat[i, i] = 1  # fixed value on the diagonal
    for j in range(i + 1, df.shape[1]):
        df.iat[i, j] = df.iat[j, i]  # mirror the lower triangle into the upper one
df.columns = df.index = ['col_' + str(i) for i in range(100)]
dfc = df.copy()
# nested loop:
10 loops, best of 3: 55.2 ms per loop
# melt:
100 loops, best of 3: 5.72 ms per loop
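As a side note, the symmetric test matrix above is filled element by element. A vectorized sketch of a comparable setup is shown below. It mirrors the upper triangle instead of the lower one, and the seed and the use of `np.random.default_rng` are my own assumptions, so the random values will not match the ones used for the timings above.

```python
import numpy as np
import pandas as pd

n = 100
rng = np.random.default_rng(0)  # arbitrary seed, chosen only for repeatability
a = rng.standard_normal((n, n))

# Keep the strictly upper triangle (zeros elsewhere) and mirror it.
upper = np.triu(a, k=1)
sym = upper + upper.T

# Put a fixed value on the diagonal, as in the loop version.
np.fill_diagonal(sym, 1.0)

labels = ["col_" + str(i) for i in range(n)]
df = pd.DataFrame(sym, index=labels, columns=labels)
dfc = df.copy()
```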
Answer 2 (score: 0)
Here is an approach using numpy:
%%timeit
df = pd.DataFrame([['Arnston', 0.0, 1.0, 2.0],
                   ['Berg', 1.0, 0.0, 3.0],
                   ['Carlson', 2.0, 3.0, 0.0]],
                  columns=['Name', 'Arnston', 'Berg', 'Carlson'])
df.set_index('Name', inplace=True)
upper = np.triu_indices_from(df.values)  # indices of the upper triangle
vals = df.values[upper]  # values at those indices
idx = [(df.index[i], df.columns[j]) for i, j in zip(upper[0], upper[1])]
# w/ numpy
1000 loops, best of 3: 810 µs per loop
Result:
In [11]: pd.Series(vals, index=idx)
Out[11]:
(Arnston, Arnston) 0
(Arnston, Berg) 1
(Arnston, Carlson) 2
(Berg, Berg) 0
(Berg, Carlson) 3
(Carlson, Carlson) 0
dtype: float64
Running the same thing on Alexander's larger dfc:
%%timeit
upper = np.triu_indices_from(dfc.values)  # indices of the upper triangle
vals = dfc.values[upper]  # values at those indices
idx = [(dfc.index[i], dfc.columns[j]) for i, j in zip(upper[0], upper[1])]
100 loops, best of 3: 15.3 ms per loop
Not as fast as melt.
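Most of the time in the numpy version probably goes into the Python-level list comprehension that builds the labels. Below is a sketch that looks the labels up with array indexing instead and also sorts by value, as the question asked. The helper name `upper_pairs` is hypothetical, I have not timed this against melt, and it assumes a square frame.

```python
import numpy as np
import pandas as pd

names = ["Arnston", "Berg", "Carlson"]
df = pd.DataFrame(
    [[0.0, 1.0, 2.0],
     [1.0, 0.0, 3.0],
     [2.0, 3.0, 0.0]],
    index=names,
    columns=names,
)


def upper_pairs(frame):
    """Upper-triangle values of a square frame as a sorted Series.

    Hypothetical helper, not taken from any of the answers above.
    """
    data = frame.to_numpy()
    rows, cols = np.triu_indices_from(data)  # positions on and above the diagonal
    # Index objects accept integer arrays, so no Python loop is needed here.
    index = pd.MultiIndex.from_arrays([frame.index[rows], frame.columns[cols]])
    # mergesort is stable, so equal values keep their row-major order.
    return pd.Series(data[rows, cols], index=index).sort_values(kind="mergesort")


s = upper_pairs(df)
print(s)
```

The index here is a MultiIndex of (row, column) pairs, so a single value can be read with `s[("Arnston", "Berg")]`.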