Transpose a DataFrame and sort

Time: 2016-02-09 19:14:16

Tags: python pandas

I have a df like this (the data represents a matrix):

           Arnston    Berg    Carlson
Arnston    0.00       1.00    2.00
Berg       1.00       0.00    3.00
Carlson    2.00       3.00    0.00

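I want to transpose it so that row names and column names are paired, with their associated value shown in a new column, sorted from smallest to largest. I only need to keep one of each row/column combination, since the two are always identical (e.g. Arnston, Berg == 1.00 and Berg, Arnston == 1.00).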

My desired output is:

Arnston, Arnston   0.00
Berg, Berg         0.00
Carlson, Carlson   0.00
Arnston, Berg      1.00
Arnston, Carlson   2.00
Berg, Carlson      3.00

I hope this makes sense.

3 answers:

Answer 0: (score: 4)

The pandas melt function is well suited to this kind of reshaping.

In:

import pandas as pd

df = df.reset_index()  # Turn the index into a column named 'index'
df = pd.melt(df, id_vars=['index'])  # Reshape to long format
df = df[df['index'] <= df['variable']].sort_values(by='value')  # Drop mirrored pairs, sort
df['col'] = df['index'] + ',' + df['variable']  # Concatenate the two names
df = df[['col', 'value']]  # Drop unneeded columns
df = df.set_index('col')  # Use the concatenated names as the index
df
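For reference, a minimal self-contained sketch of the same steps, run on the sample data from the question. The helper name `pairs_sorted` and the `", "` separator are my own choices for illustration (the separator is picked to match the desired output), and a stable sort is requested so that tied values keep their original order.

```python
import pandas as pd

names = ["Arnston", "Berg", "Carlson"]
df = pd.DataFrame(
    [[0.0, 1.0, 2.0],
     [1.0, 0.0, 3.0],
     [2.0, 3.0, 0.0]],
    index=names, columns=names,
)


def pairs_sorted(frame):
    """Hypothetical helper: one entry per unordered pair, sorted by value."""
    tidy = pd.melt(frame.reset_index(), id_vars=["index"])
    # Keep only one of (a, b) and (b, a); relies on the labels being strings.
    tidy = tidy[tidy["index"] <= tidy["variable"]]
    tidy = tidy.sort_values(by="value", kind="mergesort")  # stable sort
    tidy = tidy.assign(col=tidy["index"] + ", " + tidy["variable"])
    return tidy.set_index("col")["value"]


result = pairs_sorted(df)
print(result)
```

Note that the `<=` filter compares the labels alphabetically, so it assumes row and column labels are the same set of strings.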

Answer 1: (score: 0)

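I'm assuming your matrix is symmetric, so you can use a nested loop to build a list of index pairs and a list of values for the upper triangle of the matrix. Note that the inner loop should start from the current value of the outer loop, so that each pair is visited only once.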

vals = []
idx = []
for i in range(df.shape[0]):
    for j in range(i, df.shape[1]):
        idx.append((df.index[i], df.columns[j]))
        vals.append(df.iat[i, j])
>>> pd.Series(vals, index=idx)
(Arnston, Arnston)    0
(Arnston, Berg)       1
(Arnston, Carlson)    2
(Berg, Berg)          0
(Berg, Carlson)       3
(Carlson, Carlson)    0
dtype: float64
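The result above is in row order rather than sorted by value, and it depends on the symmetry assumption. A small sketch, assuming the sample data from the question, that first checks symmetry with `numpy.allclose` and then sorts the result; the joined string labels and the variable names `labels` and `s` are my own choices for illustration.

```python
import numpy as np
import pandas as pd

names = ["Arnston", "Berg", "Carlson"]
df = pd.DataFrame(
    [[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [2.0, 3.0, 0.0]],
    index=names, columns=names,
)

# Verify the symmetry assumption before discarding the lower triangle.
assert np.allclose(df.values, df.values.T)

vals = []
labels = []
for i in range(df.shape[0]):
    for j in range(i, df.shape[1]):  # inner loop starts at the outer index
        labels.append(df.index[i] + ", " + df.columns[j])
        vals.append(df.iat[i, j])

# Stable sort so that equal values keep their original relative order.
s = pd.Series(vals, index=labels).sort_values(kind="mergesort")
print(s)
```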

Some timing comparisons:

dfc = df.copy()

# Nested loop.
%%timeit
vals = []
idx = []
for i in range(dfc.shape[0]):
    for j in range(i, dfc.shape[1]):
        idx.append((dfc.index[i], dfc.columns[j]))
        vals.append(dfc.iat[i, j])
pd.Series(vals, index=idx)
1000 loops, best of 3: 187 µs per loop

# Melt.
%%timeit
df = dfc.reset_index()
df = pd.melt(df,id_vars=['index'])
df = df[df['index']<=df['variable']].sort_values(by='value')
df['col'] = df['index'] + ',' + df['variable']
df = df[['col','value']]
df = df.set_index('col')
100 loops, best of 3: 3.39 ms per loop

For a larger 100x100 symmetric matrix the timings are reversed, and melt beats the nested loop by roughly a factor of ten:

df = pd.DataFrame(np.random.randn(100, 100))
for i in range(df.shape[0]):
    df.iat[i, i] = 1
    for j in range(i + 1, df.shape[1]):
        df.iat[i, j] = df.iat[j, i]
df.columns = df.index = ['col_' + str(i) for i in range(100)]
dfc = df.copy()

# nested loop:
10 loops, best of 3: 55.2 ms per loop

# melt:
100 loops, best of 3: 5.72 ms per loop
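The 100x100 test matrix above is filled in with a Python loop. As a sketch, the same kind of matrix (lower triangle mirrored onto the upper triangle, ones on the diagonal) could also be built with numpy alone. The fixed seed and the names `rng`, `lower` and `sym` are assumptions of mine and not part of the original benchmark.

```python
import numpy as np
import pandas as pd

n = 100
rng = np.random.RandomState(0)  # fixed seed, an arbitrary choice
a = rng.randn(n, n)

# Mirror the strict lower triangle onto the upper triangle.
lower = np.tril(a, k=-1)
sym = lower + lower.T
np.fill_diagonal(sym, 1.0)  # same diagonal value as the loop version

labels = ["col_" + str(i) for i in range(n)]
dfc = pd.DataFrame(sym, index=labels, columns=labels)
```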

Answer 2: (score: 0)

Here is one that uses numpy:

%%timeit
df = pd.DataFrame([['Arnston', 0.0, 1.0, 2.0],
               ['Berg', 1.0, 0.0, 3.0],
               ['Carlson', 2.0, 3.0, 0.0]],
                columns=['Name','Arnston','Berg','Carlson'])

df.set_index('Name', inplace=True)

upper = np.triu_indices_from(df.values)  # indices of the upper triangle (as_matrix() was removed in pandas 1.0)
vals = df.values[upper]  # values at those indices
idx = [(df.index[i], df.columns[j]) for i,j in zip(upper[0],upper[1])]

# w/ numpy
1000 loops, best of 3: 810 µs per loop

Result:

In [11]: pd.Series(vals, index=idx)
Out[11]:    
        (Arnston, Arnston)    0
        (Arnston, Berg)       1
        (Arnston, Carlson)    2
        (Berg, Berg)          0
        (Berg, Carlson)       3
        (Carlson, Carlson)    0
        dtype: float64

When you run it on Alexander's larger dfc:

%%timeit
upper = np.triu_indices_from(dfc.values)  # indices of the upper triangle (as_matrix() was removed in pandas 1.0)
vals = dfc.values[upper]  # values at those indices
idx = [(dfc.index[i], dfc.columns[j]) for i,j in zip(upper[0],upper[1])]

100 loops, best of 3: 15.3 ms per loop

Not as fast as melt.
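The numpy version builds its labels with a Python-level list comprehension. As an untimed sketch, the labels can instead be selected by indexing `df.index` and `df.columns` with the triangle indices, and the result sorted as the question asks. The use of `pd.MultiIndex.from_arrays` and the names `rows`, `cols` and `s` are my own choices; I have not measured whether this changes the timings above.

```python
import numpy as np
import pandas as pd

names = ["Arnston", "Berg", "Carlson"]
df = pd.DataFrame(
    [[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [2.0, 3.0, 0.0]],
    index=names, columns=names,
)

values = df.to_numpy()
rows, cols = np.triu_indices_from(values)  # upper triangle incl. diagonal

# Select the labels with integer arrays instead of looping over (i, j) pairs.
index = pd.MultiIndex.from_arrays([df.index[rows], df.columns[cols]])

# Stable sort so that equal values keep their original relative order.
s = pd.Series(values[rows, cols], index=index).sort_values(kind="mergesort")
print(s)
```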