I am trying to expand a list in a column into additional rows so that I can feed the data into a swarmplot.
Right now, I build a dictionary of lists:
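import numpy as np
import pandas as pd
# store all lists of metrics, one list per classifier
clf_aucs = dict()
for clf_id in range(5):
    _list = np.random.uniform(size=500)  # build dummy list of floats
    clf_aucs[f"clf{clf_id}"] = _list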
Say this dictionary has 5 keys, each holding a list of 500 floats. Next, I create a DataFrame:
clf_aucs_df = pd.DataFrame(clf_aucs).transpose()
clf_aucs_df = clf_aucs_df.reset_index()
display(clf_aucs_df.head())
print(clf_aucs_df.shape)
The result looks like this:
index 0 1 2 3 4 5 6 7 8 ... 490 491 492 493 494 495 496 497 498 499
0 clf0 0.432609 0.398760 0.292517 0.411905 0.385375 0.390023 0.364286 0.364035 0.450000 ... 0.477273 0.355372 0.378000 0.386667 0.396104 0.395085 0.426667 0.461957 0.402746 0.445238
1 clf1 0.432900 0.231602 0.416149 0.365217 0.414286 0.461039 0.325217 0.357143 0.447826 ... 0.402893 0.323913 0.420949 0.434783 0.372294 0.360417 0.410208 0.420949 0.392857 0.343685
2 clf2 0.322314 0.400000 0.409524 0.405797 0.466942 0.383399 0.478261 0.405896 0.432892 ... 0.371542 0.494318 0.493750 0.415238 0.414079 0.400433 0.402778 0.493478 0.478261 0.458498
3 clf3 0.509921 0.579051 0.545455 0.658103 0.576560 0.500000 0.515810 0.505682 0.525880 ... 0.590909 0.553360 0.409938 0.462585 0.584348 0.575397 0.472332 0.513834 0.587500 0.612500
4 clf4 0.474206 0.490451 0.479437 0.593750 0.545455 0.580357 0.484127 0.596273 0.537549 ... 0.665909 0.545351 0.609375 0.556277 0.531522 0.511905 0.583851 0.543478 0.513889 0.583333
5 rows × 501 columns
My question is how to merge columns 0-499 so that the new DataFrame is 2500 rows x 2 columns, with an ID column and a numerical column.
Other attempts:
Answer 0 (score: 2)
I believe what you are looking for is pd.melt:
import numpy as np
import pandas as pd
# recreate DataFrame from example
clf_aucs = dict()
for id_ in range(5):
clf_aucs[f"clf{id_}"] = np.random.uniform(size=(500, ))
clf_aucs_df = pd.DataFrame(clf_aucs).T.reset_index().rename(
columns={"index": "ID"})
# melt DataFrame
clf_aucs_df = pd.melt(clf_aucs_df, id_vars="ID", value_name="Numerical_Column")
# drop what were the column names prior to reshaping the DataFrame
clf_aucs_df.drop(columns="variable", inplace=True)
# sort first on ID and then on Numerical_Column
clf_aucs_df.sort_values(["ID", "Numerical_Column"], inplace=True)
# reindex from 0
clf_aucs_df.reset_index(drop=True, inplace=True)
The input is:
ID 0 1 2 ... 496 497 498 499
0 clf0 0.647251 0.976586 0.675573 ... 0.911264 0.983211 0.685464 0.519285
1 clf1 0.034560 0.340834 0.443456 ... 0.412356 0.968721 0.833882 0.634775
2 clf2 0.723530 0.087285 0.014977 ... 0.563904 0.962543 0.860245 0.679423
3 clf3 0.863781 0.609096 0.214915 ... 0.382548 0.798677 0.196336 0.673109
4 clf4 0.185867 0.006018 0.635887 ... 0.622308 0.802546 0.771671 0.536761
and the output is:
ID Numerical_Column
0 clf0 0.000779
1 clf0 0.001084
2 clf0 0.001478
3 clf0 0.004019
4 clf0 0.004034
... ... ...
2495 clf4 0.996943
2496 clf4 0.998093
2497 clf4 0.998384
2498 clf4 0.999620
2499 clf4 0.999668
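If the original dictionary is still available, the transpose step can probably be skipped, because melt can take the dictionary keys (which become the column labels) as the ID directly. A minimal sketch, assuming the same kind of dummy data as above; `long_df` and the seeded generator `rng` are illustrative names, not part of the original answer:

```python
import numpy as np
import pandas as pd

# Dummy data (assumed): 5 classifiers with 500 values each.
rng = np.random.default_rng(0)
clf_aucs = {f"clf{i}": rng.uniform(size=500) for i in range(5)}

# Each dict key becomes a column. With no id_vars, melt unpivots every
# column: the former column label goes into "ID" and the value into
# "Numerical_Column".
long_df = pd.DataFrame(clf_aucs).melt(var_name="ID", value_name="Numerical_Column")

print(long_df.shape)  # (2500, 2)
```

A frame in this long format should then be usable for plotting along the lines of `sns.swarmplot(data=long_df, x="ID", y="Numerical_Column")`, assuming seaborn is installed and imported as `sns`.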
Answer 1 (score: 1)
A one-liner:
pd.DataFrame(data_dict).T.stack().reset_index().drop(columns=['level_1'])
How it works, step by step:
>>> data = {'clf0': [1,2,3,4], 'clf1': [5,6,7,8]}
>>> df = pd.DataFrame(data)
>>> df
clf0 clf1
0 1 5
1 2 6
2 3 7
3 4 8
>>> df.T.stack().reset_index()
level_0 level_1 0
0 clf0 0 1
1 clf0 1 2
2 clf0 2 3
3 clf0 3 4
4 clf1 0 5
5 clf1 1 6
6 clf1 2 7
7 clf1 3 8
>>> # the former index is now 'level_1'; the values are in column 0
>>> df.T.stack().reset_index().drop(columns=['level_1'])
level_0 0
0 clf0 1
1 clf0 2
2 clf0 3
3 clf0 4
4 clf1 5
5 clf1 6
6 clf1 7
7 clf1 8
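The default names `level_0` and `0` are not very descriptive. One possible variation, sketched with the same small `data` dictionary, discards the inner index level before resetting the index and names both columns on the way out. The labels `ID` and `Numerical_Column` are borrowed from the other answer and are only an assumption about what the asker wants:

```python
import pandas as pd

data = {"clf0": [1, 2, 3, 4], "clf1": [5, 6, 7, 8]}

long_df = (
    pd.DataFrame(data)
    .T                                     # one row per classifier
    .stack()                               # Series indexed by (classifier, position)
    .reset_index(level=1, drop=True)       # discard the position level
    .rename_axis("ID")                     # name the remaining index level
    .reset_index(name="Numerical_Column")  # back to a two-column DataFrame
)

print(long_df)
```

Dropping the level up front avoids the separate `drop(columns=['level_1'])` call, at the cost of a slightly longer chain.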