Create a DataFrame in which each row consists of a column name and a value from another DataFrame

Time: 2018-09-28 07:40:44

Tags: python-3.x pandas dataframe

I wrote the following code:

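import pandas as pd

# Read the first 300 rows of the dataset.
df = pd.read_csv('breast-cancer-wisconsin.data.csv', nrows=300)
columns_df = pd.DataFrame(columns=['column_name', 'value'])
columns = df.columns.values
# Append one (column_name, value) row per cell; every append returns a new copy of the frame.
# Note: DataFrame.append was removed in pandas 2.0.
for index, row in df.iterrows():
    for column in columns:
        columns_df = columns_df.append({'column_name': column, 'value': row[column]}, ignore_index=True)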

This script reads the CSV file into a pandas DataFrame, then appends each column name and its corresponding value to a new DataFrame. If I run

print(columns_df[-10:])

I get the following output:

                 column_name   value
2300         clump_thickness       5
2301          unif_cell_size       5
2302         unif_cell_shape       7
2303           marg_adhesion       8
2304  single_epith_cell_size       6
2305             bare_nuclei      10
2306             bland_chrom       7
2307           norm_nucleoli       4
2308                 mitoses       1
2309                   class       4
2310         clump_thickness       5
2311          unif_cell_size       3
2312         unif_cell_shape       4
2313           marg_adhesion       3
2314  single_epith_cell_size       4
2315             bare_nuclei       5
2316             bland_chrom       4
2317           norm_nucleoli       7
2318                 mitoses       1
2319                   class       2
2320         clump_thickness       5
2321          unif_cell_size       4
2322         unif_cell_shape       3
2323           marg_adhesion       1
2324  single_epith_cell_size       2
2325             bare_nuclei  -99999
2326             bland_chrom       2
2327           norm_nucleoli       3
2328                 mitoses       1
2329                   class       2
2330         clump_thickness       8
2331          unif_cell_size       2
2332         unif_cell_shape       1
2333           marg_adhesion       1
2334  single_epith_cell_size       5
2335             bare_nuclei       1
2336             bland_chrom       1
2337           norm_nucleoli       1
2338                 mitoses       1
2339                   class       2
2340         clump_thickness       9
2341          unif_cell_size       1
2342         unif_cell_shape       2
2343           marg_adhesion       6
2344  single_epith_cell_size       4
2345             bare_nuclei      10
2346             bland_chrom       7
2347           norm_nucleoli       7
2348                 mitoses       2
2349                   class       4

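Unfortunately, this script is not very fast and takes a long time to process large DataFrames.

Question: Is there a more elegant/faster way to achieve this result?

A sample of my input data (CSV file):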

    id,clump_thickness,unif_cell_size,unif_cell_shape,marg_adhesion,single_epith_cell_size,bare_nuclei,bland_chrom,norm_nucleoli,mitoses,class
    1000025,5,1,1,1,2,1,3,1,1,2
    1002945,5,4,4,5,7,10,3,2,1,2
    1015425,3,1,1,1,2,2,3,1,1,2
    1016277,6,8,8,1,3,4,3,7,1,2
    1017023,4,1,1,3,2,1,3,1,1,2
    1017122,8,10,10,8,7,10,9,7,1,4
    1018099,1,1,1,1,2,10,3,1,1,2
    1018561,2,1,2,1,2,1,3,1,1,2
    1033078,2,1,1,1,2,1,1,1,5,2
    1033078,4,2,1,1,2,1,2,1,1,2
    1035283,1,1,1,1,1,1,3,1,1,2
    1036172,2,1,1,1,2,1,2,1,1,2
    1041801,5,3,3,3,2,3,4,4,1,4
    1043999,1,1,1,1,2,3,3,1,1,2
    1044572,8,7,5,10,7,9,5,5,4,4
    1047630,7,4,6,4,6,1,4,3,1,4
    1048672,4,1,1,1,2,1,2,1,1,2
    1049815,4,1,1,1,2,1,3,1,1,2
    1050670,10,7,7,6,4,10,4,1,2,4
    1050718,6,1,1,1,2,1,3,1,1,2
    1054590,7,3,2,10,5,10,5,4,4,4
    1054593,10,5,5,3,6,7,7,10,1,4
    1056784,3,1,1,1,2,1,2,1,1,2
    1057013,8,4,5,1,2,?,7,3,1,4
    1059552,1,1,1,1,2,1,3,1,1,2
    1065726,5,2,3,4,2,7,3,6,1,4
    1066373,3,2,1,1,1,1,2,1,1,2
    1066979,5,1,1,1,2,1,2,1,1,2
    1067444,2,1,1,1,2,1,2,1,1,2
    1070935,1,1,3,1,2,1,1,1,1,2
    1070935,3,1,1,1,1,1,2,1,1,2
    1071760,2,1,1,1,2,1,3,1,1,2
    1072179,10,7,7,3,8,5,7,4,3,4
    1074610,2,1,1,2,2,1,3,1,1,2
    1075123,3,1,2,1,2,1,2,1,1,2

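1 answer:

Answer 0 (score: 1)

If performance matters, use melt. Note that the order of the values changes: the output is grouped by column rather than by row.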

df = df.melt('id')
print(df.head())
        id         variable value
0  1000025  clump_thickness     5
1  1002945  clump_thickness     5
2  1015425  clump_thickness     3
3  1016277  clump_thickness     6
4  1017023  clump_thickness     4
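If the row-by-row order from the question is needed, one possible way to restore it after melt is sketched below. This is not part of the original answer. It assumes pandas 1.1 or later (for the `ignore_index` parameter of `melt`), and the small frame, the name `long_df`, and the `column_name` label are made up for illustration.

```python
import pandas as pd

# Tiny stand-in for the CSV, using values from the first sample rows.
df = pd.DataFrame({
    "id": [1000025, 1002945, 1015425],
    "clump_thickness": [5, 5, 3],
    "unif_cell_size": [1, 4, 1],
})

# ignore_index=False keeps each source row's index label on its melted rows.
long_df = df.melt("id", var_name="column_name", ignore_index=False)

# A stable sort on that label regroups the values by source row while
# keeping the original column order inside each group.
long_df = long_df.sort_index(kind="mergesort").reset_index(drop=True)

print(long_df)
```

The extra sort costs some time, so this only makes sense when the row-wise order actually matters.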

Or use set_index with stack, which keeps the values in row order:

df = df.set_index('id').stack().rename_axis(['id','var']).reset_index(name='val')
print(df.head())
        id                     var val
0  1000025         clump_thickness   5
1  1000025          unif_cell_size   1
2  1000025         unif_cell_shape   1
3  1000025           marg_adhesion   1
4  1000025  single_epith_cell_size   2
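To get exactly the two-column `column_name` / `value` layout from the question, a NumPy-based variant might also work. This is a different technique from the two shown in the answer, given here only as a rough sketch: the small frame is invented, and dropping `id` is an assumption based on the question's output, which shows no `id` rows.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the CSV, using values from the first sample rows.
df = pd.DataFrame({
    "id": [1000025, 1002945],
    "clump_thickness": [5, 5],
    "unif_cell_size": [1, 4],
})

# Assumption: the id column is not wanted in the long output.
features = df.drop(columns="id")

columns_df = pd.DataFrame({
    # Repeat the full list of column names once per source row.
    "column_name": np.tile(features.columns.to_numpy(), len(features)),
    # ravel() walks the 2-D array in row-major order, i.e. row by row.
    "value": features.to_numpy().ravel(),
})

print(columns_df)
```

Because the frame is built once from two arrays, this avoids the per-cell appends of the original loop.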