我写了以下代码:
df = pd.read_csv('breast-cancer-wisconsin.data.csv', nrows= 300)
columns_df = pd.DataFrame(columns = ['column_name', 'value'])
columns = df.columns.values
for index, row in df.iterrows():
for column in columns:
columns_df = columns_df.append({'column_name': column, 'value': row[column]}, ignore_index=True)
此脚本将csv文件读取到pandas数据框,然后将每个列名称及其对应的值附加到新的数据框。如果我跑
print(columns_df[-10:])
我得到以下输出
column_name value
2300 clump_thickness 5
2301 unif_cell_size 5
2302 unif_cell_shape 7
2303 marg_adhesion 8
2304 single_epith_cell_size 6
2305 bare_nuclei 10
2306 bland_chrom 7
2307 norm_nucleoli 4
2308 mitoses 1
2309 class 4
2310 clump_thickness 5
2311 unif_cell_size 3
2312 unif_cell_shape 4
2313 marg_adhesion 3
2314 single_epith_cell_size 4
2315 bare_nuclei 5
2316 bland_chrom 4
2317 norm_nucleoli 7
2318 mitoses 1
2319 class 2
2320 clump_thickness 5
2321 unif_cell_size 4
2322 unif_cell_shape 3
2323 marg_adhesion 1
2324 single_epith_cell_size 2
2325 bare_nuclei -99999
2326 bland_chrom 2
2327 norm_nucleoli 3
2328 mitoses 1
2329 class 2
2330 clump_thickness 8
2331 unif_cell_size 2
2332 unif_cell_shape 1
2333 marg_adhesion 1
2334 single_epith_cell_size 5
2335 bare_nuclei 1
2336 bland_chrom 1
2337 norm_nucleoli 1
2338 mitoses 1
2339 class 2
2340 clump_thickness 9
2341 unif_cell_size 1
2342 unif_cell_shape 2
2343 marg_adhesion 6
2344 single_epith_cell_size 4
2345 bare_nuclei 10
2346 bland_chrom 7
2347 norm_nucleoli 7
2348 mitoses 2
2349 class 4
不幸的是,此脚本不是很快,并且要花很长时间才能处理大型数据帧。
问题:是否有更优雅/更快的方法来实现这些结果?
来自我的输入数据(csv文件)的示例:
id,clump_thickness,unif_cell_size,unif_cell_shape,
marg_adhesion,single_epith_cell_size,
bare_nuclei,bland_chrom,norm_nucleoli,mitoses,class
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1016277,6,8,8,1,3,4,3,7,1,2
1017023,4,1,1,3,2,1,3,1,1,2
1017122,8,10,10,8,7,10,9,7,1,4
1018099,1,1,1,1,2,10,3,1,1,2
1018561,2,1,2,1,2,1,3,1,1,2
1033078,2,1,1,1,2,1,1,1,5,2
1033078,4,2,1,1,2,1,2,1,1,2
1035283,1,1,1,1,1,1,3,1,1,2
1036172,2,1,1,1,2,1,2,1,1,2
1041801,5,3,3,3,2,3,4,4,1,4
1043999,1,1,1,1,2,3,3,1,1,2
1044572,8,7,5,10,7,9,5,5,4,4
1047630,7,4,6,4,6,1,4,3,1,4
1048672,4,1,1,1,2,1,2,1,1,2
1049815,4,1,1,1,2,1,3,1,1,2
1050670,10,7,7,6,4,10,4,1,2,4
1050718,6,1,1,1,2,1,3,1,1,2
1054590,7,3,2,10,5,10,5,4,4,4
1054593,10,5,5,3,6,7,7,10,1,4
1056784,3,1,1,1,2,1,2,1,1,2
1057013,8,4,5,1,2,?,7,3,1,4
1059552,1,1,1,1,2,1,3,1,1,2
1065726,5,2,3,4,2,7,3,6,1,4
1066373,3,2,1,1,1,1,2,1,1,2
1066979,5,1,1,1,2,1,2,1,1,2
1067444,2,1,1,1,2,1,2,1,1,2
1070935,1,1,3,1,2,1,1,1,1,2
1070935,3,1,1,1,1,1,2,1,1,2
1071760,2,1,1,1,2,1,3,1,1,2
1072179,10,7,7,3,8,5,7,4,3,4
1074610,2,1,1,2,2,1,3,1,1,2
1075123,3,1,2,1,2,1,2,1,1,2
答案 0 :(得分:1)
如果性能很重要,请使用melt
,但值的顺序会更改:
df = df.melt('id')
print (df.head())
id variable value
0 1000025 clump_thickness 5
1 1002945 clump_thickness 5
2 1015425 clump_thickness 3
3 1016277 clump_thickness 6
4 1017023 clump_thickness 4
或使用:
df = df.set_index('id').stack().rename_axis(['id','var']).reset_index(name='val')
print (df.head())
id var val
0 1000025 clump_thickness 5
1 1000025 unif_cell_size 1
2 1000025 unif_cell_shape 1
3 1000025 marg_adhesion 1
4 1000025 single_epith_cell_size 2