我有以下pandas数据框:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
ind
NA06984 -0.0082 -0.0594 -0.0148 -0.0569 -0.1128 -0.0276 -0.0217
NA06986 -0.0131 -0.0659 -0.0426 0.0654 0.0473 0.0603 -0.0454
NA06989 -0.0073 -0.0551 -0.0457 0.0971 -0.0051 -0.0123 0.0035
NA06994 -0.0051 -0.0599 -0.0239 0.0930 0.0765 0.0321 0.0392
NA07000 -0.0046 -0.0362 0.0006 -0.0639 -0.0197 -0.0132 0.0631
NA07037 -0.0132 -0.0600 -0.0252 -0.0381 -0.0091 0.0005 0.0235
NA07048 -0.0128 -0.0653 -0.0234 -0.0417 0.0233 0.1034 0.0180
NA07051 -0.0028 -0.0591 -0.0117 -0.0791 -0.0387 0.0102 -0.0840
NA07056 -0.0121 -0.0389 0.0113 -0.0754 0.0226 -0.0304 -0.0490
NA07347 -0.0192 -0.0441 -0.0588 0.1099 -0.0414 0.0505 0.0295
NA07357 -0.0100 -0.0360 -0.0268 -0.0621 -0.0737 -0.0090 0.0379
我希望标准化每列的分布,即应用公式
column_i [row_j] - column_i.mean())/ column_i.std()
表示每列中的每个值,并用这些值替换原始数据框。
到目前为止,我已经提出了
for index, row in evec_pandas.iterrows():
new_row = None
evec_pandas.loc[index,'PC1'] = (row['PC1'] - evec_pandas['PC1'].mean()) / evec_pandas['PC1'].std()
print evec_pandas
但结果是
PC1 PC2 PC3 PC4 PC5 PC6 PC7
ind
NA06984 0.343471 -0.0594 -0.0148 -0.0569 -0.1128 -0.0276 -0.0217
NA06986 -0.330077 -0.0659 -0.0426 0.0654 0.0473 0.0603 -0.0454
NA06989 -0.003975 -0.0551 -0.0457 0.0971 -0.0051 -0.0123 0.0035
NA06994 0.008607 -0.0599 -0.0239 0.0930 0.0765 0.0321 0.0392
NA07000 0.003659 -0.0362 0.0006 -0.0639 -0.0197 -0.0132 0.0631
NA07037 -0.058300 -0.0600 -0.0252 -0.0381 -0.0091 0.0005 0.0235
NA07048 -0.028319 -0.0653 -0.0234 -0.0417 0.0233 0.1034 0.0180
NA07051 0.046818 -0.0591 -0.0117 -0.0791 -0.0387 0.0102 -0.0840
NA07056 -0.043817 -0.0389 0.0113 -0.0754 0.0226 -0.0304 -0.0490
NA07347 -0.071195 -0.0441 -0.0588 0.1099 -0.0414 0.0505 0.0295
NA07357 0.019495 -0.0360 -0.0268 -0.0621 -0.0737 -0.0090 0.0379
第一个值是正确的(0.343471),但PC1列中的其余值不是,当然其余列没有变化。如果我使用:
for index, row in evec_pandas.iterrows():
new_row = None
new_row = (row['PC1'] - evec_pandas['PC1'].mean()) / evec_pandas['PC1'].std()
print new_row
我确实获得了PC1列,但作为一个独立的对象,不在数据框内:
0.343471311655
-0.673732188246
0.530304607555
0.987008219756
1.09080449526
-0.694491443346
-0.611454422946
1.46447108706
-0.466139637246
-1.94004674935
-0.0301952801455
所以我需要用这些值替换PC1,然后对每列做同样的事情;我曾想过像
这样的东西for index, column in evec_pandas.iteritems():
for index, row in evec_pandas.iterrows():
new_row = None
evec_pandas.loc[index,column] = (row[column] - evec_pandas[column].mean()) / evec_pandas[column].std()
但我知道它不会像这样工作。有什么想法吗?
所需的输出是:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
NA06984 0.34347131 -0.5760881 0.439607045 -0.6710009 -1.8594019 -1.0130591 -0.50633142
NA06986 -0.67373219 -1.1365003 -0.929352573 0.9013689 1.0906816 1.0794999 -1.02745500
NA06989 0.53030461 -0.2053539 -1.082006343 1.3089251 0.1251327 -0.6488253 0.04777466
NA06994 0.98700822 -0.6191967 -0.008505635 1.2562128 1.6287356 0.4081670 0.83275827
NA07000 1.09080450 1.4241525 1.197951582 -0.7609975 -0.1438943 -0.6702508 1.35827952
NA07037 -0.69449144 -0.6278185 -0.072521733 -0.4292956 0.0514267 -0.3441068 0.48754139
NA07048 -0.61145442 -1.0847700 0.016115941 -0.4755796 0.6484455 2.1055441 0.36660554
NA07051 1.46447109 -0.5502229 0.592260816 -0.9564188 -0.4939979 -0.1131873 -1.87620479
NA07056 -0.46613964 1.1913658 1.724853306 -0.9088491 0.6355469 -1.0797163 -1.10661301
NA07347 -1.94004675 0.7430361 -1.727091631 1.4734904 -0.5437494 0.8461998 0.61947141
NA07357 -0.03019528 1.4413959 -0.151310775 -0.7378555 -1.1389255 -0.5702651 0.80417343
答案 0 :(得分:3)
您可以执行以下操作:
In [19]:
(df - df.mean())/df.std()
Out[19]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
ind
NA06984 0.343471 -0.576088 0.439607 -0.671001 -1.859402 -1.013059 -0.506331
NA06986 -0.673732 -1.136500 -0.929353 0.901369 1.090682 1.079500 -1.027455
NA06989 0.530305 -0.205354 -1.082006 1.308925 0.125133 -0.648825 0.047775
NA06994 0.987008 -0.619197 -0.008506 1.256213 1.628736 0.408167 0.832758
NA07000 1.090804 1.424152 1.197952 -0.760998 -0.143894 -0.670251 1.358280
NA07037 -0.694491 -0.627818 -0.072522 -0.429296 0.051427 -0.344107 0.487541
NA07048 -0.611454 -1.084770 0.016116 -0.475580 0.648445 2.105544 0.366606
NA07051 1.464471 -0.550223 0.592261 -0.956419 -0.493998 -0.113187 -1.876205
NA07056 -0.466140 1.191366 1.724853 -0.908849 0.635547 -1.079716 -1.106613
NA07347 -1.940047 0.743036 -1.727092 1.473490 -0.543749 0.846200 0.619471
NA07357 -0.030195 1.441396 -0.151311 -0.737856 -1.138926 -0.570265 0.804173
这将对整个df进行操作,因此无需迭代行/列