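Please help. This looks simple, but I just can't figure it out.
The DataFrame (df) contains numbers. For each column:
* compute the mean and the standard deviation
* compute a new value for every value in every row of the column
* replace the original value with the new value
Method 1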
import numpy as np
import pandas as pd
n = 1
while n<len(df.column.values.tolist()):
    col = df.values[:,n]
    mean = sum(col)/len(col)
    std = np.std(col, axis = 0)
    for x in df[df.columns.values[n]]:
        y = (float(x) - float(mean)) / float(std)
        df.set_value(x, df.columns.values[n], y)
    n = n+1
Method 2
labels = df.columns.values.tolist()
df2 = df.ix[:,0]
n = 1
while n<len(df.column.values.tolist()):
    col = df.values[:,n]
    mean = sum(col)/len(col)
    std = np.std(col, axis = 0)
    ls = []
    for x in df[df.columns.values[n]]:
        y = (float(x) - float(mean)) / float(std)
        ls.append(y)
    df2 = pd.DataFrame({labels[n]:str(ls)})
    df1 = pd.concat([df1, df2], axis=1, ignore_index=True)
    n = n+1
Error: ValueError: If using all scalar values, you must pass an index
I also tried the .apply method, but the values in the new DataFrame were not changed.
print(df.to_json()):
{"col1":{"subj1":4161.97,"subj2":5794.73,"subj3":4740.44,"subj4":4702.84,"subj5":3818.94},"col2":{"subj1":13974.62,"subj2":19635.32,"subj3":17087.721851,"subj4":13770.461021,"subj5":11546.157578},"col3":{"subj1":270.7,"subj2":322.607708,"subj3":293.422314,"subj4":208.644585,"subj5":210.619961},"col4":{"subj1":5400.16,"subj2":7714.080365,"subj3":6023.026011,"subj4":5880.187272,"subj5":4880.056292}}
Answer 0 (score: 1)
You can standardize each column by removing its mean and scaling it to unit variance. scikit-learn's StandardScaler does this:
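import pandas as pd
from sklearn import preprocessing
# StandardScaler standardizes each column (feature), so pass df itself, not df.T;
# the result then has the same shape as df and fits df.columns and df.index
scaler = preprocessing.StandardScaler()
new_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)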
Here is the documentation for it.
Answer 1 (score: 0)
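You seem to be operating on DataFrame columns and values as though the DataFrame were a plain list or array, rather than in the vectorized, one-column-at-a-time style that is more customary with NumPy and Pandas.
A simple first improvement might be: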
# import your data (json_text is assumed to hold the JSON string printed in the question)
import json
import numpy as np
import pandas as pd
df = pd.DataFrame(json.loads(json_text))
# loop over only numeric columns
for col in df.select_dtypes([np.number]):
    # compute column mean and std
    col_mean = df[col].mean()
    col_std = df[col].std()
    # adjust column to normalized values
    df[col] = df[col].apply(lambda x: (x - col_mean) / col_std)
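This works one column at a time. It keeps some explicit looping, but it is straightforward and relatively beginner-friendly.
If you are comfortable with Pandas, it can be done more compactly: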
numeric_cols = list(df.select_dtypes([np.number]))
df[numeric_cols] = df[numeric_cols].apply(lambda col: (col - col.mean()) / col.std(), axis=0)
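For a frame that is entirely numeric, as the one in the question is, the same standardization can also be written without apply, because pandas broadcasts the column means and standard deviations across the rows. One detail worth checking is the ddof argument: pandas' std() defaults to the sample standard deviation (ddof=1), while np.std, which the question uses, defaults to the population standard deviation (ddof=0), and scikit-learn's StandardScaler also uses the population form. The following is only a sketch; the helper name standardize is made up for illustration, and the sample frame reuses two of the question's columns.

```python
import numpy as np
import pandas as pd


def standardize(frame, ddof=1):
    """Return a new frame with each column shifted to mean 0 and divided by its std.

    standardize is a hypothetical helper, not a pandas function.
    ddof=1 matches pandas' default std(); ddof=0 matches np.std's default.
    """
    return (frame - frame.mean()) / frame.std(ddof=ddof)


# Two of the question's columns, used here as sample data
df = pd.DataFrame(
    {
        "col1": [4161.97, 5794.73, 4740.44, 4702.84, 3818.94],
        "col3": [270.7, 322.607708, 293.422314, 208.644585, 210.619961],
    },
    index=["subj1", "subj2", "subj3", "subj4", "subj5"],
)

sample_z = standardize(df)               # sample std, like df[col].std()
population_z = standardize(df, ddof=0)   # population std, like np.std(col)

print(sample_z.round(3))
print(population_z.round(3))
```

The two results differ only by a constant factor per column, so either may be fine depending on the use case; the point is to pick one deliberately when comparing against np.std or StandardScaler output.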
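The revised DataFrame has no string columns. The earlier version of the DataFrame did, however, and string columns cause problems in these calculations, so it pays to be careful. Selecting by dtype is a general way of picking out the numeric columns. If that is more than you need, you can trade generality for simplicity by listing the columns explicitly: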
numeric_cols = ['col1', 'col2', 'col3', 'col4']
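To see why selecting the numeric columns matters, here is a small sketch with a made-up frame that mixes a string column with numeric ones. The column names label, col1 and col2 and their values are hypothetical and not taken from the question. select_dtypes picks out only the numeric columns, so the string column passes through unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed frame: one string column, two numeric columns
df = pd.DataFrame(
    {
        "label": ["a", "b", "c", "d"],
        "col1": [1.0, 2.0, 3.0, 4.0],
        "col2": [10.0, 20.0, 30.0, 40.0],
    }
)

# Names of the columns whose dtype is numeric
numeric_cols = list(df.select_dtypes(include=[np.number]).columns)

# Standardize only those columns; "label" is left as it was
standardized = df.copy()
numeric_part = df[numeric_cols]
standardized[numeric_cols] = (numeric_part - numeric_part.mean()) / numeric_part.std()

print(numeric_cols)
print(standardized)
```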