Iterate over a pandas DataFrame in Python and change values based on a function

Time: 2017-07-11 22:03:22

Tags: python pandas

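Please help. This seems simple, but I just can't figure it out. The DataFrame (df) contains numbers. For each column:
* calculate the mean and the standard deviation
* calculate a new value for each value in each row of the column
* replace the value with the new value

Method 1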

import numpy as np
import pandas as pd
n = 1
while n<len(df.column.values.tolist()):
    col = df.values[:,n]
    mean = sum(col)/len(col)
    std = np.std(col, axis = 0)
    for x in df[df.columns.values[n]]:
        y = (float(x) - float(mean)) / float(std)
        df.set_value(x, df.columns.values[n], y)
    n = n+1

Method 2

    labels = df.columns.values.tolist()
    df2 = df.ix[:,0]
    n = 1
    while n<len(df.column.values.tolist()):
        col = df.values[:,n]
        mean = sum(col)/len(col)
        std = np.std(col, axis = 0)
        ls = []
        for x in df[df.columns.values[n]]:
            y = (float(x) - float(mean)) / float(std)
            ls.append(y)
        df2 = pd.DataFrame({labels[n]:str(ls)})
        df1 = pd.concat([df1, df2], axis=1, ignore_index=True)
        n = n+1

Error: ValueError: If using all scalar values, you must pass an index

I also tried the .apply method, but the values in the new DataFrame were not changed.

print(df.to_json()):
{"col1":{"subj1":4161.97,"subj2":5794.73,"subj3":4740.44,"subj4":4702.84,"subj5":3818.94},"col2":{"subj1":13974.62,"subj2":19635.32,"subj3":17087.721851,"subj4":13770.461021,"subj5":11546.157578},"col3":{"subj1":270.7,"subj2":322.607708,"subj3":293.422314,"subj4":208.644585,"subj5":210.619961},"col4":{"subj1":5400.16,"subj2":7714.080365,"subj3":6023.026011,"subj4":5880.187272,"subj5":4880.056292}}

2 answers:

Answer 0 (score: 1)

You are standardizing each column by removing the mean and scaling to unit variance. You can use scikit-learn's StandardScaler:

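import pandas as pd
from sklearn import preprocessing

scaler = preprocessing.StandardScaler()
# StandardScaler already standardizes each column, so pass df itself rather than df.T
new_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)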

Here is the documentation for the same.
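If you would rather not depend on scikit-learn, the same standardization can be sketched in plain pandas. This assumes StandardScaler's default behaviour, which divides by the population standard deviation (ddof=0, the same default as np.std in the question). The DataFrame below is just the first two columns of the question's data.

```python
import numpy as np
import pandas as pd

# First two columns of the data posted in the question
df = pd.DataFrame(
    {
        "col1": [4161.97, 5794.73, 4740.44, 4702.84, 3818.94],
        "col2": [13974.62, 19635.32, 17087.721851, 13770.461021, 11546.157578],
    },
    index=["subj1", "subj2", "subj3", "subj4", "subj5"],
)

# Subtract each column's mean and divide by its population std (ddof=0);
# pandas aligns the per-column statistics by column label
new_df = (df - df.mean()) / df.std(ddof=0)
```

Each column of new_df should then have a mean of about 0 and a population standard deviation of about 1, with the original index and column labels preserved.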

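Answer 1 (score: 0)

You seem to be operating on DataFrame columns and values as if the DataFrame were a simple list or array, rather than in the vectorized, one-column-at-a-time way that is more usual with NumPy and Pandas.

A simple first improvement might be: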
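As an aside on the ValueError from Method 2: it most likely comes from wrapping the list in str(), which turns it into a single string, so pandas sees only a scalar and has nothing to build an index from. A minimal sketch, where the values list is hypothetical:

```python
import pandas as pd

values = [0.5, -1.2, 0.7]  # hypothetical normalized values

# str(values) is one string, i.e. a scalar, so pandas cannot infer an index
try:
    pd.DataFrame({"col2": str(values)})
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Passing the list itself gives one row per element
fixed = pd.DataFrame({"col2": values})
```

Here error_message holds the text of the ValueError, while fixed is an ordinary three-row DataFrame.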

# import your data (json_text holds the JSON string printed in the question)
import json
import numpy as np
import pandas as pd
df = pd.DataFrame(json.loads(json_text))

# loop over only numeric columns
for col in df.select_dtypes([np.number]):
    # compute column mean and std
    col_mean = df[col].mean()
    col_std  = df[col].std()
    # adjust column to normalized values
    df[col] = df[col].apply(lambda x: (x - col_mean) / col_std)

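This works one column at a time. It keeps some explicit looping, but it is straightforward and relatively beginner-friendly.

If you are comfortable with Pandas, it can be done more compactly: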

numeric_cols = list(df.select_dtypes([np.number]))
df[numeric_cols] = df[numeric_cols].apply(lambda col: (col - col.mean()) / col.std(), axis=0)

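In the revised DataFrame there are no string columns. But the earlier DataFrame had string columns, which cause problems when you try to compute on them, so we have to be careful. This is a general way to select the numeric columns. If that is too much, you can trade generality for simplicity by listing them explicitly: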

numeric_cols = ['col1', 'col2', 'col3', 'col4']
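Putting the pieces together, here is a small self-contained sketch that drops .apply entirely and relies on pandas arithmetic instead. It uses two of the question's columns, and the "group" column is a hypothetical string column added only to show that non-numeric data is left alone.

```python
import numpy as np
import pandas as pd

# Two columns from the question's JSON; outer keys become columns, inner keys the index
data = {
    "col1": {"subj1": 4161.97, "subj2": 5794.73, "subj3": 4740.44,
             "subj4": 4702.84, "subj5": 3818.94},
    "col3": {"subj1": 270.7, "subj2": 322.607708, "subj3": 293.422314,
             "subj4": 208.644585, "subj5": 210.619961},
}
df = pd.DataFrame(data)
df["group"] = ["a", "b", "a", "b", "a"]  # hypothetical string column

# Select only the numeric columns, then normalize them all in one expression;
# the column means and stds are aligned by column label
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = (df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std()
```

Note that .std() in pandas defaults to the sample standard deviation (ddof=1), so the numbers differ slightly from what np.std in the question would give; pass ddof=0 if you want them to match.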