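I am trying to automate my workflow and write cleaner code. I want my code to read a CSV, group it by X (currently a variable named "Class"), and then remove every value that lies more than 3 std from the mean.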
import pandas as pd
import numpy as np
my_path = "data_291018.csv"
df = pd.read_csv(my_path)  # read_csv already returns a DataFrame
# Drop any "Unnamed" columns
df = df.drop(df.columns[df.columns.str.contains('unnamed', case=False)], axis=1)
# .copy() so that the assignments below modify independent frames, not views of df
class_8 = df[df["Class"] == 8].copy()
class_11 = df[df["Class"] == 11].copy()
heads = df.columns[4:].values  # measurement columns (exp1 onwards)
for i in heads:
    # Replace values 3 std or more away from the class mean with NaN
    class_8[i] = class_8[i].apply(lambda x: x if abs(x-class_8[i].mean()) < 3*class_8[i].std() else np.nan)
    class_11[i] = class_11[i].apply(lambda x: x if abs(x-class_11[i].mean()) < 3*class_11[i].std() else np.nan)
both = pd.concat([class_8, class_11])
both.to_csv("data.csv", sep=',')
I tried to avoid running this on two separate DataFrames:
new_df = df.copy()
class_df = df.groupby("Class")
and then ran:
for i in heads:
    new_df[i] = new_df[i].apply(lambda x: x if abs(x-class_df[i].mean()) < 3*class_df[i].std() else np.nan)
It fails with: raise ValueError("Can only compare identically-labeled Series objects") ValueError: ('Can only compare identically-labeled Series objects', 'occurred at index SubjNum')
Can you help me? At a later stage I would like to group by more than one variable.
Thank you very much!
The DataFrame looks like this:
SubjNum Class Genderm1f2 LRLevel exp1 exp2 exp3 exp4 exp5
8001 8 1 1 88 2 15 19 92
8002 8 2 1 85 59 19 20 97
8003 8 2 1 84 52 12 18 91
8004 11 2 1 85 44 17 20 92
8005 11 2 1 81 35 400 18 93
8006 11 1 1 190 56 20 17 97
I want to remove the cells that are more than 3 std from the mean, computed per Class/Gender etc.:
SubjNum Class Genderm1f2 LRLevel exp1 exp2 exp3 exp4 exp5
8001 8 1 1 88 . 15 19 92
8002 8 2 1 85 59 19 20 97
8003 8 2 1 84 52 12 18 91
8004 11 2 1 85 44 17 20 92
8005 11 2 1 81 35 . 18 93
8006 11 1 1 . 56 20 17 97
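Answer 0 (score: 0)
As far as I understand the question, I am just posting my observations here so you can check whether they are relevant to what you are looking for; a complete answer from the experts is still pending:
A mock DataFrame built from your example: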
>>> df
SubjNum Class Genderm1f2 LRLevel exp1 exp2 exp3 exp4 exp5
0 8001 8 1 1 88 2 15 19 92
1 8002 8 2 1 85 59 19 20 97
2 8003 8 2 1 84 52 12 18 91
3 8004 11 2 1 85 44 17 20 92
4 8005 11 2 1 81 35 400 18 93
5 8006 11 1 1 190 56 20 17 97
Mean, grouped by these two columns:
>>> df.groupby(['Class', 'Genderm1f2']).mean()
SubjNum LRLevel exp1 exp2 exp3 exp4 exp5
Class Genderm1f2
8 1 8001.0 1.0 88.0 2.0 15.0 19.0 92.0
2 8002.5 1.0 84.5 55.5 15.5 19.0 94.0
11 1 8006.0 1.0 190.0 56.0 20.0 17.0 97.0
2 8004.5 1.0 83.0 39.5 208.5 19.0 92.5
Standard deviation, grouped by these two columns:
>>> df.groupby(['Class', 'Genderm1f2']).std()
SubjNum LRLevel exp1 exp2 exp3 exp4 exp5
Class Genderm1f2
8 1 NaN NaN NaN NaN NaN NaN NaN
2 0.707107 0.0 0.707107 4.949747 4.949747 1.414214 4.242641
11 1 NaN NaN NaN NaN NaN NaN NaN
2 0.707107 0.0 2.828427 6.363961 270.821897 1.414214 0.707107
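The NaN rows above come from groups that contain a single row: pandas computes the sample standard deviation (ddof=1) by default, which is undefined for one observation. A minimal sketch of that behavior, using a made-up one-element Series named `single`:

```python
import pandas as pd

# One observation, like the (Class=8, Genderm1f2=1) group in the example
single = pd.Series([88.0])

sample_std = single.std()            # ddof=1 by default: NaN for a single value
population_std = single.std(ddof=0)  # population std of a single value: 0.0

# Any comparison against NaN is False, so a "< 3 * std" test rejects the value
within_limit = abs(88.0 - single.mean()) < 3 * sample_std

print(sample_std, population_std, within_limit)
```

This matters for the filter in the question: a value that is alone in its group would be replaced by NaN unless that case is handled explicitly.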
Simply group by the two desired columns and aggregate with mean() and std():
>>> df.groupby(['Class', 'Genderm1f2']).agg(['mean','std'])
SubjNum LRLevel exp1 exp2 exp3 exp4 exp5
mean std mean std mean std mean std mean std mean std mean std
Class Genderm1f2
8 1 8001.0 NaN 1 NaN 88.0 NaN 2.0 NaN 15.0 NaN 19 NaN 92.0 NaN
2 8002.5 0.707107 1 0.0 84.5 0.707107 55.5 4.949747 15.5 4.949747 19 1.414214 94.0 4.242641
11 1 8006.0 NaN 1 NaN 190.0 NaN 56.0 NaN 20.0 NaN 17 NaN 97.0 NaN
2 8004.5 0.707107 1 0.0 83.0 2.828427 39.5 6.363961 208.5 270.821897 19 1.414214 92.5 0.707107
Group by the two desired columns and check which of the aggregated mean() and std() values are greater than 3:
>>> df.groupby(['Class', 'Genderm1f2']).agg(['mean','std']) > 3
SubjNum LRLevel exp1 exp2 exp3 exp4 exp5
mean std mean std mean std mean std mean std mean std mean std
Class Genderm1f2
8 1 True False False False True False False False True False True False True False
2 True False False False True False True True True True True False True True
11 1 True False False False True False True False True False True False True False
2 True False False False True False True True True True True False True False
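Note that the comparison above tests the aggregated statistics themselves against 3; it does not test each cell against its own group's mean and std. One possible way to do the per-cell test is `groupby(...).transform(...)`, which returns the group statistic aligned to the original rows and so avoids the label mismatch behind the ValueError in the question. The following is only a sketch: the helper name `mask_outliers` and its parameter `n_std` are made up for illustration, and keeping values whose group std is NaN or zero is my assumption, not something the question specifies.

```python
import pandas as pd

# Mock of the example data from the question
df = pd.DataFrame({
    "SubjNum": [8001, 8002, 8003, 8004, 8005, 8006],
    "Class": [8, 8, 8, 11, 11, 11],
    "Genderm1f2": [1, 2, 2, 2, 2, 1],
    "LRLevel": [1, 1, 1, 1, 1, 1],
    "exp1": [88, 85, 84, 85, 81, 190],
    "exp2": [2, 59, 52, 44, 35, 56],
    "exp3": [15, 19, 12, 17, 400, 20],
    "exp4": [19, 20, 18, 20, 18, 17],
    "exp5": [92, 97, 91, 92, 93, 97],
})


def mask_outliers(frame, group_cols, value_cols, n_std=3.0):
    """Hypothetical helper: NaN out values beyond n_std group stds from the group mean."""
    grouped = frame.groupby(group_cols)[value_cols]
    # transform() returns frames shaped and indexed like frame[value_cols]
    mean = grouped.transform("mean")
    std = grouped.transform("std")
    deviation = (frame[value_cols] - mean).abs()
    # Assumption: keep values when std is NaN (single-row group);
    # "<=" also keeps values in groups whose std is exactly zero
    keep = (deviation <= n_std * std) | std.isna()
    out = frame.copy()
    out[value_cols] = frame[value_cols].where(keep)
    return out


value_cols = list(df.columns[4:])  # exp1 ... exp5, as in the question
cleaned = mask_outliers(df, ["Class"], value_cols)
cleaned_two_keys = mask_outliers(df, ["Class", "Genderm1f2"], value_cols)
```

Grouping by more than one variable only requires a longer `group_cols` list. One caveat about the sample data: with only three rows per class, no value can be more than about 1.15 sample standard deviations from its group mean, so nothing is masked at 3 std here; the threshold only starts to bite with larger groups.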