How do I compare a dataset with a subset of itself? [pandas]

Time: 2018-11-04 08:54:25

Tags: python pandas compare pandas-groupby

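I am trying to automate my processing and write cleaner code. I want my code to read a CSV, group it by X (currently the variable named "Class"), and then remove every value that is more than 3 std from the mean.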

import pandas as pd
import numpy as np


my_path = "data_291018.csv"
data_loc = pd.read_csv(my_path)

df = pd.DataFrame(data_loc)
df = df.drop(df.columns[df.columns.str.contains('unnamed', case=False)], axis=1)

class_8 = df[df["Class"] == 8]
class_11 = df[df["Class"] == 11]

heads = df.columns[4:].values

for i in heads:
    class_8[i] = class_8[i].apply(lambda x: x if abs(x-class_8[i].mean()) < 3*class_8[i].std() else np.nan)
    class_11[i] = class_11[i].apply(lambda x: x if abs(x-class_11[i].mean()) < 3*class_11[i].std() else np.nan)

both = pd.concat([class_8, class_11])

both.to_csv("data.csv", sep=',')

I tried to avoid running this on two separate DataFrames by using

new_df = df.copy()
class_df = df.groupby("Class")

and then running

for i in heads:
    new_df[i] = new_df[i].apply(lambda x: x if abs(x-class_df[i].mean()) < 3*class_df[i].std() else np.nan)

It fails with: ValueError: ('Can only compare identically-labeled Series objects', 'occurred at index SubjNum')

Can you help me? At a later stage I would like to group by more than one variable.

Thank you very much!

The DF looks like this:

SubjNum  Class  Genderm1f2  LRLevel  exp1  exp2  exp3  exp4  exp5
8001     8      1           1        88    2     15    19    92
8002     8      2           1        85    59    19    20    97
8003     8      2           1        84    52    12    18    91
8004     11     2           1        85    44    17    20    92
8005     11     2           1        81    35    400   18    93
8006     11     1           1        190   56    20    17    97

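I want to remove the cells that are more than 3 std from the mean, computed per Class/Gender etc., so that the result looks like this ("." marks a removed cell):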

SubjNum  Class  Genderm1f2  LRLevel  exp1  exp2  exp3  exp4  exp5
8001     8      1           1        88    .     15    19    92
8002     8      2           1        85    59    19    20    97
8003     8      2           1        84    52    12    18    91
8004     11     2           1        85    44    17    20    92
8005     11     2           1        81    35    .     18    93
8006     11     1           1        .     56    20    17    97

1 answer:

Answer 0 (score: 0)

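This is only as far as I understand the problem. I am posting my observations here so you can check whether they are relevant to what you are looking for; an expert may still come along with a complete answer: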

A mock DataFrame built from your example:

>>> df
   SubjNum  Class  Genderm1f2  LRLevel  exp1  exp2  exp3  exp4  exp5
0     8001      8           1        1    88     2    15    19    92
1     8002      8           2        1    85    59    19    20    97
2     8003      8           2        1    84    52    12    18    91
3     8004     11           2        1    85    44    17    20    92
4     8005     11           2        1    81    35   400    18    93
5     8006     11           1        1   190    56    20    17    97

The mean, grouped by these two columns:

>>> df.groupby(['Class', 'Genderm1f2']).mean()
                  SubjNum  LRLevel   exp1  exp2   exp3  exp4  exp5
Class Genderm1f2
8     1            8001.0      1.0   88.0   2.0   15.0  19.0  92.0
      2            8002.5      1.0   84.5  55.5   15.5  19.0  94.0
11    1            8006.0      1.0  190.0  56.0   20.0  17.0  97.0
      2            8004.5      1.0   83.0  39.5  208.5  19.0  92.5

The standard deviation, grouped by the same two columns (groups with a single row give NaN):

>>> df.groupby(['Class', 'Genderm1f2']).std()
                   SubjNum  LRLevel      exp1      exp2        exp3      exp4      exp5
Class Genderm1f2
8     1                NaN      NaN       NaN       NaN         NaN       NaN       NaN
      2           0.707107      0.0  0.707107  4.949747    4.949747  1.414214  4.242641
11    1                NaN      NaN       NaN       NaN         NaN       NaN       NaN
      2           0.707107      0.0  2.828427  6.363961  270.821897  1.414214  0.707107

You can simply group by the two desired columns and aggregate with both mean() and std() at once:

>>> df.groupby(['Class', 'Genderm1f2']).agg(['mean','std'])
                 SubjNum           LRLevel        exp1            exp2             exp3             exp4            exp5
                    mean       std    mean  std   mean       std  mean       std   mean         std mean       std  mean       std
Class Genderm1f2
8     1           8001.0       NaN       1  NaN   88.0       NaN   2.0       NaN   15.0         NaN   19       NaN  92.0       NaN
      2           8002.5  0.707107       1  0.0   84.5  0.707107  55.5  4.949747   15.5    4.949747   19  1.414214  94.0  4.242641
11    1           8006.0       NaN       1  NaN  190.0       NaN  56.0       NaN   20.0         NaN   17       NaN  97.0       NaN
      2           8004.5  0.707107       1  0.0   83.0  2.828427  39.5  6.363961  208.5  270.821897   19  1.414214  92.5  0.707107
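
The aggregated table above has one row per group, so it cannot be compared directly with the original rows; that label mismatch is likely what triggers the "identically-labeled" error in the question. One possible way around it, as a rough sketch rather than a definitive solution, is `groupby(...).transform`, which returns the group statistics on the same index as `df`. The list `value_cols` is a name introduced here for illustration, on the assumption that only the exp columns need checking:

```python
import pandas as pd

# Mock data copied from the example in the question.
df = pd.DataFrame({
    "SubjNum": [8001, 8002, 8003, 8004, 8005, 8006],
    "Class": [8, 8, 8, 11, 11, 11],
    "Genderm1f2": [1, 2, 2, 2, 2, 1],
    "LRLevel": [1, 1, 1, 1, 1, 1],
    "exp1": [88, 85, 84, 85, 81, 190],
    "exp2": [2, 59, 52, 44, 35, 56],
    "exp3": [15, 19, 12, 17, 400, 20],
    "exp4": [19, 20, 18, 20, 18, 17],
    "exp5": [92, 97, 91, 92, 93, 97],
})

# Assumption: only the measurement columns are checked for outliers.
value_cols = ["exp1", "exp2", "exp3", "exp4", "exp5"]
grouped = df.groupby(["Class", "Genderm1f2"])[value_cols]

# transform() keeps the original row index, so every row carries the
# mean/std of the group it belongs to.
group_mean = grouped.transform("mean")
group_std = grouped.transform("std")
```

Because `group_mean` and `group_std` share the index and columns of `df[value_cols]`, they can be subtracted from or compared with it element by element.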

Grouping by the two desired columns and checking which of the aggregated mean() and std() values are greater than 3:

>>> df.groupby(['Class', 'Genderm1f2']).agg(['mean','std']) > 3
                 SubjNum        LRLevel         exp1          exp2         exp3         exp4         exp5
                    mean    std    mean    std  mean    std   mean    std  mean    std  mean    std  mean    std
Class Genderm1f2
8     1             True  False   False  False  True  False  False  False  True  False  True  False  True  False
      2             True  False   False  False  True  False   True   True  True   True  True  False  True   True
11    1             True  False   False  False  True  False   True  False  True  False  True  False  True  False
      2             True  False   False  False  True  False   True   True  True   True  True  False  True  False
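
Note that the comparison above tests the aggregated values themselves against 3; it does not test each row against its own group. A minimal sketch of the per-row check the question asks for might look like the following. The function name `mask_outliers`, its parameters, and the `demo` frame are all hypothetical, made up for illustration and not taken from the question's CSV:

```python
import pandas as pd


def mask_outliers(df, group_cols, value_cols, n_std=3.0):
    """Return a copy of df in which values lying more than n_std group
    standard deviations from their group mean are replaced by NaN."""
    out = df.copy()
    grouped = df.groupby(group_cols)[value_cols]
    mean = grouped.transform("mean")  # group mean, aligned to df's rows
    std = grouped.transform("std")    # group std (NaN for 1-row groups)
    # Comparisons against a NaN std are False, so values in single-row
    # groups are kept. This is a design choice of this sketch.
    too_far = (df[value_cols] - mean).abs() > n_std * std
    out[value_cols] = df[value_cols].mask(too_far)
    return out


# Hypothetical demo data: Class 8 has 21 rows with one extreme value,
# Class 11 has only 3 rows.
demo = pd.DataFrame({
    "Class": [8] * 21 + [11] * 3,
    "exp1": [10.0] * 20 + [100.0] + [85.0, 81.0, 190.0],
})
cleaned = mask_outliers(demo, ["Class"], ["exp1"])
```

In this demo only the value 100.0 in Class 8 is replaced by NaN. The 190.0 in Class 11 is kept: in a group of three rows, no value can lie more than about 1.15 sample standard deviations from the group mean, so a 3 std rule never fires on groups that small. The same applies to the six-row example in the question. Grouping by more than one variable only requires passing a longer list, for example `["Class", "Genderm1f2"]`.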