Strange behavior when selecting rows and columns based on a filter. What is going on?

Time: 2019-02-11 15:16:36

Tags: r dataframe

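I have a data.frame with 259 rows (observations) and 164 columns (variables). Below I print only a few observations and variables to give you a rough idea of the kind of data I am working with.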

> head(fp_df_wc)
  individual species  sex svl_cm mass_kg bci smi eggs        island long lat capt_year capt_month season   Mol_1 Mol_2   Mol_3 Mol_4   Mol_5 Mol_6
1        A15       Y <NA>     NA      NA  NA  NA   NA          <NA>   NA  NA      2018   December     nr 0.06406     0 1.79751     0 2.94364     0
2        Ac1       B <NA>     NA      NA  NA  NA   NA          <NA>   NA  NA        NA       <NA>   <NA> 0.31578     0 0.30990     0 0.39433     0
3       Ac11       B <NA>     NA      NA  NA  NA   NA          <NA>   NA  NA        NA       <NA>   <NA> 0.00000     0 0.52960     0 0.87975     0
4        Ac2       B <NA>     NA      NA  NA  NA   NA          <NA>   NA  NA        NA       <NA>   <NA> 0.39010     0 0.46395     0 0.69943     0
5        Ac3       B <NA>     NA      NA  NA  NA   NA          <NA>   NA  NA        NA       <NA>   <NA> 0.00000     0 0.36697     0 0.59648     0
6        Ac4       B <NA>     NA      NA  NA  NA   NA          <NA>   NA  NA        NA       <NA>   <NA> 0.00000     0 0.37882     0 0.73668     0

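Note that there are many more columns and rows; I am only pasting the first few rows and some of the columns here. And yes, the large number of NA values is perfectly normal, and I do not think it is related to my problem (if it is, I am not sure why!).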

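On this data.frame, called fp_df_wc, I need to run some calculations on groups of individuals that share certain attributes. For example, I need to group individuals by their species, mass_kg, and capt_month attributes.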

So I first tried to select all individuals of species P that were collected in June and have a mass greater than 2, using the following code:

fp_df_wc[fp_df_wc$species == "P" & fp_df_wc$capt_month == "June" & fp_df_wc$mass_kg > 2, ]

This works beautifully; here are the first few rows of the selection:

> head(fp_df_wc[fp_df_wc$species == "P" & fp_df_wc$capt_month == "June" & fp_df_wc$mass_kg > 2, ])
   individual species sex svl_cm mass_kg   bci smi eggs  island     long  lat capt_year capt_month season   Mol_1   Mol_2   Mol_3 Mol_4   Mol_5
30       w371       P   M   45.5    4.27 45.33  NA    0    <NA>       NA   NA      2012       June      r 0.22058 0.16373 0.55590     0 0.24355
32       w373       P   F   45.5    3.63 51.65  NA    0    <NA>       NA   NA      2012       June      r 3.86393 0.01546 4.24033     0 1.95668
36       w377       P   F   43.5    4.13 50.17  NA    0    <NA>       NA   NA      2012       June      r 0.00000 0.00000 0.34042     0 0.12530
37       w378       P   M   45.8    4.63 48.19  NA    0    <NA>       NA   NA      2012       June      r 3.81820 0.01919 6.41375     0 2.85729
50       w391       P   M   48.0    5.09 46.03  NA    0    <NA>       NA   NA      2012       June      r 1.13196 0.00000 2.11037     0 0.89921
51       w392       P   M   47.0    4.53 43.63  NA    0    <NA>       NA   NA      2012       June      r 0.00000 0.25263 1.35737     0 0.71165

Now here is where it gets funky! I basically want to do the same thing for another species, called Y, so the only thing I change is the first condition:

fp_df_wc[fp_df_wc$species == "Y" & fp_df_wc$capt_month == "June" & fp_df_wc$mass_kg > 2, ]

But this does not work well at all. In fact, this is what I get:

> head(fp_df_wc[fp_df_wc$species == "Y" & fp_df_wc$capt_month == "June" & fp_df_wc$mass_kg > 2, ])
     individual species  sex svl_cm mass_kg bci smi eggs island long lat capt_year capt_month season Mol_1 Mol_2 Mol_3 Mol_4 Mol_5 Mol_6 Mol_7 Mol_8
NA         <NA>    <NA> <NA>     NA      NA  NA  NA   NA   <NA>   NA  NA        NA       <NA>   <NA>    NA    NA    NA    NA    NA    NA    NA    NA
NA.1       <NA>    <NA> <NA>     NA      NA  NA  NA   NA   <NA>   NA  NA        NA       <NA>   <NA>    NA    NA    NA    NA    NA    NA    NA    NA
NA.2       <NA>    <NA> <NA>     NA      NA  NA  NA   NA   <NA>   NA  NA        NA       <NA>   <NA>    NA    NA    NA    NA    NA    NA    NA    NA
NA.3       <NA>    <NA> <NA>     NA      NA  NA  NA   NA   <NA>   NA  NA        NA       <NA>   <NA>    NA    NA    NA    NA    NA    NA    NA    NA
NA.4       <NA>    <NA> <NA>     NA      NA  NA  NA   NA   <NA>   NA  NA        NA       <NA>   <NA>    NA    NA    NA    NA    NA    NA    NA    NA
NA.5       <NA>    <NA> <NA>     NA      NA  NA  NA   NA   <NA>   NA  NA        NA       <NA>   <NA>    NA    NA    NA    NA    NA    NA    NA    NA

Everything turns into NA, even the row names!!!

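I am not sure what is going on here!! What I expected was the same kind of result as before, but for the Y individuals instead of the P ones.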

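Why does it behave this way? Again, there are many NAs in my data.frame, but they occur in both the P and the Y individuals, so that should not be a problem. Besides, plenty of Y rows meet my selection criteria, so it really should work.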

Thank you very much for your help.

New edit in response to comments

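@Soren, thanks for your input. I understand that var == NA does not return anything meaningful. But I am still not convinced that this is my case. If I look at the variables I use to select individuals, none of them has NA values (see below):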

> fp_df_wc[30:241, c(2, 3, 5, 13)]
    species sex mass_kg capt_month
30        P   M    4.27       June
31        Y   M    5.82       June
32        P   F    3.63       June
33        Y   F    3.89       June
34        Y   F    4.66       June
35        Y   F    5.29       June
36        P   F    4.13       June
37        P   M    4.63       June
38        Y   M    7.09       June
39        Y   M    4.82       June
40        Y   F    3.04       June
41        Y   F    3.88       June
42        Y   F    4.24       June
43        Y   F    3.40       June
44        Y   F    4.07       June
45        Y   F    4.90       June
46        Y   M    7.03       June
47        Y   F    3.95       June
48        Y   M    7.64       June
49        Y   M    6.96       June
50        P   M    5.09       June
51        P   M    4.53       June
52        P   F    4.65       June
53        P   M    5.43       June
54        Y   F    4.65       June
55        P   M    6.22       June
56        P   M    5.16       June
57        P   F    3.67       June
58        P   F    4.00       June
59        P   F    3.84       June
60        P   M    5.27       June
61        P   M    6.14       June
62        Y   M    7.20       June
63        P   M    5.85       June
64        P   F    2.84       June
65        P   M    6.18       June
66        P   M    6.33       June
67        P   M    5.56       June
68        P   M    5.38       June
69        P   M    5.70       June
70        P   M    6.44       June
71        Y   F    5.52       June
72        Y   M    6.16       June
73        Y   F    5.80       June
74        Y   M    7.40       June
75        Y   M    6.94       June
76        Y   M    8.30       June
77        Y   M    7.62       June
78        P   M    4.92       June
79        P   M    5.80       June
80        P   M    4.94       June
81        P   M    5.28       June
82        P   F    3.67       June
83        P   F    4.33       June
84        P   M    6.23       June
85        P   F    3.51       June
86        P   F    3.58       June
87        P   M    6.11       June
88        P   M    4.96       June
89        Y   M    7.80       June
90        Y   M    6.56       June
91        Y   M    6.19       June
92        Y   F    4.17       June
93        Y   F    4.76       June
94        P   M    4.98       June
95        P   M    5.34       June
96        P   M    5.16       June
97        P   M    5.58       June
98        P   M    4.97       June
99        Y   M    6.20       June
100       P   M    5.32       June
101       P   F    3.74       June
102       P   M    5.45       June
103       Y   F    6.24       June
104       P   F    4.60       June
105       Y   M    7.24       June
106       P   M    5.40       June
107       Y   M    6.61       June
108       P   M    6.80       June
109       Y   M    6.66       June
110       Y   F    4.02       June
111       Y   M    5.96       June
112       Y   F    4.10       June
113       P   F    3.88       June
114       P   M    4.60       June
115       Y   M    5.94       June
116       P   M    4.73       June
117       Y   M    6.75       June
118       P   M    5.71       June
119       Y   M    8.55       June
120       Y   M    6.55       June
121       P   M    6.45       June
122       P   F    4.16       June
123       P   M    6.54       June
124       Y   F    3.88       June
125       P   M    5.39       June
126       Y   M    6.71       June 
127       P   F    3.41       June
128       Y   M    6.71       June
129       Y   F    4.26       June
130       Y   F    3.45       June
131       Y   F    3.74       June
132       P   F    3.34       June
133       Y   M    6.10       June
134       Y   F    4.85       June
135       Y   F    5.14       June
136       Y   M    6.80       June
137       Y   M    6.30       June
138       Y   M    6.90       June
139       P   M    5.27       June
140       Y   M    6.72       June
141       P   M    4.31       June
142       P   M    2.84       June
143       P   M    4.42       June
144       P   M    4.96       June
145       Y   F    4.49       June
146       P   M    5.40       June
147       P   M    5.48       June
148       P   M    5.90       June
149       P   M    5.53       June
150       P   M    6.42       June
151       Y   F    3.56       June
152       Y   M    6.47       June
153       P   M    5.59       June
154       P   M    5.40       June
155       P   M    5.26       June
156       Y   M    7.29       June
157       Y   M    7.16       June
158       Y   M    6.56       June
159       Y   M    7.33       June
160       P   F    4.62       June
161       Y   M    6.13       June
162       Y   M    5.12       June
163       P   F    3.50       June
164       P   M    5.67       June
165       P   F    3.29       June
166       P   J    1.41       June
167       Y   M    4.84       June
168       P   M    5.27       June
169       P   M    5.91       June
170       Y   F    4.75       June
171       Y   F    4.25       June
172       P   F    3.59       June
173       P   M    3.98       June
174       P   F    3.56       June
175       Y   F    3.88       June
176       Y   F    4.39       June
177       Y   M    5.45       June
178       Y   M    5.50       June
179       Y   F    3.16       June
180       Y   F    3.60       June
181       P   F    2.68       June
182       Y   M    6.25       June
183       Y   M    7.10       June
184       Y   F    5.22       June
185       Y   M    4.30       June
186       Y   F    4.33       June
187       P   M    5.59       June
188       P   F    3.65       June
189       P   M    6.47       June
190       P   M    5.61       June
191       Y   M    7.36       June
192       Y   M    8.34       June
193       P   M    4.46       June
194       P   M    5.79       June
195       P   M    5.52       June
196       P   M    5.69       June
197       P   F    4.16       June
198       P   M    5.49       June
199       P   M    5.13       June
200       P   M    6.25       June
201       P   M    4.97       June
202       Y   M    6.88       June
203       P   F    3.99       June
204       Y   M    6.92       June
205       Y   M    6.50       June
206       P   M    4.25       June
207       P   F    3.49       June
208       P   M    5.72       June
209       P   M    5.65       June
210       Y   M    7.34       June
211       Y   M    7.25       June
212       P   F    3.62       June
213       P   F    4.02       June
214       P   F    4.90       June
215       P   F    3.66       June  
216       Y   F    3.65       June
217       Y   F    4.90       June
218       Y   M    6.75       June
219       P   F    3.64       June
220       Y   M    7.22       June
221       Y   M    7.43       June
222       Y   M    7.23       June
223       Y   M    7.32       June
224       P   F    3.71       June
225       P   F    4.26       June
226       Y   F    6.32       June
227       P   F    4.61       June
228       Y   F    4.71       June
229       Y   M    6.33       June
230       Y   M    6.70       June
231       P   M    4.90       June
232       P   F    3.60       June
233       P   F    3.74       June
234       Y   F    3.76       June
235       Y   F    4.45       June
236       Y   F    4.45       June
237       P   F    3.95       June
238       Y   F    3.90       June
239       Y   F    4.12       June
240       Y   F    4.79       June
241       Y   F    3.87       June 

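In this case, it happens that all the animals are arranged in order in the database, so I can run the analysis anyway, but I am trying to understand why my earlier code did not work.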

@Ronak, your suggestion does work, but I still do not understand why the previous line of code fails.

Second edit

> sapply(fp_df_wc,function(x) { sum(is.na(x))})
individual    species        sex     svl_cm    mass_kg        bci        smi       eggs     island 
         0          2         11         28         28         51        259        107          7 
  long        lat  capt_year capt_month     season      Mol_1      Mol_2      Mol_3      Mol_4 
    34         34          5         26         26          0          0          0          0 
 Mol_5      Mol_6      Mol_7      Mol_8      Mol_9     Mol_10     Mol_11     Mol_12     Mol_13 
     0          0          0          0          0          0          0          0          0 
Mol_14     Mol_15     Mol_16     Mol_17     Mol_18     Mol_19     Mol_20     Mol_21     Mol_22 
     0          0          0          0          0          0          0          0          0 
Mol_23     Mol_24     Mol_25     Mol_26     Mol_27     Mol_28     Mol_29     Mol_30     Mol_31 
     0          0          0          0          0          0          0          0          0 
Mol_32     Mol_33     Mol_34     Mol_35     Mol_36     Mol_37     Mol_38     Mol_39     Mol_40 
     0          0          0          0          0          0          0          0          0 
Mol_41     Mol_42     Mol_43     Mol_44     Mol_45     Mol_46     Mol_47     Mol_48     Mol_49 
     0          0          0          0          0          0          0          0          0 
Mol_50     Mol_51     Mol_52     Mol_53     Mol_54     Mol_55     Mol_56     Mol_57     Mol_58 
     0          0          0          0          0          0          0          0          0 
Mol_59     Mol_60     Mol_61     Mol_62     Mol_63     Mol_64     Mol_65     Mol_66     Mol_67 
     0          0          0          0          0          0          0          0          0 
Mol_68     Mol_69     Mol_70     Mol_71     Mol_72     Mol_73     Mol_74     Mol_75     Mol_76 
     0          0          0          0          0          0          0          0          0 
Mol_77     Mol_78     Mol_79     Mol_80     Mol_81     Mol_82     Mol_83     Mol_84     Mol_85 
     0          0          0          0          0          0          0          0          0 
Mol_86     Mol_87     Mol_88     Mol_89     Mol_90     Mol_91     Mol_92     Mol_93     Mol_94 
     0          0          0          0          0          0          0          0          0 
Mol_95     Mol_96     Mol_97     Mol_98     Mol_99    Mol_100    Mol_101    Mol_102    Mol_103 
     0          0          0          0          0          0          0          0          0 
Mol_104    Mol_105    Mol_106    Mol_107    Mol_108    Mol_109    Mol_110    Mol_111    Mol_112 
      0          0          0          0          0          0          0          0          0 
Mol_113    Mol_114    Mol_115    Mol_116    Mol_117    Mol_118    Mol_119    Mol_120    Mol_121 
      0          0          0          0          0          0          0          0          0 
Mol_122    Mol_123    Mol_124    Mol_125    Mol_126    Mol_127    Mol_128    Mol_129    Mol_130 
      0          0          0          0          0          0          0          0          0 
Mol_131    Mol_132    Mol_133    Mol_134    Mol_135    Mol_136    Mol_137    Mol_138    Mol_139 
      0          0          0          0          0          0          0          0          0 
Mol_140    Mol_141    Mol_142    Mol_143    Mol_144    Mol_145    Mol_146    Mol_147    Mol_148 
      0          0          0          0          0          0          0          0          0 
Mol_149    Mol_150 
      0          0 
> sapply(fp_df_wc[30:241, c(2, 3, 5, 13)], function(x) {sum(is.na(x))})
species        sex    mass_kg capt_month 
      0          0          0          0 

1 Answer:

Answer 0: (score: 1)

You get this result because variable == NA returns NA: NA is not equal to anything, not even to itself. Demonstration:

  x <- data.frame(n=c(1:10),var=c(rep(NA,5),c(1:5)))
  x[x$var==4,]

This returns every row where var == 4 is TRUE, plus one all-NA row for each NA in var.
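To see where the NA rows come from, it can help to look at the logical index on its own. The sketch below reuses the toy data frame from the demonstration; the names idx and res are introduced here purely for illustration.

```r
# Same toy data frame as in the demonstration above
x <- data.frame(n = 1:10, var = c(rep(NA, 5), 1:5))

# The comparison itself is NA wherever var is NA
idx <- x$var == 4
print(idx)

# Each NA in a logical row index yields an all-NA row, so the result
# holds five NA rows plus the single real match (row 9)
res <- x[idx, ]
print(nrow(res))          # 6
print(sum(is.na(res$n)))  # 5
```

The row names of the NA rows come out as NA, NA.1, NA.2 and so on, which matches what the question shows.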

There are several possible solutions. To guard against NA values in the equality test, do the following:

  x[!is.na(x$var) & x$var==4,]

Or use the subset function for data frames, which treats NA as FALSE (if no value evaluates to TRUE, it returns a data frame with nrow() == 0):

  subset(x, var == 4)

Or drop the NAs after they have been produced (this may give unexpected results, because every row containing any NA value is dropped):

  na.omit(x[x$var==4,])
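The question combines three conditions with &, and that may be why the P filter looked fine while the Y filter did not. In R, FALSE & NA is FALSE, but TRUE & NA is NA. A row with NA in mass_kg or capt_month is therefore dropped quietly when the species test is FALSE and turns into an NA row when the species test is TRUE. The sketch below uses a small made-up data frame (df is hypothetical, not the asker's data) and also shows which() as one more way to keep NA out of the index.

```r
# Hypothetical toy data, not the asker's real data frame
df <- data.frame(
  species    = c("Y", "P", "Y", "P"),
  capt_month = c(NA, "June", "June", "June"),
  mass_kg    = c(NA, 4.27, 5.82, 3.63),
  stringsAsFactors = FALSE
)

# Three-valued logic: FALSE & NA is FALSE, but TRUE & NA is NA
print(FALSE & NA)
print(TRUE & NA)

keep_p <- df$species == "P" & df$capt_month == "June" & df$mass_kg > 2
keep_y <- df$species == "Y" & df$capt_month == "June" & df$mass_kg > 2

print(keep_p)  # FALSE TRUE FALSE TRUE
print(keep_y)  # NA FALSE TRUE FALSE

# which() keeps only the positions that are TRUE, so NA never reaches the index
y_rows <- df[which(keep_y), ]
print(y_rows)
```

Note that the filter is evaluated over every row of the data frame, so NAs outside the rows that were inspected by eye can still end up in the result.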