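I have a data.frame with 259 rows (observations) and 164 columns (variables). Below I print only a few observations and variables to give you a rough idea of the kind of data I am working with.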
> head(fp_df_wc)
individual species sex svl_cm mass_kg bci smi eggs island long lat capt_year capt_month season Mol_1 Mol_2 Mol_3 Mol_4 Mol_5 Mol_6
1 A15 Y <NA> NA NA NA NA NA <NA> NA NA 2018 December nr 0.06406 0 1.79751 0 2.94364 0
2 Ac1 B <NA> NA NA NA NA NA <NA> NA NA NA <NA> <NA> 0.31578 0 0.30990 0 0.39433 0
3 Ac11 B <NA> NA NA NA NA NA <NA> NA NA NA <NA> <NA> 0.00000 0 0.52960 0 0.87975 0
4 Ac2 B <NA> NA NA NA NA NA <NA> NA NA NA <NA> <NA> 0.39010 0 0.46395 0 0.69943 0
5 Ac3 B <NA> NA NA NA NA NA <NA> NA NA NA <NA> <NA> 0.00000 0 0.36697 0 0.59648 0
6 Ac4 B <NA> NA NA NA NA NA <NA> NA NA NA <NA> <NA> 0.00000 0 0.37882 0 0.73668 0
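Please note that there are many more columns and rows; I am only pasting the first few rows and some of the columns here. And yes, the fact that there are many NA values is perfectly normal, and I don't think it is related to my problem (if it is, I'm not sure why!).
On this data.frame, called fp_df_wc, I need to perform some calculations on groups of individuals that share certain attributes. For example, I need to group individuals by their species, mass_kg and capt_month attributes.
So I first tried to select all individuals of species P that were collected in June and have a mass greater than 2, using the following code: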
fp_df_wc[fp_df_wc$species == "P" & fp_df_wc$capt_month == "June" & fp_df_wc$mass_kg > 2, ]
This works beautifully; here are the first few rows of the selection:
> head(fp_df_wc[fp_df_wc$species == "P" & fp_df_wc$capt_month == "June" & fp_df_wc$mass_kg > 2, ])
individual species sex svl_cm mass_kg bci smi eggs island long lat capt_year capt_month season Mol_1 Mol_2 Mol_3 Mol_4 Mol_5
30 w371 P M 45.5 4.27 45.33 NA 0 <NA> NA NA 2012 June r 0.22058 0.16373 0.55590 0 0.24355
32 w373 P F 45.5 3.63 51.65 NA 0 <NA> NA NA 2012 June r 3.86393 0.01546 4.24033 0 1.95668
36 w377 P F 43.5 4.13 50.17 NA 0 <NA> NA NA 2012 June r 0.00000 0.00000 0.34042 0 0.12530
37 w378 P M 45.8 4.63 48.19 NA 0 <NA> NA NA 2012 June r 3.81820 0.01919 6.41375 0 2.85729
50 w391 P M 48.0 5.09 46.03 NA 0 <NA> NA NA 2012 June r 1.13196 0.00000 2.11037 0 0.89921
51 w392 P M 47.0 4.53 43.63 NA 0 <NA> NA NA 2012 June r 0.00000 0.25263 1.35737 0 0.71165
Now, here is where it gets funky! I basically want to do the same thing for another species, called Y, so the only thing I change is the first condition:
fp_df_wc[fp_df_wc$species == "Y" & fp_df_wc$capt_month == "June" & fp_df_wc$mass_kg > 2, ]
But this does not work nearly as well; in fact, this is what I get:
> head(fp_df_wc[fp_df_wc$species == "Y" & fp_df_wc$capt_month == "June" & fp_df_wc$mass_kg > 2, ])
individual species sex svl_cm mass_kg bci smi eggs island long lat capt_year capt_month season Mol_1 Mol_2 Mol_3 Mol_4 Mol_5 Mol_6 Mol_7 Mol_8
NA <NA> <NA> <NA> NA NA NA NA NA <NA> NA NA NA <NA> <NA> NA NA NA NA NA NA NA NA
NA.1 <NA> <NA> <NA> NA NA NA NA NA <NA> NA NA NA <NA> <NA> NA NA NA NA NA NA NA NA
NA.2 <NA> <NA> <NA> NA NA NA NA NA <NA> NA NA NA <NA> <NA> NA NA NA NA NA NA NA NA
NA.3 <NA> <NA> <NA> NA NA NA NA NA <NA> NA NA NA <NA> <NA> NA NA NA NA NA NA NA NA
NA.4 <NA> <NA> <NA> NA NA NA NA NA <NA> NA NA NA <NA> <NA> NA NA NA NA NA NA NA NA
NA.5 <NA> <NA> <NA> NA NA NA NA NA <NA> NA NA NA <NA> <NA> NA NA NA NA NA NA NA NA
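Everything becomes NA, even the row names!!!
I'm not quite sure what is going on!! What I expected was the same kind of result, for the Y individuals instead of the P ones.
Why does it behave this way? Again, my data.frame has a lot of NAs, but they occur in both P and Y individuals, so that shouldn't be the problem. Besides, plenty of Y rows meet my selection criteria, so it really should work.
Thank you very much for your help.
New edit in response to comments
@Soren, thank you for your input. I understand that var == NA does not evaluate to anything meaningful. But I am still not convinced that this is my case. If I look at the variables I use to select individuals, none of them has NA values (see below):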
> fp_df_wc[30:241, c(2, 3, 5, 13)]
species sex mass_kg capt_month
30 P M 4.27 June
31 Y M 5.82 June
32 P F 3.63 June
33 Y F 3.89 June
34 Y F 4.66 June
35 Y F 5.29 June
36 P F 4.13 June
37 P M 4.63 June
38 Y M 7.09 June
39 Y M 4.82 June
40 Y F 3.04 June
41 Y F 3.88 June
42 Y F 4.24 June
43 Y F 3.40 June
44 Y F 4.07 June
45 Y F 4.90 June
46 Y M 7.03 June
47 Y F 3.95 June
48 Y M 7.64 June
49 Y M 6.96 June
50 P M 5.09 June
51 P M 4.53 June
52 P F 4.65 June
53 P M 5.43 June
54 Y F 4.65 June
55 P M 6.22 June
56 P M 5.16 June
57 P F 3.67 June
58 P F 4.00 June
59 P F 3.84 June
60 P M 5.27 June
61 P M 6.14 June
62 Y M 7.20 June
63 P M 5.85 June
64 P F 2.84 June
65 P M 6.18 June
66 P M 6.33 June
67 P M 5.56 June
68 P M 5.38 June
69 P M 5.70 June
70 P M 6.44 June
71 Y F 5.52 June
72 Y M 6.16 June
73 Y F 5.80 June
74 Y M 7.40 June
75 Y M 6.94 June
76 Y M 8.30 June
77 Y M 7.62 June
78 P M 4.92 June
79 P M 5.80 June
80 P M 4.94 June
81 P M 5.28 June
82 P F 3.67 June
83 P F 4.33 June
84 P M 6.23 June
85 P F 3.51 June
86 P F 3.58 June
87 P M 6.11 June
88 P M 4.96 June
89 Y M 7.80 June
90 Y M 6.56 June
91 Y M 6.19 June
92 Y F 4.17 June
93 Y F 4.76 June
94 P M 4.98 June
95 P M 5.34 June
96 P M 5.16 June
97 P M 5.58 June
98 P M 4.97 June
99 Y M 6.20 June
100 P M 5.32 June
101 P F 3.74 June
102 P M 5.45 June
103 Y F 6.24 June
104 P F 4.60 June
105 Y M 7.24 June
106 P M 5.40 June
107 Y M 6.61 June
108 P M 6.80 June
109 Y M 6.66 June
110 Y F 4.02 June
111 Y M 5.96 June
112 Y F 4.10 June
113 P F 3.88 June
114 P M 4.60 June
115 Y M 5.94 June
116 P M 4.73 June
117 Y M 6.75 June
118 P M 5.71 June
119 Y M 8.55 June
120 Y M 6.55 June
121 P M 6.45 June
122 P F 4.16 June
123 P M 6.54 June
124 Y F 3.88 June
125 P M 5.39 June
126 Y M 6.71 June
127 P F 3.41 June
128 Y M 6.71 June
129 Y F 4.26 June
130 Y F 3.45 June
131 Y F 3.74 June
132 P F 3.34 June
133 Y M 6.10 June
134 Y F 4.85 June
135 Y F 5.14 June
136 Y M 6.80 June
137 Y M 6.30 June
138 Y M 6.90 June
139 P M 5.27 June
140 Y M 6.72 June
141 P M 4.31 June
142 P M 2.84 June
143 P M 4.42 June
144 P M 4.96 June
145 Y F 4.49 June
146 P M 5.40 June
147 P M 5.48 June
148 P M 5.90 June
149 P M 5.53 June
150 P M 6.42 June
151 Y F 3.56 June
152 Y M 6.47 June
153 P M 5.59 June
154 P M 5.40 June
155 P M 5.26 June
156 Y M 7.29 June
157 Y M 7.16 June
158 Y M 6.56 June
159 Y M 7.33 June
160 P F 4.62 June
161 Y M 6.13 June
162 Y M 5.12 June
163 P F 3.50 June
164 P M 5.67 June
165 P F 3.29 June
166 P J 1.41 June
167 Y M 4.84 June
168 P M 5.27 June
169 P M 5.91 June
170 Y F 4.75 June
171 Y F 4.25 June
172 P F 3.59 June
173 P M 3.98 June
174 P F 3.56 June
175 Y F 3.88 June
176 Y F 4.39 June
177 Y M 5.45 June
178 Y M 5.50 June
179 Y F 3.16 June
180 Y F 3.60 June
181 P F 2.68 June
182 Y M 6.25 June
183 Y M 7.10 June
184 Y F 5.22 June
185 Y M 4.30 June
186 Y F 4.33 June
187 P M 5.59 June
188 P F 3.65 June
189 P M 6.47 June
190 P M 5.61 June
191 Y M 7.36 June
192 Y M 8.34 June
193 P M 4.46 June
194 P M 5.79 June
195 P M 5.52 June
196 P M 5.69 June
197 P F 4.16 June
198 P M 5.49 June
199 P M 5.13 June
200 P M 6.25 June
201 P M 4.97 June
202 Y M 6.88 June
203 P F 3.99 June
204 Y M 6.92 June
205 Y M 6.50 June
206 P M 4.25 June
207 P F 3.49 June
208 P M 5.72 June
209 P M 5.65 June
210 Y M 7.34 June
211 Y M 7.25 June
212 P F 3.62 June
213 P F 4.02 June
214 P F 4.90 June
215 P F 3.66 June
216 Y F 3.65 June
217 Y F 4.90 June
218 Y M 6.75 June
219 P F 3.64 June
220 Y M 7.22 June
221 Y M 7.43 June
222 Y M 7.23 June
223 Y M 7.32 June
224 P F 3.71 June
225 P F 4.26 June
226 Y F 6.32 June
227 P F 4.61 June
228 Y F 4.71 June
229 Y M 6.33 June
230 Y M 6.70 June
231 P M 4.90 June
232 P F 3.60 June
233 P F 3.74 June
234 Y F 3.76 June
235 Y F 4.45 June
236 Y F 4.45 June
237 P F 3.95 June
238 Y F 3.90 June
239 Y F 4.12 June
240 Y F 4.79 June
241 Y F 3.87 June
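In this case, all the animals happen to be stored consecutively in the database, so I can run my analysis anyway, but I am trying to understand why my earlier code does not work.
@Ronak, your suggestion does work, but I still don't understand why the earlier line of code doesn't.
Second edit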
> sapply(fp_df_wc,function(x) { sum(is.na(x))})
individual species sex svl_cm mass_kg bci smi eggs island
0 2 11 28 28 51 259 107 7
long lat capt_year capt_month season Mol_1 Mol_2 Mol_3 Mol_4
34 34 5 26 26 0 0 0 0
Mol_5 Mol_6 Mol_7 Mol_8 Mol_9 Mol_10 Mol_11 Mol_12 Mol_13
0 0 0 0 0 0 0 0 0
Mol_14 Mol_15 Mol_16 Mol_17 Mol_18 Mol_19 Mol_20 Mol_21 Mol_22
0 0 0 0 0 0 0 0 0
Mol_23 Mol_24 Mol_25 Mol_26 Mol_27 Mol_28 Mol_29 Mol_30 Mol_31
0 0 0 0 0 0 0 0 0
Mol_32 Mol_33 Mol_34 Mol_35 Mol_36 Mol_37 Mol_38 Mol_39 Mol_40
0 0 0 0 0 0 0 0 0
Mol_41 Mol_42 Mol_43 Mol_44 Mol_45 Mol_46 Mol_47 Mol_48 Mol_49
0 0 0 0 0 0 0 0 0
Mol_50 Mol_51 Mol_52 Mol_53 Mol_54 Mol_55 Mol_56 Mol_57 Mol_58
0 0 0 0 0 0 0 0 0
Mol_59 Mol_60 Mol_61 Mol_62 Mol_63 Mol_64 Mol_65 Mol_66 Mol_67
0 0 0 0 0 0 0 0 0
Mol_68 Mol_69 Mol_70 Mol_71 Mol_72 Mol_73 Mol_74 Mol_75 Mol_76
0 0 0 0 0 0 0 0 0
Mol_77 Mol_78 Mol_79 Mol_80 Mol_81 Mol_82 Mol_83 Mol_84 Mol_85
0 0 0 0 0 0 0 0 0
Mol_86 Mol_87 Mol_88 Mol_89 Mol_90 Mol_91 Mol_92 Mol_93 Mol_94
0 0 0 0 0 0 0 0 0
Mol_95 Mol_96 Mol_97 Mol_98 Mol_99 Mol_100 Mol_101 Mol_102 Mol_103
0 0 0 0 0 0 0 0 0
Mol_104 Mol_105 Mol_106 Mol_107 Mol_108 Mol_109 Mol_110 Mol_111 Mol_112
0 0 0 0 0 0 0 0 0
Mol_113 Mol_114 Mol_115 Mol_116 Mol_117 Mol_118 Mol_119 Mol_120 Mol_121
0 0 0 0 0 0 0 0 0
Mol_122 Mol_123 Mol_124 Mol_125 Mol_126 Mol_127 Mol_128 Mol_129 Mol_130
0 0 0 0 0 0 0 0 0
Mol_131 Mol_132 Mol_133 Mol_134 Mol_135 Mol_136 Mol_137 Mol_138 Mol_139
0 0 0 0 0 0 0 0 0
Mol_140 Mol_141 Mol_142 Mol_143 Mol_144 Mol_145 Mol_146 Mol_147 Mol_148
0 0 0 0 0 0 0 0 0
Mol_149 Mol_150
0 0
> sapply(fp_df_wc[30:241, c(2, 3, 5, 13)], function(x) {sum(is.na(x))})
species sex mass_kg capt_month
0 0 0 0
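Answer 0 (score: 1)
You get this result because an equality test on a value that is NA returns NA rather than FALSE; NA cannot be compared for equality. Demonstration: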
x <- data.frame(n=c(1:10),var=c(rep(NA,5),c(1:5)))
x[x$var==4,]
This returns every row where var == 4 is TRUE, plus one all-NA row for every row where var is NA.
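To make the mechanism visible, the sketch below rebuilds the same x as the demonstration and inspects the logical index before it is used. The names mask and res are illustrative only and do not come from the original post.

```r
# Same toy data as the demonstration: var is NA in the first five rows.
x <- data.frame(n = 1:10, var = c(rep(NA, 5), 1:5))

# Comparing NA with anything yields NA, so the mask holds TRUE, FALSE and NA.
mask <- x$var == 4
print(mask)

# Each NA in the mask produces one all-NA row in the result,
# with row names of the form "NA", "NA.1", "NA.2", ...
res <- x[mask, ]
print(res)
print(rownames(res))
```

Checking sum(is.na(mask)) on the real filter is a quick way to see how many all-NA rows to expect.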
There are several possible solutions. To guard against NA values in the equality test:
x[!is.na(x$var) & x$var==4,]
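Two other base R idioms have the same effect as the explicit is.na() guard and may read more compactly. This is a sketch on the same toy x; by_which and by_in are names made up for illustration.

```r
x <- data.frame(n = 1:10, var = c(rep(NA, 5), 1:5))

# which() returns only the positions where the test is TRUE; NA is dropped.
by_which <- x[which(x$var == 4), ]

# %in% never returns NA: a missing value simply does not match.
by_in <- x[x$var %in% 4, ]

print(by_which)
print(by_in)
```

Both keep only the single row where var is 4, without any all-NA rows.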
Or use the subset() function for data frames (if no value evaluates to TRUE, it returns a data frame with nrow() == 0):
subset(x,x$var==4)
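Or drop the NA rows after they have been generated (this can produce unexpected results, because any row containing an NA value in any column is dropped):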
na.omit(x[x$var==4,])
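As for why the P filter looked fine while the Y filter did not, a likely explanation is R's three-valued logic. The filter is evaluated over all 259 rows, not only rows 30 to 241, and the sapply() output in the question shows NAs in species, mass_kg and capt_month over the full data. With &, a single FALSE settles the result even if another operand is NA, whereas TRUE & NA stays NA. The sketch below uses a small made-up data frame, toy, as a stand-in for fp_df_wc. It assumes that only some Y rows are missing capt_month or mass_kg, which the question's output suggests but does not confirm.

```r
# Three-valued logic: a FALSE anywhere settles an AND, a TRUE does not.
print(NA & FALSE)
print(NA & TRUE)

# Hypothetical stand-in for fp_df_wc: only Y rows have missing values.
toy <- data.frame(
  species    = c("P", "P", "Y", "Y", "Y"),
  capt_month = c("June", "June", "June", NA, "June"),
  mass_kg    = c(4.27, 3.63, 5.82, NA, NA)
)

mask_p <- toy$species == "P" & toy$capt_month == "June" & toy$mass_kg > 2
mask_y <- toy$species == "Y" & toy$capt_month == "June" & toy$mass_kg > 2

# For P, the incomplete Y rows are FALSE because species == "P" already fails.
print(mask_p)
# For Y, the same rows pass the species test, so the NA survives.
print(mask_y)

# subset() treats NA in the condition as FALSE, so only real matches remain.
y_rows <- subset(toy, species == "Y" & capt_month == "June" & mass_kg > 2)
print(y_rows)
```

If this is what happens in the real data, subset() or a which() wrapper around the original condition should return just the matching Y rows.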