Help with a persistent problem when using the 'subset' function in R

Asked: 2011-07-11 14:24:06

Tags: r extract extraction subset

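I want to use R's subset function to extract smaller groups from a panel-study time series dataset.

My data is a data frame with six columns: district (8 districts), gender, age interval (4 groups), year, month, and a count column.

Example: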

  District Gender Year Month AgeGroupNew TotalDeaths
1 Eastern  Female 2003     1           0           4
2 Eastern  Female 2003     1        01-4           1
3 Eastern  Female 2003     1       05-14           1
4 Eastern  Female 2003     1         15+          91
5 Eastern  Female 2003     2           0           4
6 Eastern  Female 2003     2        01-4           1

I want to extract a smaller subset for each combination of district, gender, and age interval, to get a result like this:

     District  Gender Year Month AgeGroupNew TotalDeaths
     Northern    Male 2003     1        01-4           0
     Northern    Male 2003     2        01-4           1
     Northern    Male 2003     3        01-4           0
     Northern    Male 2003     4        01-4           3
     Northern    Male 2003     5        01-4           4
     Northern    Male 2003     6        01-4           6
     Northern    Male 2003     7        01-4           5
     Northern    Male 2003     8        01-4           0
     Northern    Male 2003     9        01-4           1
     Northern    Male 2003    10        01-4           2
     Northern    Male 2003    11        01-4           0
     Northern    Male 2003    12        01-4           1
     Northern    Male 2004     1        01-4           1
     Northern    Male 2004     2        01-4           0

and so on, through to

     Northern    Male 2006    11        01-4           0
     Northern    Male 2006    12        01-4           0

So far I have been trying the following, thanks to DWin for pointing it out in a previous question.

subset(datNew, subset=(District=="Eastern" &  Gender=="Female" &  AgeGroupNew=="01-4"))
[1] District    Gender      Year        Month       AgeGroupNew TotalDeaths
<0 rows> (or 0-length row.names)

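But R keeps giving me the output above, which it should not.

I have tried other combinations that do work, but using District in subset seems to produce this <0 rows> (or 0-length row.names) result.

This works: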

> head(subset(datNew, Year=="2004" & Month=="8" & AgeGroupNew =="0"))
         District Gender Year Month AgeGroupNew TotalDeaths
77       Eastern  Female 2004     8           0          10
269      Eastern    Male 2004     8           0           6
461  Khayelitsha  Female 2004     8           0          13
653  Khayelitsha    Male 2004     8           0          15
845  Klipfontein  Female 2004     8           0           7
1037 Klipfontein    Male 2004     8           0           6

But this does not:

> head(subset(datNew, District=="Eastern" & Gender=="Female" & AgeGroupNew =="0"))
[1] District    Gender      Year        Month       AgeGroupNew TotalDeaths
<0 rows> (or 0-length row.names)

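What is it about District that causes this? It cannot be right that this combination has 0 rows; as far as I can tell, there is plenty of data.

I have also tried the following, which, judging from other posts, is close to what I am trying to achieve, but it still does not do what I need: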

> head(subset(datNew,datNew[[1]] %in% District[1] & Gender=="Female" & AgeGroupNew=="0"))
   District Gender Year Month AgeGroupNew TotalDeaths
1  Eastern  Female 2003     1           0           4
5  Eastern  Female 2003     2           0           4
9  Eastern  Female 2003     3           0           5
13 Eastern  Female 2003     4           0          12
17 Eastern  Female 2003     5           0           7
21 Eastern  Female 2003     6           0          13

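With this, I cannot select any of the other districts, e.g. "Southern", "Khayelitsha", etc., no matter how I change datNew[[1 or 2 or 3]] or District[[1 or 2 or 3]]. I also do not really understand what %in% is doing above.

I am very confused. Any help would be appreciated.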

1 Answer:

Answer 0: (score: 2)

Prediction: give us the result of str(datNew$District[1]) and all will be revealed. I predict that a non-printing character will show up, probably a trailing space (or two).
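In case it helps, here is a minimal sketch of a few ways to make hidden whitespace visible. The toy datNew below is a hypothetical stand-in for the real data, and the trailing space in its district names is an assumption made purely for illustration; the real data may contain something else.

```r
# Hypothetical stand-in for datNew; the trailing spaces are assumed.
datNew <- data.frame(
  District = c("Eastern ", "Eastern ", "Northern "),
  Gender   = c("Female", "Male", "Male"),
  stringsAsFactors = FALSE
)

# as.character() makes this work whether District is a factor or a character column.
districts <- unique(as.character(datNew$District))

# Printing with quotes shows exactly where each string ends.
print(districts, quote = TRUE)

# Character counts: "Eastern" has 7 letters, so a count of 8 means one extra character.
nchar(districts)

# The plain comparison matches no rows because the stored strings are longer.
sum(datNew$District == "Eastern")
```

If the counts come out one or two higher than the visible length of the names, stray whitespace is the likely culprit.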

So, going by what str(...) shows (here assuming a single trailing space), the correct code would be:

subset(datNew, District=="Eastern " & Gender=="Female" & AgeGroupNew =="0")
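Rather than typing the stray space into every comparison, another option is to strip the whitespace once and then compare against clean names. The sketch below again uses a hypothetical toy datNew with assumed trailing spaces. It also touches on the %in% attempt from the question: inside subset(), District[1] is just the first value of the District column, hidden whitespace included, so that call matches whichever district happens to be in row 1 and nothing else. %in% itself simply tests membership in a set of values.

```r
# Hypothetical stand-in for datNew; the trailing spaces are assumed.
datNew <- data.frame(
  District    = c("Eastern ", "Eastern ", "Northern ", "Northern "),
  Gender      = c("Female", "Male", "Female", "Male"),
  AgeGroupNew = c("0", "0", "0", "01-4"),
  TotalDeaths = c(4, 6, 3, 0),
  stringsAsFactors = FALSE
)

# Strip leading and trailing whitespace once, so later comparisons can use clean names.
datNew$District <- gsub("^\\s+|\\s+$", "", as.character(datNew$District))

# The comparison from the question now matches the expected row.
eastern <- subset(datNew, District == "Eastern" & Gender == "Female" & AgeGroupNew == "0")
nrow(eastern)  # 1

# %in% checks each value against a set, so several districts can be selected at once.
two <- subset(datNew, District %in% c("Eastern", "Northern") & Gender == "Male")
nrow(two)  # 2
```

If the extra characters turn out to be something other than ordinary whitespace, the pattern in gsub() would need adjusting to match them.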