Extracting specific data from hierarchical data in R

Time: 2011-07-09 21:51:12

Tags: r hierarchical-data time-series

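I have a data frame with six columns. Columns 1 to 5 each hold discrete names/values, such as district, year, month, age interval, and gender. The sixth column is the death count for that particular combination.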

               District Gender Year Month Age.Group Total.Deaths
1              Eastern  Female 2003     1        -1            0
2              Eastern  Female 2003     1        -2            2
3              Eastern  Female 2003     1         0            2
4              Eastern  Female 2003     1      01-4            1
5              Eastern  Female 2003     1     05-09            0
6              Eastern  Female 2003     1     10-14            1
7              Eastern  Female 2003     1     15-19            0
8              Eastern  Female 2003     1     20-24            4
9              Eastern  Female 2003     1     25-29            9
10             Eastern  Female 2003     1     30-34            3
11             Eastern  Female 2003     1     35-39            7
12             Eastern  Female 2003     1     40-44            5
13             Eastern  Female 2003     1     45-49            5
14             Eastern  Female 2003     1     50-54            8
15             Eastern  Female 2003     1     55-59            5
16             Eastern  Female 2003     1     60-64            4
17             Eastern  Female 2003     1     65-69            7
18             Eastern  Female 2003     1     70-74            8
19             Eastern  Female 2003     1     75-79            5
20             Eastern  Female 2003     1     80-84           10
21             Eastern  Female 2003     1       85+           11
22             Eastern  Female 2003     2        -1            0
23             Eastern  Female 2003     2        -2            0
24             Eastern  Female 2003     2         0            4
25             Eastern  Female 2003     2      01-4            1
26             Eastern  Female 2003     2     05-09            2
27             Eastern  Female 2003     2     10-14            2
28             Eastern  Female 2003     2     15-19            0

I want to filter or extract smaller data frames from this large one. For example, I want only four age groups, defined as follows:

Group 0: Consisting of Age.Group -1, -2 and 0.
Group 1-4: Consisting of Age.Group 01-4
Group 5-14: Consisting of Age.Group 05-09 and 10-14
Group 15+: Consisting of Age.Group 15-19 to 85+

Total.Deaths would be the sum over each group.

So I would like it to look like this:

               District Gender Year Month Age.Group Total.Deaths
1              Eastern  Female 2003     1         0            4
2              Eastern  Female 2003     1      01-4            1
3              Eastern  Female 2003     1     05-14            1
4              Eastern  Female 2003     1       15+          104
5              Eastern  Female 2003     2         0            4
6              Eastern  Female 2003     2      01-4            1
7              Eastern  Female 2003     2     05-14            4
8              Eastern  Female 2003     2       15+            ...

I have a lot of data and have been searching for days, but I can't find a function that helps with this.

1 answer:

Answer 0 (score: 1)

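There is probably a slick way to recode your age variable using something like recode from the car package, particularly since you have conveniently coded your current age variable with levels that sort well as characters. But with only a few levels, I usually just recode by hand by creating a new age variable. This approach is a good way to simply "get things done" in R: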

#Reading your data in from a text file that I made via copy/paste
dat <- read.table("~/Desktop/soEx.txt",sep="",header=TRUE)

#Make sure Age.Group is ordered and init new age variable
dat$Age.Group <- factor(dat$Age.Group,ordered=TRUE)
dat$AgeGroupNew <- rep(NA,nrow(dat))

#The recoding
dat$AgeGroupNew[dat$Age.Group <= "0"] <- "0"
dat$AgeGroupNew[dat$Age.Group == "01-4"] <- "01-4"
dat$AgeGroupNew[dat$Age.Group >= "05-09" & dat$Age.Group <= "10-14" ] <- "05-14"
dat$AgeGroupNew[dat$Age.Group > "10-14" ] <- "15+"
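The comparisons above rely on the order of the factor levels, which comes from sorting the labels as strings and can vary with the locale's collation rules. As a hedged alternative, here is a minimal base R sketch that assigns groups by explicit membership instead. The helper name `collapse_age` is hypothetical and not part of the original answer, and it assumes the only labels present are the 21 shown in the question, since anything unrecognized falls into "15+".

```r
# Age groups as they appear in the question's data
age <- c("-1", "-2", "0", "01-4", "05-09", "10-14", "15-19", "20-24",
         "25-29", "30-34", "35-39", "40-44", "45-49", "50-54", "55-59",
         "60-64", "65-69", "70-74", "75-79", "80-84", "85+")

# Hypothetical helper (not from the original answer): map each original
# age group to one of the four collapsed groups by explicit membership.
collapse_age <- function(x) {
  x <- as.character(x)                 # accept factors or character vectors
  out <- rep("15+", length(x))         # default: 15-19 up to 85+ (assumes no other labels)
  out[x %in% c("-1", "-2", "0")] <- "0"
  out[x %in% "01-4"] <- "01-4"
  out[x %in% c("05-09", "10-14")] <- "05-14"
  out[is.na(x)] <- NA                  # keep missing values missing
  out
}

age_new <- collapse_age(age)
table(age_new)
```

With the data frame from the answer, this would be applied as `dat$AgeGroupNew <- collapse_age(dat$Age.Group)`.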

Then we can use ddply and summarise (from the plyr package) to generate the summary:

#ddply and summarise come from the plyr package
library(plyr)
datNew <- ddply(dat,.(District,Gender,Year,Month,AgeGroupNew),summarise,
            TotalDeaths = sum(Total.Deaths))
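If plyr is not available, the same kind of grouped sum can be sketched in base R with aggregate(). The example below is only an illustration: it rebuilds the month 1 rows from the question by hand, with the collapsed age group already attached, and note that the result column keeps the name Total.Deaths rather than TotalDeaths.

```r
# Month 1 rows from the question (Eastern, Female, 2003), with the
# collapsed age group typed in directly for illustration.
dat <- data.frame(
  District     = "Eastern",
  Gender       = "Female",
  Year         = 2003,
  Month        = 1,
  AgeGroupNew  = c(rep("0", 3), "01-4", rep("05-14", 2), rep("15+", 15)),
  Total.Deaths = c(0, 2, 2, 1, 0, 1, 0, 4, 9, 3, 7, 5, 5, 8, 5, 4, 7, 8, 5, 10, 11),
  stringsAsFactors = FALSE
)

# Sum Total.Deaths within each combination of the grouping columns
datNew <- aggregate(Total.Deaths ~ District + Gender + Year + Month + AgeGroupNew,
                    data = dat, FUN = sum)

# Look up totals by group name rather than relying on row order
totals <- setNames(datNew$Total.Deaths, datNew$AgeGroupNew)
totals[["15+"]]   # 91
```

The totals for the other three groups (4, 1, and 1) match the desired output in the question.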

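I was worried at first because I get 91 deaths for the 15+ group rather than 104, but counting by hand also gives 91, so I think 91 is correct. The 104 may be a typo.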