Fill down a column with NAs (using base R or data.table)

Asked: 2013-09-17 03:00:37

Tags: r data.table census

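I want to use the Census' county-adjacency data, but I'm stuck getting it into a nice form. The data come in four columns: first county, first code, second county, second code. The first-county column is not repeated on every row; on the rows in between it takes the value "" instead. The way I'm reading it in now, it looks like this: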

                     c1   cd1                    c2   cd2
1   Alamance County, NC 37001   Alamance County, NC 37001
2                          NA    Caswell County, NC 37033
3                          NA    Chatham County, NC 37037
4                          NA   Guilford County, NC 37081
5                          NA     Orange County, NC 37135
6                          NA   Randolph County, NC 37151
7                          NA Rockingham County, NC 37157
8  Alexander County, NC 37003  Alexander County, NC 37003
9                          NA   Caldwell County, NC 37027
10                         NA    Catawba County, NC 37035
11                         NA    Iredell County, NC 37097
12                         NA     Wilkes County, NC 37193
13 Alleghany County, NC 37005  Alleghany County, NC 37005
14                         NA       Ashe County, NC 37009
15                         NA      Surry County, NC 37171
16                         NA     Wilkes County, NC 37193
17                         NA    Grayson County, VA 51077
18     Anson County, NC 37007      Anson County, NC 37007
19                         NA Montgomery County, NC 37123
20                         NA   Richmond County, NC 37153

I happen to be interested only in the North Carolina portion of the data found at that link, part of which is what you see above:

#
nc_cc <- structure(list(c1 = c("Alamance County, NC", "", "", "", "", "", "", "Alexander County, NC", "", "", "", "", "Alleghany County, NC", "", "", "", "", "Anson County, NC", "", ""), cd1 = c(37001L, NA, NA, NA, NA, NA, NA, 37003L, NA, NA, NA, NA, 37005L, NA, NA, NA, NA, 37007L, NA, NA), c2 = c("Alamance County, NC", "Caswell County, NC", "Chatham County, NC", "Guilford County, NC", "Orange County, NC", "Randolph County, NC", "Rockingham County, NC", "Alexander County, NC", "Caldwell County, NC", "Catawba County, NC", "Iredell County, NC", "Wilkes County, NC", "Alleghany County, NC", "Ashe County, NC", "Surry County, NC", "Wilkes County, NC", "Grayson County, VA", "Anson County, NC", "Montgomery County, NC", "Richmond County, NC" ), cd2 = c(37001L, 37033L, 37037L, 37081L, 37135L, 37151L, 37157L, 37003L, 37027L, 37035L, 37097L, 37193L, 37005L, 37009L, 37171L, 37193L, 51077L, 37007L, 37123L, 37153L)), .Names = c("c1", "cd1", "c2", "cd2"), row.names = c(NA, 20L), class = "data.frame")
#

I want a clean adjacency association (the county names are redundant), so my desired output could take any of several forms: data.frame, list, ...

The crude solution I came up with (after a lot of thought) is:

require(data.table)
DT <- data.table(nc_cc)
DT[,list(cd1=cd1[1],cd2),by=cumsum(!is.na(cd1))][,list(cd1,cd2)]

      cd1   cd2
 1: 37001 37001
 2: 37001 37033
 3: 37001 37037
 4: 37001 37081
 5: 37001 37135
 6: 37001 37151
 7: 37001 37157
 8: 37003 37003
 9: 37003 37027
10: 37003 37035
11: 37003 37097
12: 37003 37193
13: 37005 37005
14: 37005 37009
15: 37005 37171
16: 37005 37193
17: 37005 51077
18: 37007 37007
19: 37007 37123
20: 37007 37153

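I've tagged this with data.table because I used it in my solution above, and because I suspect something nice can be done with roll. Honestly, I've never understood the documentation for roll, so I'm hoping to learn something here... So: can this be done better?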

Edit: This question asks the same thing, so I'm revising my question to: "Is there a better way to do this with data.table or base R (since I'm averse to installing more packages)?"

2 answers:

Answer 0 (score: 11)

A pretty standard way of doing this is:

library(data.table)
dt = data.table(nc_cc)

dt[, cd1 := cd1[1], by = cumsum(!is.na(cd1))]
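
Since the question also asks about base R, here is a minimal sketch of the same cumsum idea without data.table. It is not part of the original answer, and the vector `x` is a made-up stand-in for the cd1 column. `cumsum(!is.na(x))` gives each non-NA value, and the NAs that follow it, the same group number, so indexing the non-NA values by that number fills the column down.

```r
# Toy vector standing in for the cd1 column (values are illustrative only)
x <- c(37001L, NA, NA, 37003L, NA, 37005L)

# Group id: increases by one at every non-NA value
grp <- cumsum(!is.na(x))

# Index the non-NA values by group id to carry each one forward
filled <- x[!is.na(x)][grp]

print(grp)
print(filled)
```

This sketch assumes the first element is not NA. A leading NA would get group id 0, and indexing by 0 silently drops that element, so the result would come out shorter than the input.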

Answer 1 (score: 0)

I found the roll solution, based on @Arun's answer!

In my application, using the cumsum approach from @eddi's answer (... and from me, in the statement of the question) is a lot more complicated:

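DT <- data.table(nc_cc)
# add a row index i and key the table on it
setkey(DT[,i:=.I],i)

# roll=TRUE carries the last row with a non-empty c1 forward to each index 1:20;
# the result is then joined back on i and c1/cd1 are updated by reference
DT[
    DT[c1!=""][J(1:20),roll=TRUE][,list(c1,cd1),keyby=i],
    `:=`(c1=i.c1,cd1=i.cd1)
]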

I learned about i.name from @eddi's answer to another question of mine.
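
For readers who find roll as opaque as the question describes, a smaller sketch may help. This one is only illustrative: the table `known` and its column names are made up, and it assumes the data.table package is installed. With `roll = TRUE`, a lookup key that has no exact match falls back to the row with the nearest earlier key, which is what fills the gaps.

```r
library(data.table)

# Toy table: only row indices 1 and 4 carry a value (names are illustrative)
known <- data.table(i = c(1L, 4L), val = c("a", "b"), key = "i")

# Look up every row index 1..6; roll = TRUE falls back to the most recent
# earlier key whenever there is no exact match
filled <- known[J(1:6), roll = TRUE]

print(filled)
```

Indices 2 and 3 pick up the value stored at index 1, and indices 5 and 6 pick up the value stored at index 4, which is the same fill-down behaviour the cumsum grouping produces.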