Question

I have a data.table in R, loaded from a database, that looks like this:

date,identifier,description,location,value1,value2
2014-03-01,1,foo,1,100,200
2014-03-01,1,foo,2,200,300
2014-04-01,1,foo,1,100,200
2014-04-01,1,foo,2,100,200
2014-05-01,1,foo,1,100,200
2014-05-01,1,foo,2,100,200
2014-03-01,2,bar,1,100,200
2014-04-01,2,bar,1,100,200
2014-05-01,2,bar,1,100,200
2014-03-01,3,baz,1,100,200
2014-03-01,3,baz,2,200,300
2014-04-01,3,baz,1,100,200
2014-04-01,3,baz,2,100,200
2014-05-01,3,baz,1,100,200
2014-05-01,3,baz,2,100,200
2014-05-01,4,quux,2,100,200
<SNIP>

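In order to do some calculations on the data, I want to massage it so that the table has a row for every combination of date, identifier, description and location, with NA for value1 and value2 where no data exists. I know the range of the dates and all possible values of location.

I'm new to R and data.table, and I'm having trouble wrapping my head around this one. The result I'd like to get for the example table above is: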

date,identifier,description,location,value1,value2
2014-03-01,1,foo,1,100,200
2014-03-01,1,foo,2,200,300
2014-04-01,1,foo,1,100,200
2014-04-01,1,foo,2,100,200
2014-05-01,1,foo,1,100,200
2014-05-01,1,foo,2,100,200
2014-03-01,2,bar,1,100,200
2014-03-01,2,bar,2,NA,NA
2014-04-01,2,bar,1,100,200
2014-04-01,2,bar,2,NA,NA
2014-05-01,2,bar,1,100,200
2014-05-01,2,bar,2,NA,NA
2014-03-01,3,baz,1,100,200
2014-03-01,3,baz,2,200,300
2014-04-01,3,baz,1,100,200
2014-04-01,3,baz,2,100,200
2014-05-01,3,baz,1,100,200
2014-05-01,3,baz,2,100,200
2014-03-01,4,quux,1,NA,NA
2014-03-01,4,quux,2,NA,NA
2014-04-01,4,quux,1,NA,NA
2014-04-01,4,quux,2,NA,NA
2014-05-01,4,quux,1,NA,NA
2014-05-01,4,quux,2,100,200

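The data in the database is sparse: a given identifier/description/location combination can have any number of entries for each date, or none at all. I want each identifier/description and location to have a row in the table for every date in a given range (e.g. 2014-03-01 to 2014-05-01).

It seems like there is some neat data.table trick that would do this, but I'm blanking.

Edit: I've done this on a smaller scale for a single identifier/description by merging with another data.table, but I don't know how to do it with the added complexity of multiple identifiers/descriptions and locations.

Thanks very much for any replies.

Here is the dput output of the original data, which can easily be copied into R: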

structure(list(date = structure(c(1L, 1L, 2L, 2L, 3L, 3L, 1L, 2L, 3L, 1L, 1L, 2L, 2L, 3L, 3L, 3L), 
.Label = c("2014-03-01", "2014-04-01", "2014-05-01"), class = "factor"), 
identifier = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L),     
description = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 4L), 
.Label = c("bar", "baz", "foo", "quux"), class = "factor"), 
location = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L), 
value1 = c(100L, 200L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 200L, 100L, 100L, 100L, 100L, 100L), 
value2 = c(200L, 300L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 300L, 200L, 200L, 200L, 200L, 200L)), 
.Names = c("date", "identifier", "description", "location", "value1", "value2"), 
row.names = c(NA, -16L),
class = c("data.table", "data.frame"))

Answer 1

With help from @akrun and @eddi, here is the idiomatic (?) way:

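# DT0 is the question's data.table; mycols are the columns to expand over
mycols <- c("description", "date", "location")
setkeyv(DT0, mycols)
# Cross join (CJ) the unique values of each key column, then join DT0 against it;
# combinations absent from DT0 come back with NA in the remaining columns
DT1 <- DT0[J(do.call(CJ, lapply(mycols, function(x) unique(get(x)))))]
# alternately: DT1 <- DT0[DT0[, do.call(CJ, lapply(.SD, unique)), .SDcols = mycols]]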

The new rows are missing a value in the identifier column, but it can be filled in:

# Look up each description's identifier in DT0 and assign it by reference
setkey(DT1, description)
DT1[unique(DT0[, c("description", "identifier"), with = FALSE]), identifier := i.identifier]
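As a cross-check on the approach above, here is a minimal self-contained sketch that keeps each identifier paired with its description from the start, so no fill-in step is needed afterwards. It is not the answer author's code. The names `csv`, `ids`, `grid` and `full` are made up for illustration, and the data is the sample from the question, read as plain text with dates kept as character strings.

```r
library(data.table)

# Sample rows copied from the question (dates are kept as character strings)
csv <- "date,identifier,description,location,value1,value2
2014-03-01,1,foo,1,100,200
2014-03-01,1,foo,2,200,300
2014-04-01,1,foo,1,100,200
2014-04-01,1,foo,2,100,200
2014-05-01,1,foo,1,100,200
2014-05-01,1,foo,2,100,200
2014-03-01,2,bar,1,100,200
2014-04-01,2,bar,1,100,200
2014-05-01,2,bar,1,100,200
2014-03-01,3,baz,1,100,200
2014-03-01,3,baz,2,200,300
2014-04-01,3,baz,1,100,200
2014-04-01,3,baz,2,100,200
2014-05-01,3,baz,1,100,200
2014-05-01,3,baz,2,100,200
2014-05-01,4,quux,2,100,200"
DT0 <- as.data.table(read.csv(text = csv, stringsAsFactors = FALSE))

# Every identifier/description pair that actually occurs in the data
ids <- unique(DT0[, .(identifier, description)])

# Every date x location combination (hypothetical helper table)
grid <- CJ(date = DT0$date, location = DT0$location, unique = TRUE)

# Repeat the whole date/location grid once per identifier/description pair
full <- ids[, as.list(grid), by = .(identifier, description)]

# Join the data onto the full grid; unmatched rows get NA in value1/value2
DT1 <- DT0[full, on = c("date", "identifier", "description", "location")]
print(DT1)
```

With the sample data this gives 4 pairs x 3 dates x 2 locations = 24 rows, of which the 8 combinations missing from the input carry NA in value1 and value2. If the known date range or location set is wider than what appears in the data, pass those vectors to `CJ` instead of the observed columns.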

Answer 2

If I've understood the question correctly, and using only base R rather than anything data.table-specific:

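# foo is the question's data as a plain data.frame
# The fields for whose every combination we require a row
unique.fields <- c("date", "identifier", "description", "location")
# lapply (not sapply) always returns a list, which expand.grid expands column by column
filler <- expand.grid(lapply(foo[unique.fields], unique))
# Left join: combinations absent from foo get NA in value1 and value2
merge(filler, foo, by = unique.fields, all.x = TRUE)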
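One caveat with expanding all four fields independently: identifier and description are crossed with each other, so the grid also contains pairings that never occur in the data (identifier 1 with bar, for instance). The following is a hedged base R sketch that keeps the observed identifier/description pairs together instead. The names `csv`, `ids`, `grid` and `result` are illustrative, and the data is a six-row subset of the question's sample.

```r
# A small subset of the question's rows, as a plain data.frame
csv <- "date,identifier,description,location,value1,value2
2014-03-01,1,foo,1,100,200
2014-03-01,1,foo,2,200,300
2014-03-01,2,bar,1,100,200
2014-04-01,2,bar,1,100,200
2014-05-01,2,bar,1,100,200
2014-05-01,4,quux,2,100,200"
foo <- read.csv(text = csv, stringsAsFactors = FALSE)

# Identifier/description pairs that actually occur
ids <- unique(foo[, c("identifier", "description")])

# All date x location combinations
grid <- expand.grid(date = unique(foo$date),
                    location = unique(foo$location),
                    stringsAsFactors = FALSE)

# merge() with by = NULL returns the Cartesian product of its inputs
filler <- merge(ids, grid, by = NULL)

# Left join: combinations absent from foo get NA in value1 and value2
result <- merge(filler, foo,
                by = c("date", "identifier", "description", "location"),
                all.x = TRUE)

# merge() sorts by the join columns; reorder to group rows per identifier
result <- result[order(result$identifier, result$date, result$location), ]
print(result)
```

Here 3 pairs x 3 dates x 2 locations gives 18 rows, 12 of which are filled with NA.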

Fill in missing rows with R data.table
