Question

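I have a database that is missing the first row for each unique identifier. Basically, I need to add a new row of 0s for each unique id.

My database looks like this (I have more than a million rows, so looping is basically out of the question).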

dt = as.data.frame( rbind(c('A1', '15', '1'), 
  c('A1', '17', '2'), 
  c('A1', '12', '3'), 
  c('B1', '3', '1'), 
  c('B1', '4', '2'), 
  c('B1', '15', '3')))

colnames(dt) = c('id', 'activity', 'time')

For each id, I need to add a row of 0s at time 0.

The following lines of code work, but they take far too long on my database.

IdUnique = length(unique(dt$id))
VeK = vector('list',  IdUnique)
for(i in 1:IdUnique){  
  row0 = matrix(0, nrow = 1, ncol = ncol(dt), dimnames = list(unique(dt$id)[i], colnames(dt)))
  VeK[[i]] = rbind(row0, subset(dt, id == unique(dt$id)[i]) )
  VeK[[i]][,'id'] <- unique(dt$id)[i]
}

dt2 <- do.call("rbind", VeK)

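I wonder whether there is a more economical solution, such as merging by row while controlling for the id. But I could not figure out how to do it.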

mat = matrix(0, nrow = length(unique(dt$id)), ncol = ncol (dt) ) 
colnames(mat) <- colnames(dt)

mat[, 'id'] <- as.character(unique(dt$id))
mat <- as.data.frame(mat)

merge(mat, dt, by = 'id' )

Is there any solution that merges by row while controlling for the identifier?

Answer 1

Try:

library(dplyr)
dt %>% 
  group_by(id) %>% 
  summarise(activity = 0, time = 0) %>% 
  merge(., dt, all = TRUE) %>%
  arrange(id, time)

Or:

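# across() needs dplyr >= 1.0; it stands in for the deprecated summarise_each()/funs()
dt %>% 
  group_by(id) %>% 
  summarise(across(everything(), ~ "0")) %>% 
  full_join(dt, by = c("id", "activity", "time")) %>%
  arrange(id, time)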

Which gives:

#  id activity time
#1 A1        0    0
#2 A1       15    1
#3 A1       17    2
#4 A1       12    3
#5 B1        0    0
#6 B1        3    1
#7 B1        4    2
#8 B1       15    3

Afterwards, if you want to convert the activity and time columns to numeric, you can add:

... %>% mutate(across(-id, ~ type.convert(as.character(.x), as.is = TRUE)))
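
To see what that conversion step does on its own, here is a minimal base R sketch. The small data frame `x` is a made-up stand-in for the string-typed columns in the question, not the real data.

```r
# Hypothetical stand-in for the question's dt: every column stored as text
x <- data.frame(id = c("A1", "B1"),
                activity = c("15", "3"),
                time = c("1", "1"),
                stringsAsFactors = FALSE)

# Convert every column except id; as.is = TRUE keeps real strings as character
x[-1] <- lapply(x[-1], type.convert, as.is = TRUE)

# Inspect the resulting column classes
sapply(x, class)
```

After this, activity and time are integer columns while id stays character, so sorting by time behaves numerically rather than alphabetically.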

Update

If the column classes in your original dt had not been deliberately made to differ, this would be easier:

dt <- data.frame(id = c(rep("A1", 3), rep("B1", 3)),
                 activity = c(15,17,12,3,4,15),
                 time = rep(1:3, 2))

library(dplyr)
dt %>% 
  group_by(id) %>% 
  summarise(activity = 0, time = 0) %>% 
  full_join(., dt) %>%
  arrange(id, time)

Answer 2

dt = as.data.frame( rbind(c('A1', '15', '1'), 
                          c('A1', '17', '2'), 
                          c('A1', '12', '3'), 
                          c('B1', '3', '1'), 
                          c('B1', '4', '2'), 
                          c('B1', '15', '3')
                          ))

colnames(dt) = c('id', 'activity', 'time')
# Collect the unique ids that each need a row of zeros
# (unique() works whether id is a factor or a character column)
ids <- unique(as.character(dt$id))

# Build the new rows we need to append to our data frame `dt`
zeros <- data.frame(id = ids, activity = '0', time = '0')

# Then simply bind these rows to the data frame
dt <- rbind(dt, zeros)

# If you want to order the final result
# (time is stored as text here, so convert it for sorting)
dt <- dt[order(dt$id, as.numeric(as.character(dt$time))), ]

For the ordering you can also use the data.table library, which should be faster than ordering in base R on data of this size.
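
As a rough sketch of that data.table route, assuming the package is installed and that activity and time are already numeric (unlike the string columns in the question), the zero rows can be stacked with `rbindlist()` and sorted in place with `setorder()`. The names `zeros` and `res` are just illustrative.

```r
library(data.table)

# Toy data with numeric columns (an assumption; the question stores them as text)
DT <- data.table(id = rep(c("A1", "B1"), each = 3),
                 activity = c(15, 17, 12, 3, 4, 15),
                 time = rep(1:3, 2))

# One row of zeros per unique id
zeros <- data.table(id = unique(DT$id), activity = 0, time = 0L)

# Stack the two tables, then sort by reference (no copy of the data is made)
res <- rbindlist(list(DT, zeros))
setorder(res, id, time)
res
```

Because `setorder()` reorders the table by reference, it avoids the extra copy that `dt[order(...), ]` creates, which may matter with more than a million rows.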

Answer 3

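First, I guess you have to convert dt so that activity and time are of class int rather than factor:

dt[] <- lapply(dt, function(x) type.convert(as.character(x), as.is = TRUE))

Then you can use data.table:

library(data.table)
DT <- as.data.table(dt)
# Prepend a 0 to every non-grouping column within each id
DT[, lapply(.SD, function(x) c(0, x)), by = id]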
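
For a quick sanity check of this grouped approach, the sketch below runs it on toy data built directly as integers (an assumption, so the conversion step is skipped) and confirms that each id gained exactly one leading row. Using `0L` rather than `0` is a small variation that keeps the columns integer.

```r
library(data.table)

# Toy data, built with integer columns so no type conversion is needed
DT <- data.table(id = rep(c("A1", "B1"), each = 3),
                 activity = c(15L, 17L, 12L, 3L, 4L, 15L),
                 time = rep(1:3, 2))

# c(0L, x) prepends a single zero to each column within each id group
res <- DT[, lapply(.SD, function(x) c(0L, x)), by = id]

# Rows per id after the operation (3 original rows plus the zero row)
counts <- res[, .N, by = id]

# First row of each id, which should be the newly added zeros
firsts <- res[, .SD[1], by = id]
```

Since the zero is prepended inside each group, the result already has the zero row first for every id and needs no separate sort.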

Add a row of 0s for each identifier in a large database
