Speeding up the un-aggregation of summarized data

Asked: 2020-06-30 17:59:32

Tags: r for-loop rbind

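I have received an aggregated data set with 2.7 million rows and 13 columns, and I am trying to unpack it quickly. A short example of the underlying data and my code is below:

Column types = sex, age, hair color, eye color, birth state, survey date, count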

library(tibble)  # provides tibble()

new_data <- tibble(sex = c("M", "M", "F", "F", "F"), 
         age = c(18, 27, 34, 21, 25),
         hair_color = c("Brown", "Blonde","Black", "Pink", "Blonde"),
         eye_color = c("Brown", "Blue","Black", "Green", "Green"),
         birth_state = c("AK", "CO","CO", "FL", "CA"),
         survey_date = as.POSIXct(c("1/1/2020", "5/1/2020","2/2/2020", "1/10/2020", "1/1/2020"),format = "%d/%m/%Y"),
         count = c(10,14,6,8,6))

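The "10" in the last column (count) of the first row means that 10 people match the information in the preceding columns. I am trying to unpack the data set so that it has 10 rows that all carry the same preceding information, instead of a single row with "10" at the end.

Here is my current code: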
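The target shape described above can be sketched with plain row indexing. This is a minimal illustration, not a tested solution for the full 2.7-million-row table: `agg` is a hypothetical stand-in with only three of the columns, and `idx` and `expanded` are names introduced here. The idea is to repeat each row index `count` times with `rep()` and subset once, so no loop is needed.

```r
# Hypothetical stand-in for the aggregated table (only three columns)
agg <- data.frame(
  sex   = c("M", "F"),
  age   = c(18, 34),
  count = c(3, 2),
  stringsAsFactors = FALSE
)

# Repeat each row index as many times as its count says
idx <- rep(seq_len(nrow(agg)), times = agg$count)

# A single subsetting call produces all repeated rows at once
expanded <- agg[idx, ]
rownames(expanded) <- NULL

# Each expanded row now stands for one person
expanded$count <- 1

print(expanded)
```

The expanded table has `sum(agg$count)` rows, so it may be worth checking that total against available memory before running this on the real data.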

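start <- Sys.time()                 # start the timer (was missing from the snippet)
i <- as.numeric(nrow(new_data))     # number of aggregated rows to expand
check <- new_data                   # the result starts as a copy of the aggregated data

for (i in 1:i){
  k <- new_data[i,]                 # the row to duplicate
  j <- new_data[[i, 7]] - 1         # extra copies needed, extracted as a plain number
  u <- data.frame(t(replicate(j, k, simplify = TRUE)))  # j copies of the row, transposed into rows
  l <- list(check, u)
  check <- do.call("rbind", l)      # append to the growing result; rbind() builds a new data frame on every pass
  print(i)                          # progress indicator
}

check$cnts <- 1
end <- Sys.time()
end - start                         # elapsed time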

Code thoughts:
i - the total number of rows I need to duplicate
k - the row I am going to duplicate
j - how many additional times the row needs to be repeated
u - a data.frame built with replicate() to generate the repeated rows; I have to transpose it to get it into the right shape
l - a list holding the two data frames to be joined
check - the result of calling rbind() on that list; this gives me the best data.frame I have found so far

check$cnts <- 1 just resets the count from whatever it was to 1, indicating that each row is now a single survey point.
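One likely reason the loop is slow is that `check` is rebuilt by rbind() on every pass, so the work grows with the size of the result so far. A hedged sketch of the same row-by-row logic that instead collects the pieces in a list and binds them once at the end is shown below. The data frame `agg` and the names `pieces` and `expanded` are made up for illustration, and this has not been timed on data of the size described.

```r
# Hypothetical stand-in for the aggregated table
agg <- data.frame(
  sex   = c("M", "F", "F"),
  age   = c(18, 34, 21),
  count = c(3, 1, 2),
  stringsAsFactors = FALSE
)

# Build one small data frame per aggregated row, holding `count` copies of it
pieces <- lapply(seq_len(nrow(agg)), function(i) {
  agg[rep(i, agg$count[i]), , drop = FALSE]
})

# Bind all pieces in a single call instead of growing the result in the loop
expanded <- do.call(rbind, pieces)
rownames(expanded) <- NULL

# Each expanded row now stands for one person
expanded$count <- 1
```

Here the full set of copies is generated per row, so the result does not need to start from a copy of the original table as `check` does.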

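I have tried several different approaches but cannot find one that runs quickly. The unpacking has been running for a couple of days and is still going. When I tried other approaches (rbind(), repeat(), etc.), I ended up with a data.frame that had lists embedded in it. From what I have read, rbind() can sometimes cause problems.

I read another post about rbind(), and it seems I cannot use multiple processors to speed things up. Any help would be greatly appreciated!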
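The embedded lists can be reproduced on a tiny example, which suggests they come from replicate() rather than from rbind() itself. In the sketch below, `k` is a hypothetical one-row data frame; because a data frame is a list of columns, replicate() returns a matrix whose cells are list elements, and data.frame() keeps them as list columns. The names `u`, `v`, and `u_fixed` are introduced here for illustration.

```r
# Hypothetical one-row data frame, like `k` inside the loop
k <- data.frame(sex = "M", age = 18, count = 3, stringsAsFactors = FALSE)

# replicate() treats the row as a list, so every cell of the result is a list element
u <- data.frame(t(replicate(2, k, simplify = TRUE)))
str(u)  # inspect the column types

# Repeating the row by index keeps ordinary atomic columns
v <- k[rep(1, 2), ]
rownames(v) <- NULL
str(v)

# If list columns already exist, unlist() on each column flattens them
u_fixed <- as.data.frame(lapply(u, unlist), stringsAsFactors = FALSE)
```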
