Combining rows of a large dataset and concatenating values (converting an SQL export to multi-value)

Time: 2014-08-11 09:41:16

Tags: r

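I have an SQL export in CSV format containing more than 5 million records. I want to combine the rows that share the same PDP_ID and concatenate their values from one column into a new column.

I am using the following functions, but they take far too long to run and do not seem to make any progress: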

library(plyr)  # aaply() comes from plyr

PDP_ID <- unique(data$PDP_ID)

getDetailNumbers <- function(i)(paste(data$DETAIL_NUMBER[data$PDP_ID==i],collapse="@"))

DETAIL_NUMBERS <- aaply(PDP_ID,1,getDetailNumbers,.expand=FALSE,.progress="text")

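Once I have the (PDP_ID, DETAIL_NUMBERS) data.frame, my plan is to merge it with the original data frame.

PDP_ID contains about 4.1 million values. What is the quickest way to handle this? Splitting the file? The 'data' data frame is sorted on PDP_ID. I have also tried the snowfall package to use two CPU cores, but to no avail.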
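
For context on why this is slow: getDetailNumbers compares the whole PDP_ID column against one ID on every call, so with about 4.1 million IDs and 5 million rows the column is rescanned millions of times. A minimal base R sketch of a single-pass grouping is shown here; the toy data frame and the name `detail_by_id` are made up for illustration, and this has not been timed on the real export.

```r
# Toy stand-in for the 'data' frame from the question (values are illustrative).
data <- data.frame(
  PDP_ID        = c(111115, 231313, 231313, 231313, 467626, 467626, 638676),
  DETAIL_NUMBER = c(NA, 12, 15, 17, 1, 2, NA)
)

# Group DETAIL_NUMBER by PDP_ID once, then paste each group together.
# This avoids one full-column comparison per unique ID.
detail_by_id <- tapply(data$DETAIL_NUMBER, data$PDP_ID,
                       function(x) paste(x, collapse = "@"))

# Turn the named result into the (PDP_ID, DETAIL_NUMBERS) data frame.
result <- data.frame(PDP_ID         = as.numeric(names(detail_by_id)),
                     DETAIL_NUMBERS = as.vector(detail_by_id),
                     stringsAsFactors = FALSE)
```

Note that groups whose only value is missing come out as the string "NA" here, because paste() converts NA to text.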

Sample data:

"PDP_ID","STREETNAME_DUTCH","ACTUAL_BOX_NUMBER","DETAIL_NUMBER"
111115,"An entry which wont be combined",
231313,"Street two",12
231313,"Street two",15
231313,"Street two",17
467626,"a third entry",1
467626,"a third entry",2
638676,"another which wont be combined",

Desired result:

"PDP_ID","STREETNAME_DUTCH","ACTUAL_BOX_NUMBER","DETAIL_NUMBER"
111115,"An entry which wont be combined",
231313,"Street two",12@15@17
467626,"a third entry",1@2
638676,"another which wont be combined",

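1 answer:

Answer 0 (score: 3)

Your data is a bit odd: you have 4 column names but only 3 columns, so I removed one of the column names.

Anyway, this should be very fast with data.table.

First, your data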

df <- read.csv(text = '"PDP_ID","STREETNAME_DUTCH","DETAIL_NUMBER"
111115,"An entry which wont be combined",
231313,"Street two",12
231313,"Street two",15
231313,"Street two",17
467626,"a third entry",1
467626,"a third entry",2
638676,"another which wont be combined",')

Solution

library(data.table)
setDT(df)[ , list(STREETNAME_DUTCH = STREETNAME_DUTCH[1],
                  DETAIL_NUMBER = paste(DETAIL_NUMBER, collapse = "@")), by = PDP_ID]

Result

#    PDP_ID                STREETNAME_DUTCH DETAIL_NUMBER
# 1: 111115 An entry which wont be combined            NA
# 2: 231313                      Street two      12@15@17
# 3: 467626                   a third entry           1@2
# 4: 638676  another which wont be combined            NA
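
One detail in this output: the NA shown for the uncombined rows is the text "NA" produced by paste(), not a true missing value, whereas the desired result in the question leaves those cells empty. A possible way to keep them missing is sketched here. The helper name `collapse_non_missing` is hypothetical (it is not part of data.table), and the table is rebuilt inline so the snippet runs on its own.

```r
library(data.table)

# Same sample rows as above, built directly as a data.table.
dt <- data.table(
  PDP_ID = c(111115, 231313, 231313, 231313, 467626, 467626, 638676),
  STREETNAME_DUTCH = c("An entry which wont be combined",
                       "Street two", "Street two", "Street two",
                       "a third entry", "a third entry",
                       "another which wont be combined"),
  DETAIL_NUMBER = c(NA, 12, 15, 17, 1, 2, NA)
)

# Hypothetical helper: drop missing values before pasting, and return a
# real NA (as character) when nothing is left in the group.
collapse_non_missing <- function(x, sep = "@") {
  x <- x[!is.na(x)]
  if (length(x) == 0L) NA_character_ else paste(x, collapse = sep)
}

res <- dt[, list(STREETNAME_DUTCH = STREETNAME_DUTCH[1],
                 DETAIL_NUMBER    = collapse_non_missing(DETAIL_NUMBER)),
          by = PDP_ID]
```

When writing the result back to CSV, write.csv(res, na = "") should then leave those cells empty.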

Alternatively, you could try dplyr (also very fast).

Important: detach the plyr package first, using detach("package:plyr", unload=TRUE)

Solution

library(dplyr)
df %>%
  group_by(PDP_ID) %>%
  summarise(STREETNAME_DUTCH = STREETNAME_DUTCH[1],
            DETAIL_NUMBER = paste(DETAIL_NUMBER, collapse = "@"))

Result

# Source: local data frame [4 x 3]
# 
#   PDP_ID                STREETNAME_DUTCH DETAIL_NUMBER
# 1 111115 An entry which wont be combined            NA
# 2 231313                      Street two      12@15@17
# 3 467626                   a third entry           1@2
# 4 638676  another which wont be combined            NA
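
On the note about detaching plyr: as far as I understand it, the concern is that plyr also exports a summarise() that does not respect dplyr's grouping, so if plyr is attached after dplyr the grouped summary can collapse to a single row. If detaching is not convenient, a sketch of an alternative is to namespace-qualify the dplyr calls. The data frame is rebuilt inline here so the snippet stands alone.

```r
library(dplyr)

df <- data.frame(
  PDP_ID = c(111115, 231313, 231313, 231313, 467626, 467626, 638676),
  STREETNAME_DUTCH = c("An entry which wont be combined",
                       "Street two", "Street two", "Street two",
                       "a third entry", "a third entry",
                       "another which wont be combined"),
  DETAIL_NUMBER = c(NA, 12, 15, 17, 1, 2, NA),
  stringsAsFactors = FALSE
)

# dplyr:: makes sure the dplyr versions are used even if another
# attached package (such as plyr) exports functions with the same names.
res <- df %>%
  dplyr::group_by(PDP_ID) %>%
  dplyr::summarise(STREETNAME_DUTCH = STREETNAME_DUTCH[1],
                   DETAIL_NUMBER    = paste(DETAIL_NUMBER, collapse = "@"))
```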