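I have a SQL export in CSV with more than 5 million records. I want to combine the rows that share the same PDP_ID and concatenate their values from one column into a new column.
I am using the following functions, but they take far too long to run and do not seem to make any progress: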
PDP_ID <- unique(data$PDP_ID)
getDetailNumbers <- function(i)(paste(data$DETAIL_NUMBER[data$PDP_ID==i],collapse="@"))
DETAIL_NUMBERS <- aaply(PDP_ID,1,getDetailNumbers,.expand=FALSE,.progress="text")
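Once I have the (PDP_ID, DETAIL_NUMBERS) data.frame, my plan is to merge it with the original data frame.
PDP_ID contains about 4.1 million entries. What is the fastest way to handle this? Splitting the file? The 'data' data frame is sorted on PDP_ID. I also tried the snowfall package to use two CPU cores, but to no avail.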
Sample data:
"PDP_ID","STREETNAME_DUTCH","ACTUAL_BOX_NUMBER","DETAIL_NUMBER"
111115,"An entry which wont be combined",
231313,"Street two",12
231313,"Street two",15
231313,"Street two",17
467626,"a third entry",1
467626,"a third entry",2
638676,"another which wont be combined",
Desired result:
"PDP_ID","STREETNAME_DUTCH","ACTUAL_BOX_NUMBER","DETAIL_NUMBER"
111115,"An entry which wont be combined",
231313,"Street two",12@15@17
467626,"a third entry",1@2
638676,"another which wont be combined",
Answer 0 (score: 3)
Your data is a bit odd: it has 4 column names but only 3 columns, so I removed one column name.
Anyway, use data.table; this should be very fast.
First, your data:
df <- read.csv(text = '"PDP_ID","STREETNAME_DUTCH","DETAIL_NUMBER"
111115,"An entry which wont be combined",
231313,"Street two",12
231313,"Street two",15
231313,"Street two",17
467626,"a third entry",1
467626,"a third entry",2
638676,"another which wont be combined",')
Solution
library(data.table)
setDT(df)[ , list(STREETNAME_DUTCH = STREETNAME_DUTCH[1],
DETAIL_NUMBER = paste(DETAIL_NUMBER, collapse = "@")), by = PDP_ID]
Result
# PDP_ID STREETNAME_DUTCH DETAIL_NUMBER
# 1: 111115 An entry which wont be combined NA
# 2: 231313 Street two 12@15@17
# 3: 467626 a third entry 1@2
# 4: 638676 another which wont be combined NA
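Note that paste() turns a missing value into the literal string "NA", so the NA entries above are text, while the desired result in the question leaves those cells empty. Here is a minimal sketch of one way to get empty strings instead, assuming it is acceptable to drop missing values with na.omit() before pasting. The names dt and res are only illustrative, and the sample data is rebuilt directly rather than read from the CSV.

```r
library(data.table)

# Same sample data as above, built directly as a data.table
dt <- data.table(
  PDP_ID = c(111115, 231313, 231313, 231313, 467626, 467626, 638676),
  STREETNAME_DUTCH = c("An entry which wont be combined",
                       "Street two", "Street two", "Street two",
                       "a third entry", "a third entry",
                       "another which wont be combined"),
  DETAIL_NUMBER = c(NA, 12, 15, 17, 1, 2, NA)
)

# Assumption: missing DETAIL_NUMBER values should simply be dropped.
# A group that contains only NA then collapses to the empty string "".
res <- dt[, list(STREETNAME_DUTCH = STREETNAME_DUTCH[1],
                 DETAIL_NUMBER = paste(na.omit(DETAIL_NUMBER), collapse = "@")),
          by = PDP_ID]
print(res)
```

For the full 5-million-row file, reading the CSV with data.table's fread() instead of read.csv() would probably also save time, though I have not measured it on this data.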
Alternatively, you could try dplyr (also very fast).
Important: detach the plyr package first, using detach("package:plyr", unload=TRUE)
Solution
library(dplyr)
df %>%
group_by(PDP_ID) %>%
summarise(STREETNAME_DUTCH = STREETNAME_DUTCH[1],
DETAIL_NUMBER = paste(DETAIL_NUMBER, collapse = "@"))
Result
# Source: local data frame [4 x 3]
#
# PDP_ID STREETNAME_DUTCH DETAIL_NUMBER
# 1 111115 An entry which wont be combined NA
# 2 231313 Street two 12@15@17
# 3 467626 a third entry 1@2
# 4 638676 another which wont be combined NA
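If detaching plyr is inconvenient, a possible alternative is to call the dplyr verbs with an explicit dplyr:: prefix, so that a plyr function of the same name (such as summarise) cannot mask them. The following is only a sketch on the same sample data. The na.omit() call is my addition, made on the assumption that rows without a DETAIL_NUMBER should produce an empty string rather than the text "NA", and res is just an illustrative name.

```r
library(dplyr)

# Same sample data as above, built directly as a data frame
df <- data.frame(
  PDP_ID = c(111115, 231313, 231313, 231313, 467626, 467626, 638676),
  STREETNAME_DUTCH = c("An entry which wont be combined",
                       "Street two", "Street two", "Street two",
                       "a third entry", "a third entry",
                       "another which wont be combined"),
  DETAIL_NUMBER = c(NA, 12, 15, 17, 1, 2, NA),
  stringsAsFactors = FALSE
)

# Explicit dplyr:: prefixes avoid masking if plyr is also attached.
# Assumption: missing values are dropped, so all-NA groups become "".
res <- df %>%
  dplyr::group_by(PDP_ID) %>%
  dplyr::summarise(STREETNAME_DUTCH = STREETNAME_DUTCH[1],
                   DETAIL_NUMBER = paste(na.omit(DETAIL_NUMBER), collapse = "@"))
print(res)
```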