How do I select rows from one data.table to apply in another data.table?

Asked: 2016-02-10 08:41:55

Tags: r data.table

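I have two data.tables: df (21 million rows) and tmp (500k rows).

df has three columns. It links an original patent (origpat) to a referenced patent (refpat), and ties an original classification (mainprim) to the origpat.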

The first 30 rows are shown below. Each origpat-refpat pair is unique, but each origpat occurs between 1 and 300 times, and each refpat occurs between 1 and 3,100 times.

dput(df[1:30,-3])
structure(list(origpat = c(4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592),
ref.pat = c(4318978, 4436368, 4358181, 4478622, 4312654, 4293439, 4286061, 4363648, 4406517, 4478623, 4277285, 4375743, 4470520, 4328022, 4248614, 4297139, 4296607, 4296608, 4395271, 4321141, 4294190, 4431420, 4322467, 4285730, 4393138, 4246034, 4251278, 4339174, 4277322, 4290586),
mainprim = c("442", "442", "442", "442", "442", "442", "442", "442", "442", "442", "442", "442", "442", "442", "442", "442", "442", "442", "442", "442", "442", "442", "442", "442", "442", "442", "442", "442", "442", "442")),
.Names = c("origpat", "ref.pat", "mainprim"), row.names = c(NA, 30L), class = c("data.table", "data.frame"))

tmp contains a list of patents (pnum) and their respective main classifications (prim). All origpat and refpat values in df are pnum (patent numbers) in tmp. As example data, I selected the part of tmp that contains all the information related to the selected df rows.

dput(tmp)
structure(list(pnum = c("4318978", "4318978", "4318978", "4318978", "4318978", "4318978", "4318978", "4318978", "4436368", "4436368", "4436368", "4436368", "4358181", "4358181", "4358181", "4358181", "4478622", "4312654", "4312654", "4312654", "4312654", "4312654", "4312654", "4293439", "4293439", "4293439", "4293439", "4293439", "4293439", "4293439", "4293439", "4293439", "4293439", "4293439", "4293439", "4293439", "4286061", "4286061", "4286061", "4286061", "4286061", "4286061", "4286061", "4286061", "4363648", "4363648", "4363648", "4406517", "4478623", "4478623", "4277285", "4375743", "4375743", "4375743", "4375743", "4470520", "4470520", "4470520", "4328022", "4328022", "4248614", "4248614", "4248614", "4248614", "4248614", "4248614", "4297139", "4297139", "4297139", "4297139", "4297139", "4296607", "4296607", "4296607", "4296607", "4296607", "4296607", "4296608", "4296608", "4296608", "4296608", "4296608", "4395271", "4395271", "4395271", "4321141", "4321141", "4321141", "4321141", "4294190", "4294190", "4294190", "4294190", "4294190", "4294190", "4431420", "4431420", "4431420", "4431420", "4431420", "4431420", "4322467", "4322467", "4322467", "4322467", "4322467", "4322467", "4322467", "4322467", "4322467", "4322467", "4285730", "4285730", "4393138", "4393138", "4393138", "4393138", "4393138", "4393138", "4393138", "4246034", "4246034", "4246034", "4246034", "4251278", "4251278", "4251278", "4339174", "4339174", "4339174", "4339174", "4277322", "4277322", "4290586", "4290586", "4290586", "4290586", "4290586", "4247592", "4247592", "4247592", "4247592", "4247592", "4247592", "4247592", "4247592", "4247592"),
prim = c("430", "430", "430", "430", "430", "430", "430", "430", "340", "385", "385", "385", "385", "385", "65", "65", "65", "118", "427", "65", "65", "65", "65", "106", "106", "106", "501", "501", "501", "501", "501", "516", "516", "516", "516", "516", "435", "435", "435", "435", "435", "435", "435", "435", "156", "428", "65", "385", "65",
"65", "501", "422", "53", "53", "53", "222", "422", "604", "65", "65", "385", "385", "65", "65", "65", "65", "106", "106", "501", "501", "501", "252", "423", "423", "501", "505", "62", "423", "501", "501", "505", "62", "65", "65", "65", "210", "210", "210", "435", "118", "118", "118", "118", "118", "118", "106", "433", "433", "433", "433", "501", "156", "427", "427", "428", "428", "428", "428", "428", "428", "428", "501", "501", "426", "426", "426", "435", "435", "435", "435", "428", "501", "501", "501", "501", "501", "65", "385", "385", "385", "65", "204", "204", "204", "266", "266", "432", "73", "427", "427", "428", "442", "442", "442", "442", "8", "8")),
.Names = c("pnum", "prim" ), class = c("data.table", "data.frame"), row.names = c(NA, -147L ), .internal.selfref = <pointer: 0x0000000000100788>)

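Now I want to compare the mainprim (associated with origpat) with the different prim values linked to refpat.

The following code works, but it is far too slow.

library(data.table)
df <- data.table(df) ; setkey(df, refpat, origpat)
refs <- unique(df$refpat) # Capture all unique refpat in df (71,000 in entire data.table)
startrow <- 0 # Number of rows of df already processed
overlap <- function(a,b) sum (a == b) / length(b)
df$compare <- NA # overlap values will be inserted here
for (h in 1:length(refs)) {
  refclass <- tmp$prim[tmp$pnum == refs[h]] # subgroup of relevant 'prim'
  x <- length(df$refpat[df$refpat == refs[h]]) # number of rows for this refpat
  # isolate subset from large `df` data.table to reduce memory needed in second loop
  # (starts at startrow + 1 so that prims[i] lines up with row startrow + i)
  prims <- df$mainprim[(startrow + 1):(startrow + x)]
  for (i in 1:x) {
    df$compare[startrow + i] <- overlap(prims[i], refclass)
  }
  startrow <- startrow + x
  print(h)
}

The reason I use two for loops is to save memory. I could use a single loop and recompute refclass for every row, but that crashes my computer within a few minutes. The loop above works, but at its current speed it would take about 250 hours to finish. I am sure there is a way to simply subset the required rows of tmp from within df and repeat this for each origpat, but my data.table skills are not up to the task, and I could not find an answer that explains how to do this on SO or in the data.table PDF documentation.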

Any suggestions are very welcome.

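EDIT @Frank: The specific comparison I want to make changes all the time. The main problem is as follows. Consider a long data.table (df) with two columns of linked pnum (patent numbers), one named origpat and the second named ref.pat. Each column contains many repeated pnum, but each combination (on a single row) is unique. The table establishes links between a company's patents and earlier patents. This dataset has about 22 million rows.

I then have several other data.tables, for example one that relates inventors to pnum and one that relates pnum to various technology classifications. What I am interested in is the fastest way to compare the linked data (e.g. inventors, technology classes) in a pairwise fashion, with the pairs defined by the rows of df (i.e. origpat and ref.pat). The data.table solutions I have so far are the fastest, but they still take multiple days to complete one new comparison. I hope this helps.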

2 Answers:

Answer 0: (score: 2)

My best idea is:

df[,idx := .I] # Add an index to the data.table to group by row of df
df[,compare := sum(tmp[pnum == ref.pat, prim] == mainprim) /
     length(tmp[pnum == ref.pat,prim]),by = idx]

Or reuse your overlap function (still using the idx column):

df[,compare := overlap(
                mainprim,
                tmp[pnum == ref.pat, prim]),
    by=idx]

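The idea here is to group by row, so that within each group the columns refer to a single row of df. That gives you this row's mainprim and lets you pull the matching subset of tmp.

If you want to avoid creating the idx column, you can use by = 1:nrow(df) instead, but this may slow things down (grouping by an actual column is faster in data.table).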
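To make the row-wise grouping concrete, here is a minimal sketch on toy data. The patent numbers and classes are made up for illustration and are not the asker's data; pnum and ref.pat are both numeric here so that the comparison needs no type coercion.

```r
library(data.table)

# Toy data (hypothetical values, for illustration only)
df  <- data.table(origpat  = c(1, 1, 2),
                  ref.pat  = c(10, 20, 10),
                  mainprim = c("442", "442", "65"))
tmp <- data.table(pnum = c(10, 10, 10, 10, 20, 20),
                  prim = c("442", "442", "65", "8", "65", "65"))

# Share of the elements of b that equal a
overlap <- function(a, b) sum(a == b) / length(b)

# One group per row: inside j, ref.pat and mainprim are single values
df[, idx := .I]
df[, compare := overlap(mainprim, tmp[pnum == ref.pat, prim]), by = idx]

print(df)
```

For the first row, the four prim values linked to patent 10 contain "442" twice, so compare is 2/4.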

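A significant improvement from @Docendo:

You can speed things up further by storing the subset in an intermediate variable, rather than subsetting twice per row:

df[,compare := {x = tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)}, by = idx]

If df contains repeated combinations of ref.pat and mainprim, you can optimize further by using by = list(ref.pat, mainprim) instead of by = idx:

df[,compare := {x = tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)}, by = list(ref.pat, mainprim)]

Using mean() instead of sum()/length() may bring another (probably only minimal) improvement, since mean(x == mainprim) computes the same fraction.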
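The code for the mean() variant is not shown in the thread. The following is a sketch of what it could look like, on made-up toy data, together with a check that it agrees with the row-wise sum()/length() version. The column names by_row and by_pair are introduced here purely for the comparison.

```r
library(data.table)

# Toy data (hypothetical values); the last row repeats the first
# (ref.pat, mainprim) combination
df  <- data.table(ref.pat  = c(10, 20, 10, 10),
                  mainprim = c("442", "442", "65", "442"))
tmp <- data.table(pnum = c(10, 10, 10, 10, 20, 20),
                  prim = c("442", "442", "65", "8", "65", "65"))

# Row-wise version with an intermediate variable and sum()/length()
df[, idx := .I]
df[, by_row := {x = tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)},
   by = idx]

# One evaluation per distinct (ref.pat, mainprim) pair, using mean()
df[, by_pair := mean(tmp[pnum == ref.pat, prim] == mainprim),
   by = list(ref.pat, mainprim)]

# Both columns should hold the same values
print(all.equal(df$by_row, df$by_pair))
```

Grouping by the pair means tmp is subset three times here instead of four; the saving grows with the number of repeated combinations in df.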

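Answer 1: (score: 0)

If I understand the problem correctly, you need to join the two tables on ref.pat. Make sure that ref.pat in df and pnum in tmp have the same class. The desired join is then obtained with: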

library(data.table)

df <- data.table(df)
tmp <- data.table(tmp)

setkey(df, 'ref.pat')
out <- df[tmp, nomatch = 0]
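The join by itself does not yet produce the overlap value. As a sketch of one way to finish the job (an assumption about the intended result, not something this answer states), the example below first aligns the column classes, then joins each df row to all tmp rows of its referenced patent, and finally aggregates per origpat/ref.pat pair. The toy values are made up, and the join is written with on = rather than setkey.

```r
library(data.table)

# Toy data (hypothetical values); as in the question's sample,
# ref.pat is numeric while pnum is character
df  <- data.table(origpat  = c(1, 1, 2),
                  ref.pat  = c(10, 20, 10),
                  mainprim = c("442", "442", "65"))
tmp <- data.table(pnum = c("10", "10", "10", "10", "20", "20"),
                  prim = c("442", "442", "65", "8", "65", "65"))

# Give both join columns the same class
df[, ref.pat := as.character(ref.pat)]

# One output row per (df row, matching tmp row); the result has the columns
# pnum (holding the ref.pat values), prim, origpat and mainprim
joined <- tmp[df, on = .(pnum = ref.pat), nomatch = 0L, allow.cartesian = TRUE]

# Share of the referenced patent's prim values that equal mainprim
res <- joined[, .(compare = mean(prim == mainprim)),
              by = .(origpat, ref.pat = pnum)]
print(res)
```

On the full 21 million row table the intermediate joined table can become very large. data.table's by = .EACHI can compute the aggregate per df row during the join, which avoids materializing it.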