I have two data.tables, `df` (21 million rows) and `tmp` (500k rows).
`df` has three columns that link an original patent (`origpat`) to a referenced patent (`refpat`), together with the original classification (`mainprim`), which is tied to `origpat`.
The first 30 rows are shown below. Each (`origpat`, `refpat`) pair is unique, but each `origpat` occurs between 1 and 300 times and each `refpat` between 1 and 3,100 times.
dput(df[1:30,-3])
structure(list(origpat = c(4247592, 4247592, 4247592, 4247592,
4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592,
4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592,
4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592,
4247592, 4247592, 4247592, 4247592, 4247592), ref.pat = c(4318978,
4436368, 4358181, 4478622, 4312654, 4293439, 4286061, 4363648,
4406517, 4478623, 4277285, 4375743, 4470520, 4328022, 4248614,
4297139, 4296607, 4296608, 4395271, 4321141, 4294190, 4431420,
4322467, 4285730, 4393138, 4246034, 4251278, 4339174, 4277322,
4290586), mainprim = c("442", "442", "442", "442", "442", "442",
"442", "442", "442", "442", "442", "442", "442", "442", "442",
"442", "442", "442", "442", "442", "442", "442", "442", "442",
"442", "442", "442", "442", "442", "442")), .Names = c("origpat",
"ref.pat", "mainprim"), row.names = c(NA, 30L), class = c("data.table",
"data.frame"))
`tmp` contains a list of patents (`pnum`) and their respective main classifications (`prim`). All `origpat` and `refpat` values in `df` are `pnum` values (patent numbers) in `tmp`.
As example data, I selected the following part of the `tmp` data.table:
dput(tmp)
structure(list(pnum = c("4318978", "4318978", "4318978", "4318978",
"4318978", "4318978", "4318978", "4318978", "4436368", "4436368",
"4436368", "4436368", "4358181", "4358181", "4358181", "4358181",
"4478622", "4312654", "4312654", "4312654", "4312654", "4312654",
"4312654", "4293439", "4293439", "4293439", "4293439", "4293439",
"4293439", "4293439", "4293439", "4293439", "4293439", "4293439",
"4293439", "4293439", "4286061", "4286061", "4286061", "4286061",
"4286061", "4286061", "4286061", "4286061", "4363648", "4363648",
"4363648", "4406517", "4478623", "4478623", "4277285", "4375743",
"4375743", "4375743", "4375743", "4470520", "4470520", "4470520",
"4328022", "4328022", "4248614", "4248614", "4248614", "4248614",
"4248614", "4248614", "4297139", "4297139", "4297139", "4297139",
"4297139", "4296607", "4296607", "4296607", "4296607", "4296607",
"4296607", "4296608", "4296608", "4296608", "4296608", "4296608",
"4395271", "4395271", "4395271", "4321141", "4321141", "4321141",
"4321141", "4294190", "4294190", "4294190", "4294190", "4294190",
"4294190", "4431420", "4431420", "4431420", "4431420", "4431420",
"4431420", "4322467", "4322467", "4322467", "4322467", "4322467",
"4322467", "4322467", "4322467", "4322467", "4322467", "4285730",
"4285730", "4393138", "4393138", "4393138", "4393138", "4393138",
"4393138", "4393138", "4246034", "4246034", "4246034", "4246034",
"4251278", "4251278", "4251278", "4339174", "4339174", "4339174",
"4339174", "4277322", "4277322", "4290586", "4290586", "4290586",
"4290586", "4290586", "4247592", "4247592", "4247592", "4247592",
"4247592", "4247592", "4247592", "4247592", "4247592"), prim = c("430",
"430", "430", "430", "430", "430", "430", "430", "340", "385",
"385", "385", "385", "385", "65", "65", "65", "118", "427", "65",
"65", "65", "65", "106", "106", "106", "501", "501", "501", "501",
"501", "516", "516", "516", "516", "516", "435", "435", "435",
"435", "435", "435", "435", "435", "156", "428", "65", "385",
"65", "65", "501", "422", "53", "53", "53", "222", "422", "604",
"65", "65", "385", "385", "65", "65", "65", "65", "106", "106",
"501", "501", "501", "252", "423", "423", "501", "505", "62",
"423", "501", "501", "505", "62", "65", "65", "65", "210", "210",
"210", "435", "118", "118", "118", "118", "118", "118", "106",
"433", "433", "433", "433", "501", "156", "427", "427", "428",
"428", "428", "428", "428", "428", "428", "501", "501", "426",
"426", "426", "435", "435", "435", "435", "428", "501", "501",
"501", "501", "501", "65", "385", "385", "385", "65", "204",
"204", "204", "266", "266", "432", "73", "427", "427", "428",
"442", "442", "442", "442", "8", "8")), .Names = c("pnum", "prim"
), class = c("data.table", "data.frame"), row.names = c(NA, -147L))
This sample contains all the information related to the variables in the `dput` of `df` above.
Now I would like to compare the `mainprim` (associated with `origpat`) with the different `prim` values linked to `refpat`.
The following code works, but it is far too slow.
library(data.table)
df <- data.table(df) ; setkey(df, refpat, origpat) # sorting by refpat makes each refpat one contiguous block of rows
refs <- unique(df$refpat) # Capture all unique refpat in df (71,000 in entire data.table)
startrow <- 0 # Number of rows already processed
overlap <- function(a, b) sum(a == b) / length(b) # share of entries in b equal to a
df$compare <- NA_real_ # overlap values will be inserted here
for (h in seq_along(refs)) {
  refclass <- tmp$prim[tmp$pnum == refs[h]] # subgroup of relevant 'prim'
  x <- sum(df$refpat == refs[h]) # number of rows for this refpat
  prims <- df$mainprim[(startrow + 1):(startrow + x)] # isolate subset from large `df` data.table to reduce memory needed in second loop
  for (i in seq_len(x)) {
    df$compare[startrow + i] <- overlap(prims[i], refclass)
  }
  startrow <- startrow + x
  print(h)
}
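The reason I use two `for` loops is to save memory. I could use a single loop and redetermine `refclass` for every row, but that crashes my computer within minutes. The loop above works, but at its current speed it would take about 250 hours to finish.
I am sure there is a way to simply subset the required rows of `tmp` from within `df` and then repeat this for each `origpat`, but my data.table skills are not up to the task, and I cannot find an answer that explains how to do this on SO or in the data.table PDF documentation.
Any suggestions are very welcome.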
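To make the target quantity concrete, here is a minimal sketch of what `overlap()` returns for a single (`origpat`, `refpat`) pair. The classification values are hypothetical and chosen only for illustration; they are not taken from the real data.

```r
# overlap() as defined in the loop above: the share of the referenced
# patent's classifications (b) that equal the original patent's
# main classification (a).
overlap <- function(a, b) sum(a == b) / length(b)

# Hypothetical classifications for one (origpat, refpat) pair
mainprim_orig <- "442"
prim_ref      <- c("427", "427", "428", "442", "442")

result <- overlap(mainprim_orig, prim_ref)  # 2 of the 5 entries match
print(result)
```

The goal is to compute this value once for every row of `df`.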
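EDIT @Frank: The specific comparisons I want to make keep changing. The main problem is as follows. Consider a long `df` with two columns that link `pnum` values (patent numbers), one called `origpat` and the other called `ref.pat`. Each column contains many duplicate `pnum` values, but each combination (on a single row) is unique. It establishes the link between a firm's patents and earlier patents. This dataset has about 22 million rows.
I then have several other data tables, e.g. one that associates `inventors` with `pnum` and one that associates `pnum` with various technology classifications. What I am interested in is the fastest way to compare the linked data (e.g. inventors, technology classes) pairwise, with the pairs defined by the rows of `df` (i.e. `origpat` and `ref.pat`). The data.table solution I have so far is the fastest, but a single new comparison still takes several days to complete.
Hope this helps.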
Answer 0 (score: 2)
My best idea is:
df[,idx := .I] # Add an index to the data.table to group by row of df
df[,compare := sum(tmp[pnum == ref.pat, prim] == mainprim) /
length(tmp[pnum == ref.pat,prim]),by = idx]
Or, reusing the `overlap` function (still with the `idx` column):
df[,compare := overlap(
mainprim,
tmp[pnum == ref.pat, prim]),
by=idx]
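The idea here is to group by row, so that within each group the columns of the subset data give this row's `mainprim` and the required subset of `tmp`.
If you want to avoid creating the `idx` column, you can use `by=1:nrow(df)` instead, but this may slow the process down (grouping by an actual column is faster in `data.table`).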
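A major improvement from @Docendo:
You can speed things up further by creating an intermediate variable that stores the subset, instead of subsetting twice per row:
df[,compare := {x = tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)}, by = idx]
If `df` contains duplicate combinations of `ref.pat` and `mainprim`, you can optimize further by using `by = list(ref.pat, mainprim)` instead of `by = idx`, so that each distinct combination is computed only once:
df[,compare := {x = tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)},
by = list(ref.pat, mainprim)]
Another (probably only marginal) improvement is to use `mean()` instead of `sum()/length()`.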
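The code for the `mean()` variant did not survive in this copy of the answer. The following is a minimal sketch of what it could look like, assuming the same `tmp[pnum == ref.pat, prim]` subset and the `by = list(ref.pat, mainprim)` grouping used above. The toy tables are hypothetical stand-ins for the real `df` and `tmp`.

```r
library(data.table)

# Toy data (hypothetical values, far smaller than the real tables)
df <- data.table(ref.pat  = c("10", "11", "10", "10"),
                 mainprim = c("442", "442", "65", "442"))
tmp <- data.table(pnum = c("10", "10", "10", "10", "11", "11"),
                  prim = c("442", "442", "65", "501", "65", "65"))

# mean() of a logical vector is the share of TRUE values, i.e. the same
# quantity as sum(x == mainprim) / length(x).
df[, compare := mean(tmp[pnum == ref.pat, prim] == mainprim),
   by = list(ref.pat, mainprim)]
print(df)
```

Because the grouping is by (`ref.pat`, `mainprim`), the first and last rows of the toy `df` share one computation.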
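Answer 1 (score: 0)
If I understand the problem correctly, you need to join the two tables on `ref.pat`. Make sure that `ref.pat` in `df` and `pnum` in `tmp` have the same class (in the sample data above, `ref.pat` is numeric while `pnum` is character). The required join can then be obtained as follows: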
library(data.table)
df <- data.table(df)
tmp <- data.table(tmp)
setkey(df, 'ref.pat')
out <- df[tmp, nomatch = 0]
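The join only lines up the rows; the overlap still has to be aggregated per pair. Below is a minimal sketch of one way to do that, assuming `merge()` with `allow.cartesian = TRUE` followed by a grouped `mean()`. The toy tables and the names `joined` and `res` are hypothetical, and on the full data the joined table may be very large, so memory use should be checked on a subset first.

```r
library(data.table)

# Toy data (hypothetical); ref.pat and pnum are both character here
df <- data.table(origpat  = c("1", "1", "2"),
                 ref.pat  = c("10", "11", "10"),
                 mainprim = c("442", "442", "65"))
tmp <- data.table(pnum = c("10", "10", "10", "10", "11", "11"),
                  prim = c("442", "442", "65", "501", "65", "65"))

# Attach every classification of the referenced patent to its df rows.
# allow.cartesian = TRUE is needed because one df row matches several
# tmp rows, so the result can have more rows than both inputs combined.
joined <- merge(df, tmp, by.x = "ref.pat", by.y = "pnum",
                allow.cartesian = TRUE)

# One overlap value per (origpat, ref.pat) pair
res <- joined[, .(compare = mean(prim == mainprim)),
              by = .(origpat, ref.pat)]
print(res)
```

This trades the per-row subsetting of `tmp` for a single join plus one grouped aggregation.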