Determining the co-occurrence of row values in a data.table

Time: 2017-05-25 13:46:36

Tags: r data.table

I have a dataset with the following structure:

dput(structure(foc[1:50]))
structure(list(firm_id = c("Texas", "Texas", "Texas", "Micron", 
"Micron", "DowCor", "DowCor", "DowCor", "DowCor", "DowCor", "DowCor", 
"Altera", "Altera", "Texas", "Texas", "Texas", "Molex", "Molex", 
"DowCor", "DowCor", "DowCor", "NSC", "NSC", "Micron", "Micron", 
"AAV", "AAV", "AAV", "AMD", "AMD", "DowCor", "DowCor", "Molex", 
"Molex", "Molex", "NSC", "NSC", "NSC", "Micron", "Micron", "CORN", 
"CORN", "DowCor", "DowCor", "Zilog", "Zilog", "CORN", "CORN", 
"CORN", "Micron"), pnum = c(5351876, 5351876, 5351876, 5362632, 
5362632, 5364633, 5364633, 5364633, 5364633, 5364633, 5364633, 
5369314, 5369314, 5370301, 5370301, 5370301, 5370551, 5370551, 
5371128, 5371128, 5371128, 5372410, 5372410, 5376577, 5376577, 
5383340, 5383340, 5383340, 5384272, 5384272, 5384383, 5384383, 
5384435, 5384435, 5384435, 5385861, 5385861, 5385861, 5387534, 
5387534, 5387558, 5387558, 5389365, 5389365, 5389565, 5389565, 
5392376, 5392376, 5392376, 5393694), date = structure(c(8769, 
8769, 8769, 8804, 8804, 8838, 8838, 8838, 8838, 8838, 8838, 8818, 
8818, 8769, 8769, 8769, 8772, 8772, 8779, 8779, 8779, 8798, 8798, 
8946, 8946, 8848, 8848, 8848, 8944, 8944, 8796, 8796, 8793, 8793, 
8793, 8839, 8839, 8839, 8890, 8890, 8887, 8887, 8803, 8803, 8772, 
8772, 8866, 8866, 8866, 8931), class = "Date"), PRIM = c("228", 
"257", "269", "257", "438", "264", "424", "428", "514", "521", 
"977", "326", "714", "228", "257", "269", "220", "439", "424", 
"427", "524", "188", "303", "257", "438", "257", "361", "62", 
"257", "438", "528", "556", "174", "361", "439", "148", "257", 
"438", "257", "438", "106", "501", "424", "528", "257", "438", 
"385", "428", "501", "257"), N = c(3L, 3L, 3L, 2L, 2L, 6L, 6L, 
6L, 6L, 6L, 6L, 2L, 2L, 3L, 3L, 3L, 2L, 2L, 3L, 3L, 3L, 2L, 2L, 
2L, 2L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 2L)), .Names = c("firm_id", 
"pnum", "date", "PRIM", "N"), sorted = "pnum", class = c("data.table", 
"data.frame"), row.names = c(NA, -50L), .internal.selfref = <pointer: 0x0000000000140788>)

which looks like this:

foc
       firm_id    pnum       date PRIM N
    1:   Texas 5351876 1994-01-04  228 3
    2:   Texas 5351876 1994-01-04  257 3
    3:   Texas 5351876 1994-01-04  269 3
    4:  Micron 5362632 1994-02-08  257 2
    5:  Micron 5362632 1994-02-08  438 2
   ---                                  
91731:   Intel 7472285 2003-06-25  713 3
91732:   Intel 7472289 2004-12-21  381 2
91733:   Intel 7472289 2004-12-21  713 2
91734:   Intel 7472390 2003-10-01  712 2
91735:   Intel 7472390 2003-10-01  718 2

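I have a larger data.table named df, of which the above is a subset. Specifically, the subset above starts in 1994, while df goes back to 1980. The column names in df are the same, except that, for clarity, the PRIM column of foc is called prim in df.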

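I want to determine how often pairs of PRIMs occur in the larger dataset. A pair exists when two PRIMs co-occur under the same pnum. The same PRIM never appears twice under one pnum, and every pnum in the dataset has between 2 and 8 PRIMs. In addition, I want to impose a time restriction using "date", i.e. I only want to consider pnums that are less than 5 years older than the focal pnum.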

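For example, the first pnum in the data above is 5351876. It has three distinct PRIMs and therefore three pairs: (228,257), (228,269) and (257,269). The data.table sample also contains a pnum with 6 distinct PRIMs, which gives 15 distinct pairs. Note that the order within a pair does not matter, so (228,257) = (257,228).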
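As a quick check on this counting, here is a minimal sketch using base R's combn() and choose(). The vector prims is just the three PRIM values of pnum 5351876 typed in by hand; it is not part of the original data objects.

```r
# PRIM values of pnum 5351876 from the sample data
prims <- c("228", "257", "269")

# combn() returns one pair per column; transpose to get one pair per row.
# Sorting first means each unordered pair is written in one canonical order.
pairs <- t(combn(sort(prims), 2))
print(pairs)

# Number of unordered pairs for 3 and for 6 distinct PRIMs
n_pairs_3 <- choose(3, 2)
n_pairs_6 <- choose(6, 2)
```

With k distinct PRIMs under one pnum there are choose(k, 2) unordered pairs, which is where the 3 and the 15 come from.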

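The code below does a simpler version of what I need. It counts how many times each PRIM occurred in the preceding 5 years, but I am not sure how to determine how often a particular pair occurred.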

findpairs <- data.table()
findpairs <- data.table(rbind(findpairs, foc[, {print(.GRP) ; k = pnum ; p = PRIM ; y = unique(date)
                                        df[(date < y & date > (y - (5*365 + 1)) & p == prim), .N]}
                                          , by = .(pnum, PRIM)]))

Any suggestions are very welcome.

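PS: In a second stage, I would like to be able to add two "firm_id" conditions: either exclude the focal firm_id or look only at one firm_id. That is why this variable is currently kept in the data.table but not used.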

Edit 1: After the first attempted answer, I should clarify the desired output. There may be a more optimal solution that produces a different output, but this is what I think would be great: a data.table with 5 columns: pnum, date (the date of the pnum), prim, paired prim, and pair incidence in the 5 years before date. Keep in mind that a pair exists only when both PRIM values are found under the same pnum in the df data.table.

I hope this clarifies things!


2 answers:

Answer 0: (score: 0)

I came up with a solution. You can create the combinations with the following function.

make_prim_pairs <- function(values,n=2){
  combinations <- (apply(t(combn(values,min(n,length(values)))),1,paste,collapse=","))
  return(combinations)
}

So, if you want to find the pairs for the whole dataset:

findpairs <- foc[,.(primPairs = make_prim_pairs(PRIM)),by=pnum] # the column is PRIM in foc (prim in df)

That should find all pairs by pnum. You can also add conditions on the data before forming the pairs.

y <- some_date # placeholder: the reference date (class Date)
findpairs <- foc[date < y & date > (y - (5*365 + 1)),.(primPairs = make_prim_pairs(PRIM)),by=pnum]
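One detail of the window used here: 5*365 + 1 is a fixed offset of 1826 days, which matches five calendar years only when exactly one leap day falls inside the window. The sketch below compares it with a calendar-based alternative built from seq() on a Date. The reference date is hypothetical, and whether the one-day difference matters depends on the analysis.

```r
y <- as.Date("1997-01-04")  # hypothetical reference date

# Fixed offset used in the code above: 5 * 365 + 1 = 1826 days
start_fixed <- y - (5 * 365 + 1)

# Calendar-based alternative: step back exactly five years
start_calendar <- seq(y, by = "-5 years", length.out = 2)[2]

print(start_fixed)
print(start_calendar)
```

For this date the window contains two leap days (1992 and 1996), so the two start dates differ by one day.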

Let me know if this helps.

Answer 1: (score: 0)

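I found a solution that works on a small dataset, but it has now been running for more than 18 hours on the larger one. I do not know how close it is to finishing, but I thought I would share it. Perhaps someone can make sense of it and improve it.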

# Create all possible distinct pairs of prim classes that exist in the dataset df
setkey(df, pnum)
a <- df[df, allow.cartesian = TRUE] # Cartesian join to combine all possible pairs
a <- a[a$prim != a$i.prim] # delete pairs consisting of the same prim values 
a[, idx:= .I] # add index
a$pair <- a[,paste0(min(prim, i.prim),"_",max(prim, i.prim)),by = idx][[2]] # create pairs based on a single logic:1_2 must be same as 2_1
DT1 <- a[, .N, by = .(firm_id, pnum, date, pair)] # this is to delete the repeated pairs 
rm(a)
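The per-row min()/max() step (grouping by idx) is probably one of the slower parts of this block. A possible vectorized variant is sketched below on a toy table with hypothetical contents. It keeps only the rows where prim < i.prim, which drops self-pairs and mirrored duplicates in one filter. Because prim is a character column, the comparison is lexicographic, the same ordering min() and max() give on strings (so "62" sorts after "257").

```r
library(data.table)

# Toy stand-in for df (hypothetical values, same column names as in the post)
df <- data.table(
  firm_id = "Texas",
  pnum    = c(5351876, 5351876, 5351876),
  date    = as.Date("1994-01-04"),
  prim    = c("228", "257", "269")
)

# Self-join on pnum: every prim is combined with every prim of the same pnum
a <- df[df, on = "pnum", allow.cartesian = TRUE]

# Keep one ordering per pair; this also removes pairs of a prim with itself
a <- a[prim < i.prim]

# The lower value is always in prim, so no min()/max() is needed
a[, pair := paste0(prim, "_", i.prim)]
DT1 <- a[, .(firm_id, pnum, date, pair)]
print(DT1)
```

Since each unordered pair now appears once per pnum, the later `.N`-by-group step that removed repeated pairs should no longer be needed in this variant.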

# Create all possible distinct pairs of prim classes that exist in the subset foc
setkey(foc, pnum)
a <- foc[foc, allow.cartesian = TRUE] # Cartesian join to combine all possible pairs
a <- a[a$PRIM != a$i.PRIM] # delete pairs consisting of the same prim values 
a[, idx:= .I] # add index
a$pair <- a[,paste0(min(PRIM, i.PRIM),"_",max(PRIM, i.PRIM)),by = idx][[2]] # create pairs based on a single logic:1_2 must be same as 2_1
DT2 <- a[, .N, by = .(firm_id, pnum, date, pair)] # this is to delete the repeated pairs 

rm(a)
DT1[, N:= NULL] ; DT2[, N:= NULL] # unwanted columns

setnames(DT2, "pair", "PAIR") # only for clarity purposes in the formula below. This is the post 1994 data set. 

couples <- data.table()
couples <- data.table(rbind(couples, DT2[, {k = pnum ; p = PAIR ; y = unique(date)
                                        DT1[(date < y & date > (y - (5*365 + 1)) & p == pair), .N]}
                                          , by = .(pnum, PAIR)]))

# This formula gives, I believe, the number of times each unique pair occurred in the preceding 5 years.
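The lookup above subsets DT1 once per (pnum, PAIR) group, which may explain much of the long run time. One alternative, offered as a sketch rather than a tested replacement, is a single non-equi join with by = .EACHI (this needs data.table 1.9.8 or later). The toy tables and the helper columns lo and hi are assumptions made for illustration; the window matches date < y & date > (y - (5*365 + 1)).

```r
library(data.table)

# Toy stand-ins (hypothetical values): DT1 = pairs in df, DT2 = pairs in foc
DT1 <- data.table(
  pnum = c(1, 2, 3, 4),
  date = as.Date(c("1990-01-01", "1992-06-01", "1980-01-01", "1993-03-01")),
  pair = c("257_438", "257_438", "257_438", "228_257")
)
DT2 <- data.table(
  pnum = c(10, 10),
  date = as.Date(c("1994-01-04", "1994-01-04")),
  PAIR = c("257_438", "228_269")
)

# Window bounds for each focal row (helper columns, removed again below)
DT2[, `:=`(lo = date - (5 * 365 + 1), hi = date)]

# One non-equi join; by = .EACHI evaluates .N once for each row of DT2,
# and .N is 0 for focal rows without any match
res <- DT1[DT2, on = .(pair == PAIR, date > lo, date < hi), .N, by = .EACHI]

# res has one row per row of DT2, in the same order
DT2[, n_pairs := res$N]
DT2[, c("lo", "hi") := NULL]
print(DT2)
```

A firm_id condition could be added as a further join column, but that is outside what the original code does.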

couples$lowp <- sub("_.+","", couples$PAIR)   # split up the pair 
couples$highp <- sub(".+_","", couples$PAIR)  # split up the pair 

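That does the trick. The next step is to look up how many times lowp and highp each occur in the database (via findpairs from the OP), which is done simply by matching, and then to compute the desired variable.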

coup <- couples
coup$n_lowp <- counts$n_p[match(paste(coup$pnum,"",coup$lowp), paste(counts$pnum,"",counts$PRIM))]
coup$n_highp <- counts$n_p[match(paste(coup$pnum,"",coup$highp), paste(counts$pnum,"",counts$PRIM))]

coup$yaya <- with(coup, n_pairs / (n_lowp + n_highp - n_pairs))
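The ratio computed in the last line has the form of a Jaccard index: the count for the pair divided by the count for either member of the pair. A small sketch with hypothetical numbers follows; the function name pair_overlap is made up for illustration and is not part of the answer's code.

```r
# Jaccard-style ratio: occurrences of the pair over occurrences of either PRIM
pair_overlap <- function(n_pairs, n_lowp, n_highp) {
  n_pairs / (n_lowp + n_highp - n_pairs)
}

# Hypothetical counts: the pair occurred 2 times, the two PRIMs 5 and 3 times
ex <- pair_overlap(n_pairs = 2, n_lowp = 5, n_highp = 3)
print(ex)
```

If neither PRIM occurred in the window, the denominator is zero and the result is NaN, so such rows may need separate handling.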

I am sure there are more efficient ways, but it works (slowly).