Subsetting on unique combinations of strings across columns

Asked: 2018-12-21 11:45:58

Tags: r data.table

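I have data on the best performers as of the last date in a larger dataset. Next, I want to subset the whole dataset to retrieve every row belonging to those best performers. A "best performer" is identified by a combination of two strings. So far, however, I have not been able to subset the data correctly.

I tried using %in% to do part of the job, but it tests each column independently, so it keeps every row whose x appears in best$x and whose y appears in best$y (for example (a, q)) rather than only the exact pairs.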

library(data.table)
best = data.table(Date = as.Date(c("2016-01-01", "2016-01-01")), x = c("a", "b"), y = c("p", "q"))
wholedt = data.table(Date = as.Date(c("2015-12-01","2015-12-01","2015-12-01","2016-01-01", "2016-01-01", "2016-01-01")), x = c("a", "c", "b", "a","a", "b"), y = c("p", "q", "q", "q","p", "q"))
SDbest_of_whole = wholedt[with(wholedt, x %in% best$x & y %in% best$y)]

The expected output would include all data points for the combinations (a, p) and (b, q), and none for (a, q) or (b, p).

expected_output = data.table(Date = as.Date(c("2015-12-01","2015-12-01","2016-01-01", "2016-01-01")), x = c("a", "b","a", "b"), y = c("p", "q","p", "q"))
> expected_output
         Date x y
1: 2015-12-01 a p
2: 2015-12-01 b q
3: 2016-01-01 a p
4: 2016-01-01 b q

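2 answers:

Answer 0 (score: 0)

One way to make sure that only the combinations of interest are kept is to merge your datasets. Dropping Date from best first means the merge matches on the remaining shared columns, x and y: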

library(data.table)
best = data.table(Date = as.Date(c("2016-01-01", "2016-01-01")), x = c("a", "b"), y = c("p", "q"))
wholedt = data.table(Date = as.Date(c("2015-12-01","2015-12-01","2015-12-01","2016-01-01", "2016-01-01", "2016-01-01")), x = c("a", "c", "b", "a","a", "b"), y = c("p", "q", "q", "q","p", "q"))

best[,Date:=NULL]
merge(best, wholedt)

#    x y       Date
# 1: a p 2015-12-01
# 2: a p 2016-01-01
# 3: b q 2015-12-01
# 4: b q 2016-01-01
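
Note that the merge above returns the rows sorted by x and y rather than in their original order. If the original row order of wholedt matters, a minimal sketch of an alternative is to use a data.table join only to look up the matching row numbers and then subset by them. The names idx and best_of_whole are made up for this illustration, and nomatch = NULL assumes a reasonably recent data.table (older versions use nomatch = 0):

```r
library(data.table)

best <- data.table(Date = as.Date(c("2016-01-01", "2016-01-01")),
                   x = c("a", "b"), y = c("p", "q"))
wholedt <- data.table(
  Date = as.Date(c("2015-12-01", "2015-12-01", "2015-12-01",
                   "2016-01-01", "2016-01-01", "2016-01-01")),
  x = c("a", "c", "b", "a", "a", "b"),
  y = c("p", "q", "q", "q", "p", "q")
)

# Join on both columns at once; which = TRUE returns the row numbers of
# wholedt that match a row of best instead of the joined table.
idx <- wholedt[best, on = c("x", "y"), which = TRUE, nomatch = NULL]

# Sorting the row numbers restores the original order of wholedt.
best_of_whole <- wholedt[sort(idx)]
```

Because only x and y are named in on, the Date column in best does not take part in the match and does not need to be removed first.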

Answer 1 (score: 0)

For each row in wholedt, you want to check whether best contains a row with the same x and y.

SDbest_of_whole <- wholedt[apply(wholedt[,c('x', 'y')], 1, function(w) any(apply(best[,c('x', 'y')], 1, identical, w))),]
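
The nested apply calls compare every row of wholedt against every row of best, which may become slow on a large table. As a hedged sketch of the same idea in vectorized form, the two columns can be pasted into a single key so that %in% tests the pair rather than each column separately. The separator "|" is an assumption: it must not occur in the x or y values, otherwise different pairs could collide.

```r
library(data.table)

best <- data.table(Date = as.Date(c("2016-01-01", "2016-01-01")),
                   x = c("a", "b"), y = c("p", "q"))
wholedt <- data.table(
  Date = as.Date(c("2015-12-01", "2015-12-01", "2015-12-01",
                   "2016-01-01", "2016-01-01", "2016-01-01")),
  x = c("a", "c", "b", "a", "a", "b"),
  y = c("p", "q", "q", "q", "p", "q")
)

# Build one combined key per (x, y) pair present in best.
best_keys <- paste(best$x, best$y, sep = "|")

# Keep only rows of wholedt whose combined key is one of those pairs;
# (a, q) becomes "a|q", which is not in best_keys, so it is dropped.
SDbest_of_whole <- wholedt[paste(x, y, sep = "|") %in% best_keys]
```

This keeps the rows in their original order, so the result lines up with expected_output from the question.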