Pull a subset of data frame rows based on conditions in other columns

Asked: 2018-05-11 14:07:39

Tags: r dataframe datatable subset

I have a data frame that looks like this:

library(data.table)

x <- data.table(Tickers=c("A","A","A","B","B","B","B","D","D","D","D"),
                Type=c("put","call","put","call","call","put","call","put","call","put","call"),
                Strike=c(35,37.5,37.5,10,11,11,12,40,40,42,42),
                Other=sample(20,11))

    Tickers Type Strike Other
 1:       A  put   35.0     6
 2:       A call   37.5     5
 3:       A  put   37.5    13
 4:       B call   10.0    15
 5:       B call   11.0    12
 6:       B  put   11.0     4
 7:       B call   12.0    20
 8:       D  put   40.0     7
 9:       D call   40.0    11
10:       D  put   42.0    10
11:       D call   42.0     1

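I am trying to analyze a subset of the data. The subset I want is the rows that share the same Tickers and Strike. However, I only want those rows if both a put and a call exist under Type for that pair. Using the data above as an example, I would like to return the following result: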

x[c(2,3,5,6,8:11),]

   Tickers Type Strike Other
1:       A call   37.5     5
2:       A  put   37.5    13
3:       B call   11.0    12
4:       B  put   11.0     4
5:       D  put   40.0     7
6:       D call   40.0    11
7:       D  put   42.0    10
8:       D call   42.0     1

I am not sure what the best way to do this is. My thought process was that I should create another column vector, like

x$id <- paste(x$Tickers,x$Strike,sep="_")

and then use this vector to pull out only the rows whose id occurs more than once.

x[x$id %in% x$id[duplicated(x$id)],]

   Tickers Type Strike Other     id
1:       A call   37.5     5 A_37.5
2:       A  put   37.5    13 A_37.5
3:       B call   11.0    12   B_11
4:       B  put   11.0     4   B_11
5:       D  put   40.0     7   D_40
6:       D call   40.0    11   D_40
7:       D  put   42.0    10   D_42
8:       D call   42.0     1   D_42

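I am not sure how efficient this is, since my actual data contains many more rows. Also, this solution does not check the Type condition, that is, whether there is both a put and a call.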

The title could probably be worded much better, my apologies.

Edit: I looked at this post: Finding ALL duplicate rows, including "elements with smaller subscripts"

I could also use this solution:

x$id <- paste(x$Tickers,x$Strike,sep="_")
x[duplicated(x$id) | duplicated(x$id,fromLast=TRUE),]

2 Answers:

Answer 0: (score: 2)

I modified your data to include a case where a put and a call do not both exist (I changed the last "call" to "put"):

x <- data.table(Tickers=c("A","A","A","B","B","B","B","D","D","D","D"),
            Type=c("put","call","put","call","call","put","call","put","call","put","put"),
            Strike=c(35,37.5,37.5,10,11,11,12,40,40,42,42),
            Other=sample(20,11))

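Since you are using data.table, you can use the built-in counter .N together with by variables to count within groups and then subset. If counting the distinct values of Type is enough to reliably determine that both a put and a call are present, this could work: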

x[, `:=`(n = .N, types = uniqueN(Type)), by = c('Tickers', 'Strike')][n > 1 & types == 2]

The part inside the first set of [] does the counting, and then [n > 1 & types == 2] does the subsetting.
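As an alternative sketch, each group can be tested directly for both types, without adding the helper columns n and types to x. This assumes Type only ever holds the literal strings "put" and "call". The name `both` and the fixed `Other` values are illustrative choices, not part of the original answer.

```r
library(data.table)

# Modified data from this answer. Other is fixed here (an assumption) so the
# example is reproducible instead of depending on sample().
x <- data.table(Tickers = c("A","A","A","B","B","B","B","D","D","D","D"),
                Type    = c("put","call","put","call","call","put","call","put","call","put","put"),
                Strike  = c(35,37.5,37.5,10,11,11,12,40,40,42,42),
                Other   = 1:11)

# Keep a (Tickers, Strike) group only if it contains both a put and a call.
# When the condition is FALSE the expression returns NULL, which drops the group.
both <- x[, if (all(c("put", "call") %in% Type)) .SD, by = .(Tickers, Strike)]
both
```

Unlike the `:=` version, this leaves x unchanged, and the grouping columns Tickers and Strike come first in the result.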

Answer 1: (score: 0)

I am not a user of the data.table package, so this code is base R only.

agg <- aggregate(Type ~ Tickers + Strike, data = x, length)
result <- merge(x, subset(agg, Type > 1)[1:2], by = c("Tickers", "Strike"))[, c(1, 3, 2, 4)]
result
#   Tickers Type Strike Other
#1:       A call   37.5    17
#2:       A  put   37.5     7
#3:       B call   11.0    14
#4:       B  put   11.0    20
#5:       D  put   40.0    15
#6:       D call   40.0     2
#7:       D  put   42.0     8
#8:       D call   42.0     1


rm(agg)    # final clean up
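The aggregate/merge approach above only counts rows per Tickers/Strike pair, so it would also keep a pair that has two puts and no call. A minimal base R sketch that additionally requires both a put and a call could look like the following. It assumes Type contains only "put" and "call", and it uses a plain data.frame named `df` with fixed `Other` values, both of which are illustrative.

```r
# Data with the last "call" changed to "put", so the D/42 pair has no call.
df <- data.frame(
  Tickers = c("A","A","A","B","B","B","B","D","D","D","D"),
  Type    = c("put","call","put","call","call","put","call","put","call","put","put"),
  Strike  = c(35,37.5,37.5,10,11,11,12,40,40,42,42),
  Other   = 1:11,
  stringsAsFactors = FALSE
)

# One key per Tickers/Strike pair, as in the question's id column.
key <- paste(df$Tickers, df$Strike, sep = "_")

# Keys that appear with a put, and keys that appear with a call.
put_keys  <- key[df$Type == "put"]
call_keys <- key[df$Type == "call"]

# Keep only rows whose key appears in both sets.
result <- df[key %in% intersect(put_keys, call_keys), ]
result
```

With this data, the D/42 pair is dropped because both of its rows are puts.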