Remove consecutive duplicate rows from an xts object

Time: 2017-11-13 19:43:00

Tags: r xts

Suppose I have the following xts object containing bid and ask data:

   Time     Bid    Ask
   00:01    10     11
   00:02    10     11
   00:03    11     12
   00:04    12     13
   00:05    10     11
   00:06    10     11
   00:07    10     11
   00:08    9      12
   00:09    2      10
   00:10    4      5

I would like to get the following output:

   Time     Bid    Ask
   00:01    10     11
   00:03    11     12
   00:04    12     13
   00:05    10     11
   00:08    9      12
   00:09    2      10
   00:10    4      5

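Only rows that are identical to the previous entry should be removed. If either the bid or the ask changes, nothing is removed, so simply dropping all duplicates will not work.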

This should be fairly simple, since I have done it before, but I cannot remember how and cannot find it.

Update: I have added some extra entries to the initial data and the expected output.

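Joshua's answer is brilliant, but it relies on a function such as rowSums giving a different result for every distinct row, and it breaks on 9 12 (9 + 12 = 21, the same sum as 10 + 11). I tried the rowProds function from the matrixStats package instead, but that clearly fails on my last two rows (2 * 10 = 4 * 5 = 20). Also, in my example the value in the first column is always less than the value in the second column. That makes sense for bid and ask prices, but it is not a requirement, so the function should still work when Ask is less than Bid, a case where both rowSums and rowProds fail. Is there a better row function that always gives a different result whenever anything differs, for example a row hash?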
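The collisions described above can be checked directly. This is a minimal base-R sketch on a plain matrix rather than an xts object; apply(m, 1, prod) stands in for matrixStats::rowProds so that no package is needed, and the variable names are only illustrative.

```r
# Last four rows of the extended sample (00:07 to 00:10), as a plain matrix.
m <- cbind(Bid = c(10, 9, 2, 4), Ask = c(11, 12, 10, 5))

sums  <- rowSums(m)          # 21 21 12 9: rows 1 and 2 collide
prods <- apply(m, 1, prod)   # 110 108 20 20: rows 3 and 4 collide

# Swapping Bid and Ask changes the row but neither the sum nor the product.
swapped <- m[, c("Ask", "Bid")]
same_sums  <- all(rowSums(swapped) == sums)
same_prods <- all(apply(swapped, 1, prod) == prods)
```

Any function that is symmetric in its arguments, such as a sum or a product, cannot tell (10, 11) from (11, 10), so the comparison has to look at each column separately.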

1 answer:

Answer 0 (score: 4)

You can do this by using rle() on the sum of each row.

x <- structure(
  c(10L, 10L, 11L, 12L, 10L, 10L, 10L, 11L, 11L, 12L, 13L, 11L, 11L, 11L),
  .Dim = c(7L, 2L), .Dimnames = list(NULL, c("Bid", "Ask")),
  index = structure(1:7, tzone = "", tclass = c("POSIXct", "POSIXt")),
  .indexCLASS = c("POSIXct", "POSIXt"), .indexTZ = "",
  tclass = c("POSIXct", "POSIXt"), tzone = "", class = c("xts", "zoo"))
r <- rle(rowSums(x))

If you want the last observation in each group, you can just use cumsum(r$lengths) as the row index when subsetting.

R> x[cumsum(r$lengths),]
                    Bid Ask
1969-12-31 18:00:02  10  11
1969-12-31 18:00:03  11  12
1969-12-31 18:00:04  12  13
1969-12-31 18:00:07  10  11

Since you want the first observation of each group, you need to prepend the r$lengths vector with a 1 (you always want the first observation) and then remove the last element of r$lengths. Then call cumsum() on the result.

R> x[cumsum(c(1, head(r$lengths, -1))),]
                    Bid Ask
1969-12-31 18:00:01  10  11
1969-12-31 18:00:03  11  12
1969-12-31 18:00:04  12  13
1969-12-31 18:00:05  10  11
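
To make the index arithmetic concrete, the following base-R sketch reproduces the run lengths for the seven sample rows. The vector s is typed in by hand (it holds the values rowSums(x) returns for the data above, without the time index), and the names last_idx and first_idx are only illustrative.

```r
# Row sums of the seven sample rows: 10+11, 10+11, 11+12, 12+13, 10+11, ...
s <- c(21, 21, 23, 25, 21, 21, 21)

r <- rle(s)
r$lengths                     # 2 1 1 3: length of each run of equal sums

# Last row of each run: running total of the run lengths.
last_idx <- cumsum(r$lengths)                     # 2 3 4 7

# First row of each run: start at 1, then add each run length except the last.
first_idx <- cumsum(c(1, head(r$lengths, -1)))    # 1 3 4 5
```

Each run starts one row after the previous run ends, which is why shifting the lengths by one position and starting from 1 gives the first row of every group.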

Good catch on the limitation of rowSums(). A robust solution is to diff() the bids and asks and select the rows where either is not zero.

d <- diff(x) != 0           # rows with price changes
d[1,] <- TRUE               # always select first observation
g <- cumsum(d$Bid | d$Ask)  # groups of repeats
r <- rle(as.numeric(g))     # run length encoding on groups

# now use the solution above
x[cumsum(c(1, head(r$lengths, -1))),]
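
As a check against the extended data in the question, here is a minimal base-R sketch of the same idea on a plain matrix, so it runs without the xts package. It is an illustration under that assumption, not the code above verbatim: the names m, changed and keep are made up for the example, and the rle() step is skipped because a logical "row changed" vector can be used for subsetting directly.

```r
# Extended sample data from the question, as a plain matrix (no time index).
m <- cbind(Bid = c(10, 10, 11, 12, 10, 10, 10, 9, 2, 4),
           Ask = c(11, 11, 12, 13, 11, 11, 11, 12, 10, 5))

# diff() on a matrix differences consecutive rows and returns nrow(m) - 1 rows,
# so the first row is marked TRUE by hand (it has no predecessor).
changed <- c(TRUE, rowSums(diff(m) != 0) > 0)

keep <- which(changed)   # 1 3 4 5 8 9 10
m[keep, ]
```

Row 8 (9, 12) is kept even though its sum equals that of row 7, because each column is compared separately. Note that diff() on an xts object keeps the first row as NA instead of dropping it, which is why the code above sets d[1,] <- TRUE.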