Subsetting a data frame in R

Time: 2013-05-06 12:07:18

Tags: r dataframe subset

I have two data frames, DF and df2:

> DF
        date tickers
1 2000-01-01       B
2 2000-01-01    GOOG
3 2000-01-01       V
4 2000-01-01    YHOO
5 2000-01-02     XOM

> df2
        date tickers quantities
1 2000-01-01      BB         11
2 2000-01-01     XOM         23
3 2000-01-01    GOOG         42
4 2000-01-01    YHOO         21
5 2000-01-01       V       2112
6 2000-01-01       B         13
7 2000-01-02     XOM         24
8 2000-01-02      BB        422

I need the rows of df2 whose date and ticker combination also exists in DF. That is, I need the following output:

3 2000-01-01    GOOG         42
4 2000-01-01    YHOO         21
5 2000-01-01       V       2112
6 2000-01-01       B         13
7 2000-01-02     XOM         24

So I used the following code:

> subset(df2,df2$date %in% DF$date & df2$tickers %in% DF$tickers)
        date tickers quantities
2 2000-01-01     XOM         23
3 2000-01-01    GOOG         42
4 2000-01-01    YHOO         21
5 2000-01-01       V       2112
6 2000-01-01       B         13
7 2000-01-02     XOM         24

But the output contains an extra row (row 2). This is because the ticker 'XOM' appears on two days in df2, so both rows get selected. What modifications does my code need?

The input is as follows:

> dput(DF)
structure(list(date = structure(c(1L, 1L, 1L, 1L, 2L), .Label = c("2000-01-01",
"2000-01-02"), class = "factor"), tickers = structure(c(4L, 5L, 6L, 8L, 7L),
.Label = c("A", "AA", "AAPL", "B", "GOOG", "V", "XOM", "YHOO", "Z"),
class = "factor")), .Names = c("date", "tickers"), row.names = c(NA, -5L),
class = "data.frame")

> dput(df2)
structure(list(date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L),
.Label = c("2000-01-01", "2000-01-02"), class = "factor"),
tickers = structure(c(2L, 5L, 3L, 6L, 4L, 1L, 5L, 2L), .Label = c("B", "BB",
"GOOG", "V", "XOM", "YHOO"), class = "factor"), quantities = c(11, 23, 42, 21,
2112, 13, 24, 422)), .Names = c("date", "tickers", "quantities"),
row.names = c(NA, -8L), class = "data.frame")

2 answers:

Answer 0 (score: 3)

Using the sqldf package:

require(sqldf)

sqldf("SELECT d2.date, d2.tickers, d2.quantities FROM df2 d2 
       JOIN DF d1 ON d1.date=d2.date AND d1.tickers=d2.tickers")

##        date tickers quantities
## 1 2000-01-01    GOOG         42
## 2 2000-01-01    YHOO         21
## 3 2000-01-01       V       2112
## 4 2000-01-01       B         13
## 5 2000-01-02     XOM         24

Answer 1 (score: 1)

This is not that different from my answer to this post of yours, but it needs a slight modification:

df2[duplicated(rbind(DF, df2[,1:2]))[-seq_len(nrow(DF))], ]

#         date tickers quantities
# 3 2000-01-01    GOOG         42
# 4 2000-01-01    YHOO         21
# 5 2000-01-01       V       2112
# 6 2000-01-01       B         13
# 7 2000-01-02     XOM         24

Note: this gives the output rows in the same order as they appear in df2.
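To make the one-liner above easier to follow, here is a step-by-step sketch of the same idea. The data frames are rebuilt from the question with character columns instead of factors (an assumption made for brevity), and the intermediate names stacked, is_dup and in_DF are only illustrative.

```r
# Data from the question, rebuilt with character columns for simplicity.
DF <- data.frame(
  date    = c("2000-01-01", "2000-01-01", "2000-01-01", "2000-01-01", "2000-01-02"),
  tickers = c("B", "GOOG", "V", "YHOO", "XOM"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  date       = c(rep("2000-01-01", 6), rep("2000-01-02", 2)),
  tickers    = c("BB", "XOM", "GOOG", "YHOO", "V", "B", "XOM", "BB"),
  quantities = c(11, 23, 42, 21, 2112, 13, 24, 422),
  stringsAsFactors = FALSE
)

# Step 1: stack the key columns, DF first, then the matching columns of df2.
stacked <- rbind(DF, df2[, c("date", "tickers")])

# Step 2: flag every row that repeats an earlier row of the stack.
is_dup <- duplicated(stacked)

# Step 3: drop the flags that belong to DF's own rows (the first nrow(DF) entries),
# leaving one flag per row of df2.
in_DF <- is_dup[-seq_len(nrow(DF))]

# Step 4: keep the flagged rows of df2, in their original order.
result <- df2[in_DF, ]
result
```

One caveat with this approach: duplicated() flags any repeated row, so a (date, tickers) pair that occurs twice within df2 but never in DF would still be flagged on its second occurrence. With the data in the question each pair occurs once per data frame, so this does not arise.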


Or, as Ben suggested, using merge:

merge(df2, DF, by=c("date", "tickers"))

will also give the same result (but not necessarily in the same order).
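For comparison with the attempt in the question: the two %in% tests there are evaluated independently, so the row (2000-01-01, XOM) passes because its date occurs somewhere in DF and its ticker occurs somewhere in DF, just not together. A minimal base-R sketch that tests the pair as a single key is shown below. It is not taken from either answer; the "|" separator is an assumption (it must not occur in the dates or tickers), and the names key_DF, key_df2, loose and fixed are illustrative.

```r
# Data from the question, rebuilt with character columns for simplicity.
DF <- data.frame(
  date    = c("2000-01-01", "2000-01-01", "2000-01-01", "2000-01-01", "2000-01-02"),
  tickers = c("B", "GOOG", "V", "YHOO", "XOM"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  date       = c(rep("2000-01-01", 6), rep("2000-01-02", 2)),
  tickers    = c("BB", "XOM", "GOOG", "YHOO", "V", "B", "XOM", "BB"),
  quantities = c(11, 23, 42, 21, 2112, 13, 24, 422),
  stringsAsFactors = FALSE
)

# The attempt from the question: each column is matched on its own,
# so (2000-01-01, XOM) is kept even though DF only has XOM on 2000-01-02.
loose <- subset(df2, date %in% DF$date & tickers %in% DF$tickers)

# Build one key per row from both columns and match on that instead.
key_DF  <- paste(DF$date, DF$tickers, sep = "|")
key_df2 <- paste(df2$date, df2$tickers, sep = "|")
fixed   <- df2[key_df2 %in% key_DF, ]
fixed
```

Like the duplicated() approach, this keeps the rows in the order they have in df2.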