Question

I have two tables that both have the following fields:

date
id
vegetables
fruits
metric

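df2 is a subset of df1. df1 has ~8k records and df2 has about 4k.

My goal is to create a new df, or add a column to the parent data frame df1, with TRUE/FALSE indicating whether the date/id combination exists in df2. Basically a lookup.

Should I go the lookup-table route, or should I create a new data frame by joining df1 and df2?

I can't join on id alone; it has to be the combination of id and date, because some ids come back on different dates.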

I tried left_join():

comb <- left_join(x = df1, y = df2, by=c("date", "id"))

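But the result came back with the vegetables, fruits and metric columns from both tables, when I really only want to keep the df1 columns:

date
id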
vegetables.x
fruits.x
metric.x
vegetables.y
fruits.y
metric.y

All I want is:

date
id
vegetables
fruits
InDF2 (boolean)
metric

What is the best way to determine which rows of df1 (date + id) also exist in df2 (date + id)?

Answer 1

Perhaps paste date and id together into one vector per data frame, e.g. df1_vector and df2_vector, and then use %in%.
Try df1$df2_presence_check = paste(df1$date, df1$id) %in% paste(df2$date, df2$id)

Example

set.seed(42)
a = sample(letters, 5)
b = sample(letters, 15)

a %in% b
#[1] FALSE FALSE TRUE TRUE TRUE

# OR
b %in% a
#[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
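Applied to data frames shaped like the ones in the question, the same idea might look like the sketch below. The toy values and the names key1 and key2 are made up for illustration. The explicit sep is a precaution rather than part of the suggestion above: choose a character that cannot occur in either column, so that two different date/id pairs never paste to the same string.

```r
# Toy data (hypothetical values) using the question's key columns.
df1 <- data.frame(
  date   = as.Date(c("2016-01-01", "2016-01-02", "2016-01-02", "2016-01-03")),
  id     = c(1, 1, 2, 3),
  metric = c(0.5, 1.2, -0.3, 2.0)
)
# df2 is a subset of df1 (rows 2 and 4).
df2 <- df1[c(2, 4), ]

# Build one composite key per row from date and id.
key1 <- paste(df1$date, df1$id, sep = "|")
key2 <- paste(df2$date, df2$id, sep = "|")

# TRUE where the df1 date/id pair also occurs in df2.
df1$InDF2 <- key1 %in% key2
df1$InDF2
# [1] FALSE  TRUE FALSE  TRUE
```

Note that id 1 appears on two dates in df1 but is flagged only for the date that is also in df2, which is why matching on id alone would not be enough.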

Answer 2

Here is a solution done entirely in dplyr. I am assuming that id and date together form a unique key.

Let's add some data for reproducibility

set.seed(23489)
n <- 10

df1 <- data.frame(
  id=sample(1e4:9e4, n),
  date=sample(seq(as.Date('2015/01/01'), as.Date('2017/01/01'), by="day"), n),
  vegetables= c("Broccoli", "Cabbage", "Calabrese", "Carrots", "Cauliflower", 
                "Celery", "Chard", "Endive", "Fiddleheads", "Frisee"),
  fruits=c("Jabuticaba", "Jackfruit", "Jambul", "Jujube", "Juniper berry",
           "Kiwi", "Kumquat", "Lemon", "Lime", "Loquat"),
  metric=rnorm(n=n)
)

df2 <- df1[sample(seq_len(nrow(df1)), n/2), ]

Next, we generate the output you want

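library(dplyr)

# Flag every df2 key with InDF2 = TRUE, join the flag onto df1, then turn unmatched rows (NA) into FALSE
df1 %>%
  left_join(select(mutate(df2, InDF2=TRUE), id, date, InDF2), by=c("id", "date")) %>%
  mutate(InDF2=!is.na(InDF2))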

#       id       date  vegetables        fruits      metric InDF2
# 1  80283 2016-11-26    Broccoli    Jabuticaba  1.68765979 FALSE
# 2  14766 2016-10-18     Cabbage     Jackfruit -0.16774908 FALSE
# 3  19532 2015-03-29   Calabrese        Jambul -1.18328968  TRUE
# 4  46187 2015-03-09     Carrots        Jujube  1.83044569 FALSE
# 5  76852 2016-01-11 Cauliflower Juniper berry -0.05744373 FALSE
# 6  45507 2015-10-27      Celery          Kiwi -1.78166251 FALSE
# 7  65227 2016-07-07       Chard       Kumquat -1.84756162  TRUE
# 8  71433 2015-05-25      Endive         Lemon  0.77346596  TRUE
# 9  17002 2016-10-22 Fiddleheads          Lime  1.09118108  TRUE
# 10 52797 2015-06-29      Frisee        Loquat -0.46491328  TRUE
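The unique-key assumption matters: if an (id, date) pair occurred more than once in df2, the left join would repeat the matching df1 row. If that cannot be ruled out, one possible guard is to reduce df2 to its distinct keys before joining. The following is only a sketch on made-up toy data, and the names keys and result are illustrative.

```r
library(dplyr)

df1 <- data.frame(
  id     = c(1, 1, 2, 3),
  date   = as.Date(c("2016-01-01", "2016-01-02", "2016-01-02", "2016-01-03")),
  metric = c(0.5, 1.2, -0.3, 2.0)
)
# Hypothetical df2 in which the pair (1, 2016-01-02) occurs twice.
df2 <- df1[c(2, 2, 4), ]

# Keep one row per (id, date) key so the join cannot multiply df1 rows.
keys <- df2 %>%
  distinct(id, date) %>%
  mutate(InDF2 = TRUE)

result <- df1 %>%
  left_join(keys, by = c("id", "date")) %>%
  mutate(InDF2 = !is.na(InDF2))

nrow(result) == nrow(df1)
# [1] TRUE
result$InDF2
# [1] FALSE  TRUE FALSE  TRUE
```

With keys that are already unique, as in the generated data above, the distinct() step changes nothing.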

dplyr: lookup or join? How should this be solved?
