Question

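I'm trying to create a variable in a long dataset (df1) where the value in each row needs to be based on matching some conditions in another long dataset (df2). The conditions are:

- Match on "name".
- The value for df1 should only consider observations for that person that occurred before the observation in df1.
- I then need the number of rows in that subset that satisfy a third condition (called "condition" in the data below).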

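I've tried running a for loop (I know, not preferred in R) written for each row in 1:nrow(df1), but I keep running into the problem that, in my actual data, df1 and df2 do not have the same length (or lengths that are multiples of each other).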

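I also tried writing a function and applying it to df1. I tried using apply, but apply's syntax does not accept two data frames. I tried giving it a list of the data frames and using lapply, but it returned NULL values.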

Here is some generic data in the same format as the data I'm working with.

df1 <- data.frame(
  name = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
  date_b = sample(seq(as.Date('2014/01/01'), as.Date('2019/10/01'), by="day"), 4))

df2 <- data.frame(
  name = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
  date_a = sample(seq(as.Date('2014/01/01'), as.Date('2019/10/01'), by="day"), 4),
  condition = c("A", "B", "C", "A")
)

I know the way to get the row count might look something like this:

num_conditions <- nrow(df2[which(df1$name == df2$name & df2$date_a < df1$date_b & df2$condition == "A"), ])

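I'd like to see a column in df1 called "num_conditions" that shows the number of observations in df2 for that person that occurred before date_b in df1 and satisfy condition "A".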
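The one-line expression above compares whole columns of df1 and df2 elementwise, which only lines up when the two data frames have compatible lengths. A minimal base R sketch of the per-row count described here, assuming fixed dates (copied from the sample printed in Answer 1, in place of sample()) so the result is reproducible:

```r
# Illustrative data with fixed dates (an assumption, replacing sample()).
df1 <- data.frame(
  name = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
  date_b = as.Date(c("2016-10-13", "2015-11-10", "2016-07-18", "2018-03-13")),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  name = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
  date_a = as.Date(c("2015-04-16", "2014-09-27", "2017-04-25", "2015-08-20")),
  condition = c("A", "A", "C", "A"),
  stringsAsFactors = FALSE
)

# For each row i of df1, compare that single row against all of df2
# and count the rows of df2 that satisfy all three conditions.
df1$num_conditions <- vapply(
  seq_len(nrow(df1)),
  function(i) {
    sum(df2$name == df1$name[i] &
          df2$date_a < df1$date_b[i] &
          df2$condition == "A")
  },
  integer(1)
)

print(df1)
```

Iterating over row indices lets the function see both data frames at once, so df1 and df2 can have any number of rows.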

df1 should look something like this:

name          date_b    num_conditions
John Smith    10/1/15           1
John Smith    11/15/16          0
John Smith    9/19/19           0

Answer 1

I'm sure there are better ways, including with data.table, but here is one way using dplyr:

library(dplyr)

set.seed(12)

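df2 %>%
  filter(condition == "A") %>%                   # keep only the "A" observations
  right_join(df1, by = "name") %>%               # pair each df1 row with that person's "A" rows
  group_by(name, date_b) %>%
  filter(date_a < date_b) %>%                    # keep only observations made before date_b
  mutate(num_conditions = n()) %>%               # count what is left per (name, date_b)
  right_join(df1, by = c("name", "date_b")) %>%  # restore df1 rows that had no match
  mutate(num_conditions = coalesce(num_conditions, 0L)) %>%  # no match means a count of 0
  select(-c(date_a, condition)) %>%
  distinct()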

# A tibble: 4 x 3
# Groups:   name, date_b [4]
  name       date_b     num_conditions
  <fct>      <date>              <int>
1 John Smith 2016-10-13              2
2 John Smith 2015-11-10              2
3 Jane Smith 2016-07-18              1
4 Jane Smith 2018-03-13              1

R> df1
        name     date_b
1 John Smith 2016-10-13
2 John Smith 2015-11-10
3 Jane Smith 2016-07-18
4 Jane Smith 2018-03-13

R> df2
        name     date_a condition
1 John Smith 2015-04-16         A
2 John Smith 2014-09-27         A
3 Jane Smith 2017-04-25         C
4 Jane Smith 2015-08-20         A
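For the data.table route mentioned at the top of this answer, a non-equi join is one possible approach. The following is only a sketch under the assumption that the data.table package is installed; the names dt1, dt2 and res are made up for illustration, and the data is the sample printed above.

```r
library(data.table)

# Same sample data as printed above, built as data.tables.
dt1 <- data.table(
  name = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
  date_b = as.Date(c("2016-10-13", "2015-11-10", "2016-07-18", "2018-03-13"))
)
dt2 <- data.table(
  name = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
  date_a = as.Date(c("2015-04-16", "2014-09-27", "2017-04-25", "2015-08-20")),
  condition = c("A", "A", "C", "A")
)

# Non-equi join: for each row of dt1, count the "A" rows of dt2 with the
# same name and a date_a strictly before that row's date_b.
res <- dt2[condition == "A"][dt1, on = .(name, date_a < date_b), .N, by = .EACHI]

# res has one row per dt1 row, in dt1 order, so N can be attached directly.
dt1[, num_conditions := res$N]

print(dt1)
```

With by = .EACHI the count is computed once per row of dt1, so no intermediate joined table has to be filtered and de-duplicated afterwards.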

Answer 2

Perhaps the following is what the question asks for.

library(tidyverse)

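df1 %>%
  # pair each df1 row with that person's "A" observations
  left_join(df2 %>% filter(condition == 'A'), by = 'name') %>%
  # keep only observations made before date_b
  filter(date_a < date_b) %>%
  # note: grouping by name alone makes n() count a person's matched rows
  # across all of their date_b values, not per df1 row
  group_by(name) %>%
  mutate(num_conditions = n()) %>%
  select(-date_a, -condition) %>%
  # bring back df1 rows that had no match and set their count to 0
  full_join(df1) %>%
  mutate(num_conditions = ifelse(is.na(num_conditions), 0, num_conditions))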
#Joining, by = c("name", "date_b")
## A tibble: 4 x 3
## Groups:   name [2]
#  name       date_b     num_conditions
#  <fct>      <date>              <dbl>
#1 John Smith 2019-05-07              2
#2 John Smith 2019-02-05              2
#3 Jane Smith 2016-05-03              0
#4 Jane Smith 2018-06-23              0
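Because the pipeline above groups by name only, n() counts a person's matched rows across all of their date_b values. If a separate count per df1 row is wanted, one possible variation is sketched below. It assumes dplyr 1.0 or later (for the .groups argument), uses the fixed sample data printed in Answer 1 rather than sample(), and introduces a helper column row_id that is not part of the original data.

```r
library(dplyr)

# Fixed sample data (an assumption, taken from the printout in Answer 1).
df1 <- data.frame(
  name = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
  date_b = as.Date(c("2016-10-13", "2015-11-10", "2016-07-18", "2018-03-13")),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  name = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
  date_a = as.Date(c("2015-04-16", "2014-09-27", "2017-04-25", "2015-08-20")),
  condition = c("A", "A", "C", "A"),
  stringsAsFactors = FALSE
)

result <- df1 %>%
  # row_id is a hypothetical helper that keeps every df1 row distinct
  mutate(row_id = row_number()) %>%
  # pair each df1 row with that person's "A" observations (NA if none)
  left_join(filter(df2, condition == "A"), by = "name") %>%
  group_by(row_id, name, date_b) %>%
  # count earlier observations; rows without any match contribute 0
  summarise(num_conditions = sum(date_a < date_b, na.rm = TRUE),
            .groups = "drop") %>%
  select(-row_id)

print(result)
```

Counting with sum() inside summarise() keeps df1 rows that have no earlier "A" observation, so no second join or NA replacement is needed.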

Create a variable based on matching conditions in two datasets
