Create a new variable based on matches between groups within a group

Time: 2019-04-04 22:20:42

Tags: r

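I have a dataset of participants who reported to a clinic at two time points (episode 1 and episode 2).

At each of the two visits, they were examined to determine which parasite strains were infecting them, i.e.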

df_1 <- structure(list(PID = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), parasite = c("parasite_1", 
"parasite_2", "parasite_1", "parasite_1", "parasite_2", "parasite_3", 
"parasite_4", "parasite_5"), episode = c("first_episode", "first_episode", 
"second_episode", "first_episode", "first_episode", "first_episode", 
"second_episode", "second_episode")), row.names = c(NA, -8L), class = c("data.table", 
"data.frame"))

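From the dataset:

Patient 1 carried 2 parasites at the first visit (parasites 1 and 2), but at the second visit they carried only 1 parasite (parasite 1), which matches one of the parasites from the first episode.

Patient 2 carried 3 parasites at the first visit (parasites 1, 2 and 3), but at the second visit they carried 2 parasites (parasites 4 and 5), neither of which matches any parasite from the first episode.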
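The overlap described for the two patients can be checked directly. The following is only a sketch in base R: the vector names (`pid`, `parasite`, `episode`, `overlap`) are chosen here for illustration, and it assumes every patient has rows for both episodes.

```r
# Same values as df_1, held as plain vectors (illustrative names).
pid <- c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L)
parasite <- c("parasite_1", "parasite_2", "parasite_1", "parasite_1",
              "parasite_2", "parasite_3", "parasite_4", "parasite_5")
episode <- c("first_episode", "first_episode", "second_episode",
             "first_episode", "first_episode", "first_episode",
             "second_episode", "second_episode")

is_first  <- episode == "first_episode"
is_second <- episode == "second_episode"

# Parasites per patient, separately for each episode.
first_by_pid  <- split(parasite[is_first],  pid[is_first])
second_by_pid <- split(parasite[is_second], pid[is_second])

# Second-episode parasites that were already seen in the first episode.
# Assumes both lists contain the same patients in the same order.
overlap <- Map(intersect, second_by_pid, first_by_pid)
```

Here `overlap[["1"]]` holds `"parasite_1"`, while `overlap[["2"]]` is empty, which matches the description of the two patients.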

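I need help writing a script that creates a new variable (infection) and fills it with "same" if, during the second episode, the patient presented with a parasite that was also present in the first episode, and with "different" if all parasites in the second episode differ from those in the first episode, i.e.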

df_2 <- structure(list(PID = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), parasite = c("parasite_1", 
"parasite_2", "parasite_1", "parasite_1", "parasite_2", "parasite_3", 
"parasite_4", "parasite_5"), episode = c("first_episode", "first_episode", 
"second_episode", "first_episode", "first_episode", "first_episode", 
"second_episode", "second_episode"), infeciton = c("same", "same", 
"same", "different", "different", "different", "different", "different"
)), row.names = c(NA, -8L), class = c("data.table", "data.frame"))

1 answer:

Answer 0 (score: 1)

Not the best way to do it, but the logic should be easy to follow:

patients <- unique(df_1$PID)
df_3 <- df_1
df_3$infection <- NA
for (patient in patients){

  # collect this patient's parasites from each episode as two character vectors
  first <- df_1[which(df_1$PID == patient & df_1$episode == "first_episode"), ]
  first <- first$parasite
  second <- df_1[which(df_1$PID == patient & df_1$episode == "second_episode"), ]
  second <- second$parasite

  # default to "different"; switch to "same" as soon as any second-episode
  # parasite is also found among the first-episode parasites
  infection <- "different"
  for (parasite in second){
    if (parasite %in% first) {infection <- "same"}
  }
  df_3[which(df_3$PID == patient), "infection"] <- infection
}


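# correcting the typo in colname in df_2 ("infeciton" -> "infection"):
df_2$infection <- df_2$infeciton
# keep only the correctly named columns; selecting columns with df_2[c(...)]
# relies on the data.frame method, so this assumes data.table is not loaded
df_2 <- df_2[c("PID",   "parasite", "episode", "infection")]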

# comparing the df_2 and df_3
identical(df_2, df_3)
# [1] TRUE
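
The same logic can also be written without the explicit loops. What follows is a minimal sketch rather than a definitive solution: it assumes base R only, rebuilds the data as a plain data.frame named `df` (a name chosen here for illustration), and uses a helper vector `shared` that is not part of the original answer.

```r
# Rebuild the question's data as a plain data.frame (assumption: base R only).
df <- data.frame(
  PID = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
  parasite = c("parasite_1", "parasite_2", "parasite_1", "parasite_1",
               "parasite_2", "parasite_3", "parasite_4", "parasite_5"),
  episode = c("first_episode", "first_episode", "second_episode",
              "first_episode", "first_episode", "first_episode",
              "second_episode", "second_episode"),
  stringsAsFactors = FALSE
)

# One logical flag per patient: TRUE if any second-episode parasite
# also appears among that patient's first-episode parasites.
shared <- sapply(split(df, df$PID), function(d) {
  first  <- d$parasite[d$episode == "first_episode"]
  second <- d$parasite[d$episode == "second_episode"]
  any(second %in% first)
})

# Look up each row's patient flag by name and translate it into a label.
df$infection <- ifelse(unname(shared[as.character(df$PID)]),
                       "same", "different")
```

Like the loop above, this labels every row of a patient, not just the second-episode rows, so the result has the same shape as the desired df_2.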