How to generate a new id by comparing two data frames

Time: 2018-06-17 08:12:31

Tags: r

I have two data frames. The first one is:

PatientID   Name          DOB          SEX
1000017863  awssV       04-01-1936      F
1000017898  wrafdU      21-03-1971      M
1000017947  asfadfdV    29-04-1949      F
1000018029  dgdbcASK    28-12-1953      F
1000017898  wrafdU      21-03-1971      M
1000018164  adcv  K     22-05-1952      M
1000018181  asfvvR      12-06-1956      M

The second one is an empty table with the following column names:

 ParetID  PatientID    Name       DOB      SEX

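Now I have to compare these two tables by matching on Name, SEX and DOB. If a record has no match, a new auto-incremented ParetID should be created for it, with all the other fields copied over.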

The output should look like:

ParetID    PatientID    Name        DOB          SEX
001       1000017863    awssV      04-01-1936      F
002       1000017898    wrafdU     21-03-1971      M
003       1000017947    asfadfdV   29-04-1949      F
004       1000018029    dgdbcASK   28-12-1953      F
002       1000017898    wrafdU     21-03-1971      M

1 Answer:

Answer 0: (score: 1)

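Your expected result is a bit odd. I would create a parent data.frame, add only new records to it, and copy the parentid over to the other data. That way no duplicates are introduced into the parent data.frame. Here is something you can use.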

Step 1: create the parent data.frame from the initial data.frame (df1)

library(dplyr)

parents <- df1 %>%
  # remove duplicates
  unique() %>% 
  mutate(ParentId = row_number())

   PatientID     Name        DOB SEX ParentId
1 1000017863    awssV 04-01-1936   F        1
2 1000017898   wrafdU 21-03-1971   M        2
3 1000017947 asfadfdV 29-04-1949   F        3
4 1000018029 dgdbcASK 28-12-1953   F        4
5 1000018164  adcv  K 22-05-1952   M        5
6 1000018181   asfvvR 12-06-1956   M        6
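
Note that unique() compares whole rows, so PatientID is part of the comparison too. The question asks for matching on Name, DOB and SEX only. If the same person could show up under two different PatientIDs, something like the following might be closer to what is wanted. This is only a sketch on a small made-up subset of df1, and the choice to keep the first PatientID per person is my assumption, not something the question states.

```r
library(dplyr)

# Small illustrative subset of df1 (same columns as in the question).
df1 <- data.frame(
  PatientID = c(1000017863L, 1000017898L, 1000017947L, 1000017898L),
  Name = c("awssV", "wrafdU", "asfadfdV", "wrafdU"),
  DOB  = c("04-01-1936", "21-03-1971", "29-04-1949", "21-03-1971"),
  SEX  = c("F", "M", "F", "M"),
  stringsAsFactors = FALSE
)

# Keep the first row for every Name/DOB/SEX combination (assumption:
# the first PatientID seen is the one to keep), then number the rows.
parents <- df1 %>%
  distinct(Name, DOB, SEX, .keep_all = TRUE) %>%
  mutate(ParentId = row_number())
```

With the data in the question both approaches give the same parents, because the duplicated row is identical in every column.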

Step 2: add new records to the parent data.frame

parents <- df2 %>% 
  # remove duplicates
  unique() %>% 
  anti_join(parents) %>% 
  # add new rows on the bottom of parents
  bind_rows(parents, .) %>% 
  mutate(ParentId = ifelse(is.na(ParentId), row_number(), ParentId))

   PatientID     Name        DOB SEX ParentId
1 1000017863    awssV 04-01-1936   F        1
2 1000017898   wrafdU 21-03-1971   M        2
3 1000017947 asfadfdV 29-04-1949   F        3
4 1000018029 dgdbcASK 28-12-1953   F        4
5 1000018164  adcv  K 22-05-1952   M        5
6 1000018181   asfvvR 12-06-1956   M        6
7 1000020202     asdf 05-05-1966   F        7     #<<< new record
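
One caveat: filling the missing ids with row_number() only works as long as the existing ParentId values are exactly 1..n with no gaps. If rows were ever removed from parents, a new row could get an id that is already taken. A possible variant, sketched below on made-up data, continues from the current maximum instead and spells out the join columns. The key vector and the gap in the ids are my own illustration, not part of the original data.

```r
library(dplyr)

# Existing parents with a deliberate gap in the ids (illustrative values).
parents <- data.frame(
  PatientID = c(1000017863L, 1000017898L),
  Name = c("awssV", "wrafdU"),
  DOB  = c("04-01-1936", "21-03-1971"),
  SEX  = c("F", "M"),
  ParentId = c(1L, 5L),
  stringsAsFactors = FALSE
)

# Incoming records: one already known, one new.
df2 <- data.frame(
  PatientID = c(1000017863L, 1000020202L),
  Name = c("awssV", "asdf"),
  DOB  = c("04-01-1936", "05-05-1966"),
  SEX  = c("F", "F"),
  stringsAsFactors = FALSE
)

# Columns that identify a person (assumption based on the question).
key <- c("Name", "DOB", "SEX")

# Rows of df2 with no match in parents get ids after the current maximum.
new_rows <- df2 %>%
  distinct(Name, DOB, SEX, .keep_all = TRUE) %>%
  anti_join(parents, by = key) %>%
  mutate(ParentId = max(parents$ParentId) + row_number())

parents <- bind_rows(parents, new_rows)
```

This assumes parents already has at least one row, since max() on an empty column would not give a usable starting id.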

Step 3: to add the parentid to the original data, just use inner_join.

df1 %>% inner_join(parents) 
Joining, by = c("PatientID", "Name", "DOB", "SEX")
   PatientID     Name        DOB SEX ParentId
1 1000017863    awssV 04-01-1936   F        1
2 1000017898   wrafdU 21-03-1971   M        2   #<<<< duplicate entries, same parentid.
3 1000017947 asfadfdV 29-04-1949   F        3
4 1000018029 dgdbcASK 28-12-1953   F        4
5 1000017898   wrafdU 21-03-1971   M        2   #<<<< duplicate entries, same parentid.
6 1000018164  adcv  K 22-05-1952   M        5
7 1000018181   asfvvR 12-06-1956   M        6
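
If the id really has to look like the one in the question (zero-padded, as the first column), the joined result could be reshaped roughly like this. Again just a sketch on a small made-up subset: the three-digit width is read off the example output in the question, and I use left_join with explicit join columns here so that no row of df1 can be dropped silently.

```r
library(dplyr)

# Small illustrative subset of df1, including the duplicated patient.
df1 <- data.frame(
  PatientID = c(1000017863L, 1000017898L, 1000017947L, 1000017898L),
  Name = c("awssV", "wrafdU", "asfadfdV", "wrafdU"),
  DOB  = c("04-01-1936", "21-03-1971", "29-04-1949", "21-03-1971"),
  SEX  = c("F", "M", "F", "M"),
  stringsAsFactors = FALSE
)

# One row per person, numbered in order of first appearance.
parents <- df1 %>%
  distinct(Name, DOB, SEX) %>%
  mutate(ParentId = row_number())

# Attach the id to every original row, pad it to three digits
# and move it to the front.
result <- df1 %>%
  left_join(parents, by = c("Name", "DOB", "SEX")) %>%
  mutate(ParentId = sprintf("%03d", ParentId)) %>%
  select(ParentId, PatientID, Name, DOB, SEX)
```

Keep in mind that after sprintf() the ParentId column is character, so it is better to do the padding as the very last step.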

Data:

df1 <- structure(list(PatientID = c(1000017863L, 1000017898L, 1000017947L, 
                             1000018029L, 1000017898L, 1000018164L, 1000018181L), 
               Name = c("awssV","wrafdU", "asfadfdV", "dgdbcASK", "wrafdU", "adcv  K", "asfvvR"),
               DOB = c("04-01-1936", "21-03-1971", "29-04-1949", "28-12-1953", 
                                        "21-03-1971", "22-05-1952", "12-06-1956"),
               SEX = c("F", "M", "F", "F", "M", "M", "M")), 
          class = "data.frame", row.names = c(NA, -7L))

df2 <- structure(list(PatientID = c(1000017863L, 1000017898L, 1000020202L), 
               Name = c("awssV", "wrafdU", "asdf"), 
               DOB = c("04-01-1936", "21-03-1971", "05-05-1966"), 
               SEX = c("F", "M", "F")), 
          class = "data.frame", row.names = c(NA, -3L))