I have two data frames. The first one is:
PatientID Name DOB SEX
1000017863 awssV 04-01-1936 F
1000017898 wrafdU 21-03-1971 M
1000017947 asfadfdV 29-04-1949 F
1000018029 dgdbcASK 28-12-1953 F
1000017898 wrafdU 21-03-1971 M
1000018164 adcv K 22-05-1952 M
1000018181 asfvvR 12-06-1956 M
The other is an empty table with the column names
ParetID PatientID Name DOB SEX
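Now I have to compare the two tables by matching on Name, SEX, and DOB. If a record has no match, a new auto-incremented ParetID should be created for it, with all other fields copied over.
The output should look like this: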
ParetID PatientID Name DOB SEX
001 1000017863 awssV 04-01-1936 F
002 1000017898 wrafdU 21-03-1971 M
003 1000017947 asfadfdV 29-04-1949 F
004 1000018029 dgdbcASK 28-12-1953 F
002 1000017898 wrafdU 21-03-1971 M
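Answer 0 (score: 1)
Your expected result is a little odd. I would create a parent data.frame, add only new records to it, and copy the ParentId onto the other data, so that no duplicates are introduced into the parent data.frame. Here is something you can use.
Step 1: Create the parent data.frame from the initial data.frame (df1)
library(dplyr)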
parents <- df1 %>%
# remove duplicates
unique() %>%
mutate(ParentId = row_number())
PatientID Name DOB SEX ParentId
1 1000017863 awssV 04-01-1936 F 1
2 1000017898 wrafdU 21-03-1971 M 2
3 1000017947 asfadfdV 29-04-1949 F 3
4 1000018029 dgdbcASK 28-12-1953 F 4
5 1000018164 adcv K 22-05-1952 M 5
6 1000018181 asfvvR 12-06-1956 M 6
Step 2: Add new records to the parent data.frame
parents <- df2 %>%
# remove duplicates
unique() %>%
anti_join(parents) %>%
# add new rows on the bottom of parents
bind_rows(parents, .) %>%
mutate(ParentId = ifelse(is.na(ParentId), row_number(), ParentId))
PatientID Name DOB SEX ParentId
1 1000017863 awssV 04-01-1936 F 1
2 1000017898 wrafdU 21-03-1971 M 2
3 1000017947 asfadfdV 29-04-1949 F 3
4 1000018029 dgdbcASK 28-12-1953 F 4
5 1000018164 adcv K 22-05-1952 M 5
6 1000018181 asfvvR 12-06-1956 M 6
7 1000020202 asdf 05-05-1966 F 7 #<<< new record
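The `ifelse(is.na(ParentId), row_number(), ParentId)` line in Step 2 gives correct IDs as long as every existing ParentId equals its row position. If rows are ever removed or reordered, continuing from the current maximum may be safer. The following is a minimal base R sketch of that variant, not part of the original answer; the helper `make_key` and the `|` separator are assumptions made for illustration, and the data are a shortened copy of the sample records.

```r
# Existing parents (a shortened copy of the sample data).
parents <- data.frame(
  PatientID = c(1000017863L, 1000017898L),
  Name = c("awssV", "wrafdU"),
  DOB = c("04-01-1936", "21-03-1971"),
  SEX = c("F", "M"),
  ParentId = c(1L, 2L),
  stringsAsFactors = FALSE
)

# Incoming records: two already known, one new.
df2 <- data.frame(
  PatientID = c(1000017863L, 1000017898L, 1000020202L),
  Name = c("awssV", "wrafdU", "asdf"),
  DOB = c("04-01-1936", "21-03-1971", "05-05-1966"),
  SEX = c("F", "M", "F"),
  stringsAsFactors = FALSE
)

cols <- c("PatientID", "Name", "DOB", "SEX")

# Hypothetical helper: one string key per row.
# Assumes "|" never occurs inside the values themselves.
make_key <- function(d) do.call(paste, c(d[cols], sep = "|"))

# Keep only rows of df2 whose key is not already in parents, without duplicates.
new_rows <- unique(df2[!(make_key(df2) %in% make_key(parents)), cols, drop = FALSE])

# Continue numbering from the largest existing ParentId.
new_rows$ParentId <- max(parents$ParentId, 0L) + seq_len(nrow(new_rows))

parents <- rbind(parents, new_rows)
print(parents)
```

With these inputs only the `asdf` record is appended, and it receives the next free ParentId.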
Step 3: To attach the ParentId to the original data, just use inner_join.
df1 %>% inner_join(parents)
Joining, by = c("PatientID", "Name", "DOB", "SEX")
PatientID Name DOB SEX ParentId
1 1000017863 awssV 04-01-1936 F 1
2 1000017898 wrafdU 21-03-1971 M 2 #<<<< duplicate entries, same parentid.
3 1000017947 asfadfdV 29-04-1949 F 3
4 1000018029 dgdbcASK 28-12-1953 F 4
5 1000017898 wrafdU 21-03-1971 M 2 #<<<< duplicate entries, same parentid.
6 1000018164 adcv K 22-05-1952 M 5
7 1000018181 asfvvR 12-06-1956 M 6
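The joins above match on every shared column, including PatientID, whereas the question asks to match on Name, DOB, and SEX only and shows zero-padded IDs such as 001. A rough base R sketch of that variant follows; it is an illustration rather than the answer's method, and it assumes the `|` separator never appears in the data and that three digits are enough for the IDs.

```r
# Sample data copied from the answer's df1.
df1 <- data.frame(
  PatientID = c(1000017863L, 1000017898L, 1000017947L, 1000018029L,
                1000017898L, 1000018164L, 1000018181L),
  Name = c("awssV", "wrafdU", "asfadfdV", "dgdbcASK", "wrafdU", "adcv K", "asfvvR"),
  DOB = c("04-01-1936", "21-03-1971", "29-04-1949", "28-12-1953",
          "21-03-1971", "22-05-1952", "12-06-1956"),
  SEX = c("F", "M", "F", "F", "M", "M", "M"),
  stringsAsFactors = FALSE
)

# One key per row, built from the three matching columns only.
# Assumes "|" does not occur in Name, DOB, or SEX.
key <- paste(df1$Name, df1$DOB, df1$SEX, sep = "|")

# match() against the unique keys numbers them in order of first appearance,
# so a repeated patient gets the ID of its first occurrence.
id <- match(key, unique(key))

# Zero-pad to three digits to mimic the "001" style shown in the question.
result <- data.frame(ParetID = sprintf("%03d", id), df1, stringsAsFactors = FALSE)
print(result)
```

Here the second `wrafdU` row reuses the ID of the first one, and the remaining rows are numbered in order of appearance.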
Data:
df1 <- structure(list(PatientID = c(1000017863L, 1000017898L, 1000017947L,
1000018029L, 1000017898L, 1000018164L, 1000018181L),
Name = c("awssV","wrafdU", "asfadfdV", "dgdbcASK", "wrafdU", "adcv K", "asfvvR"),
DOB = c("04-01-1936", "21-03-1971", "29-04-1949", "28-12-1953",
"21-03-1971", "22-05-1952", "12-06-1956"),
SEX = c("F", "M", "F", "F", "M", "M", "M")),
class = "data.frame", row.names = c(NA, -7L))
df2 <- structure(list(PatientID = c(1000017863L, 1000017898L, 1000020202L),
Name = c("awssV", "wrafdU", "asdf"),
DOB = c("04-01-1936", "21-03-1971", "05-05-1966"),
SEX = c("F", "M", "F")),
class = "data.frame", row.names = c(NA, -3L))