I have two data frames. I want to compare them and create a new data frame containing the full names from df1 whose first name is not found in df2.
df1:
names
1 Sally Williams
2 Tom Hacker
3 Jane Turner
4 John Murray
5 Marry Kelly Parker
6 David Carlson Smith
df2:
first_names
1 Kendall
2 Tom
3 Jane
4 Sarah
5 David
I want to create a new data frame holding the names whose first name is not found in df2:
df_new
unique_names
1 Sally Williams
2 John Murray
3 Marry Kelly Parker
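Answer 0: (score: 3)
You can split the strings on whitespace, take the first name, and then keep the names whose first name is not present in first_names of df2.
df1[!sapply(strsplit(df1$names, "\\s+"), `[`, 1) %in% df2$first_names, , drop = FALSE]
# names
#1 Sally Williams
#4 John Murray
A tidyverse approach would work as well.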
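The tidyverse variant is not shown above. A minimal sketch of what it might look like, assuming dplyr and stringr are installed and using the six-row data from the question; the str_extract pattern and the df_new name are illustrative choices, not the answerer's code:

```r
library(dplyr)
library(stringr)

# Data as printed in the question
df1 <- data.frame(
  names = c("Sally Williams", "Tom Hacker", "Jane Turner",
            "John Murray", "Marry Kelly Parker", "David Carlson Smith"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  first_names = c("Kendall", "Tom", "Jane", "Sarah", "David"),
  stringsAsFactors = FALSE
)

df_new <- df1 %>%
  # "^\\S+" grabs everything up to the first whitespace, i.e. the first name
  filter(!str_extract(names, "^\\S+") %in% df2$first_names) %>%
  rename(unique_names = names)

df_new
#         unique_names
# 1     Sally Williams
# 2        John Murray
# 3 Marry Kelly Parker
```

Note that `!x %in% y` parses as `!(x %in% y)` in R, so no extra parentheses are needed.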
Answer 1: (score: 1)
We can use regex_anti_join from fuzzyjoin:
library(fuzzyjoin)
library(dplyr)  # provides %>% and tibble()
regex_anti_join(df1, df2, by = c("names" = "first_names")) %>%
  tibble(unique_names = .)
# A tibble: 2 x 1
# unique_names
# <chr>
#1 Sally Williams
#2 John Murray
It also works with the second, updated dataset:
regex_anti_join(df1N, df2N, by = c("names" = "first_names")) %>%
tibble(unique_names = .)
# A tibble: 3 x 1
# unique_names
# <chr>
#1 Sally Williams
#2 John Murray
#3 Marry Kelly Parker
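One thing to keep in mind: as far as I know, regex_anti_join treats each first_names value as a regular expression that may match anywhere in names, so a short name such as Tom would also knock out a hypothetical "Tomas Reed". The base R sketch below illustrates the difference between an unanchored pattern and one anchored to the first word; the extra row and the variable names are made up for the illustration:

```r
df1 <- data.frame(
  names = c("Sally Williams", "Tom Hacker", "Jane Turner",
            "John Murray", "Tomas Reed"),  # "Tomas Reed" is a made-up row
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  first_names = c("Kendall", "Tom", "Jane", "Sarah"),
  stringsAsFactors = FALSE
)

alternatives <- paste(df2$first_names, collapse = "|")

# Unanchored: matches a first name anywhere in the string
loose <- alternatives
# Anchored: the first name must be the whole first word
strict <- paste0("^(", alternatives, ")\\b")

loose_result  <- df1$names[!grepl(loose,  df1$names, perl = TRUE)]
strict_result <- df1$names[!grepl(strict, df1$names, perl = TRUE)]

loose_result   # "Sally Williams" "John Murray"
strict_result  # "Sally Williams" "John Murray" "Tomas Reed"
```

This sketch assumes the first names contain no regex metacharacters; otherwise they would need escaping before being pasted into a pattern.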
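Another option is to use word to create a first_names column, do an anti_join on it, and then select the original columns:
library(dplyr)
library(stringr)  # provides word()
df1N %>%
  mutate(first_names = word(names, 1)) %>%  # first word of each name
  anti_join(df2N, by = "first_names") %>%   # keep rows with no match in df2N
  select(names(df1N))                       # drop the helper column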
# names
#1 Sally Williams
#2 John Murray
#3 Marry Kelly Parker
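Yet another option is to use word from stringr to extract the first name, use %in% to find the elements that match in the second dataset, negate the result (!), and subset the rows of the first dataset: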
library(stringr)
df1N[!word(df1N$names, 1) %in% df2N$first_names,, drop = FALSE]
# names
#1 Sally Williams
#4 John Murray
#5 Marry Kelly Parker
Data:
df1 <- structure(list(names = c("Sally Williams", "Tom Hacker", "Jane Turner",
"John Murray")), class = "data.frame", row.names = c("1", "2",
"3", "4"))
df2 <- structure(list(first_names = c("Kendall", "Tom", "Jane", "Sarah"
)), class = "data.frame", row.names = c("1", "2", "3", "4"))
df1N <- structure(list(names = c("Sally Williams", "Tom Hacker",
"Jane Turner",
"John Murray", "Marry Kelly Parker", "David Carlson Smith")),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
df2N <- structure(list(first_names = c("Kendall", "Tom", "Jane", "Sarah",
"David")), class = "data.frame", row.names = c("1", "2", "3",
"4", "5"))
Answer 2: (score: -1)
You can use one of the following:
library(dplyr)
setdiff(data_frame_name1, data_frame_name2)
or
semi_join(data_frame_name1, data_frame_name2)
or
anti_join(data_frame_name1, data_frame_name2)
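These calls only help once both data frames share a key column, and semi_join and anti_join return opposite results. A minimal sketch, assuming dplyr is installed and using the four-row data from this thread; the keyed, matched and unmatched names are illustrative:

```r
library(dplyr)

df1 <- data.frame(
  names = c("Sally Williams", "Tom Hacker", "Jane Turner", "John Murray"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  first_names = c("Kendall", "Tom", "Jane", "Sarah"),
  stringsAsFactors = FALSE
)

# Add a key column holding the first word of each full name
keyed <- df1 %>%
  mutate(first_names = sub("\\s.*$", "", names))

# semi_join keeps rows of keyed that HAVE a match in df2
matched <- semi_join(keyed, df2, by = "first_names")

# anti_join keeps rows of keyed that have NO match in df2
unmatched <- anti_join(keyed, df2, by = "first_names")

matched$names    # "Tom Hacker" "Jane Turner"
unmatched$names  # "Sally Williams" "John Murray"
```

For the question asked here, anti_join is the relevant one. setdiff compares whole rows, so to my understanding it expects both data frames to have the same columns and is not a direct fit for df1 and df2 as given.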