Compare two data frames and create a new data frame with the unique elements

Time: 2019-06-28 06:10:37

Tags: r

I have two data frames. I want to compare them and create a new data frame containing the full names from df1 whose first name does not appear in df2.

df1:

   names
 1 Sally Williams
 2 Tom Hacker
 3 Jane Turner
 4 John Murray
 5 Marry Kelly Parker
 6 David Carlson Smith

df2:

  first_names
1 Kendall
2 Tom 
3 Jane 
4 Sarah
5 David

I want to create a new data frame containing the names whose first name is not found in df2:

df_new

  unique_names
1 Sally Williams
2 John Murray
3 Marry Kelly Parker

3 answers:

Answer 0 (score: 3)

You can split the strings on whitespace, take the first name, and then find the names that are not present in first_names of df2.

df1[!sapply(strsplit(df1$names, "\\s+"), `[`, 1) %in% df2$first_names, , drop = FALSE]
#           names
#1 Sally Williams
#4    John Murray

A tidyverse approach would also work.
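The tidyverse variant mentioned above could look roughly like the following. This is a minimal sketch rather than the answerer's own code: it assumes dplyr and stringr are installed, rebuilds the six-row df1 and five-row df2 from the question, and uses df_new as the result name.

```r
library(dplyr)
library(stringr)

# Data as printed in the question (rebuilt here so the sketch is self-contained)
df1 <- data.frame(
  names = c("Sally Williams", "Tom Hacker", "Jane Turner",
            "John Murray", "Marry Kelly Parker", "David Carlson Smith"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  first_names = c("Kendall", "Tom", "Jane", "Sarah", "David"),
  stringsAsFactors = FALSE
)

df_new <- df1 %>%
  # keep rows whose first word is not among the known first names
  filter(!word(names, 1) %in% df2$first_names) %>%
  # rename the column to match the desired output in the question
  rename(unique_names = names)

df_new
```

Here word(names, 1) takes the first space-separated word of each full name, so middle names and surnames do not affect the comparison.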

Answer 1 (score: 1)

We can use regex_anti_join from fuzzyjoin:

library(fuzzyjoin)
regex_anti_join(df1, df2, by = c("names" = "first_names")) %>% 
       tibble(unique_names = .)
# A tibble: 2 x 1
#  unique_names  
#   <chr>         
#1 Sally Williams
#2 John Murray   

It also works with the second, updated dataset:

regex_anti_join(df1N, df2N, by = c("names" = "first_names")) %>% 
       tibble(unique_names = .)
# A tibble: 3 x 1
#  unique_names      
#   <chr>             
#1 Sally Williams    
#2 John Murray       
#3 Marry Kelly Parker

Another option is to create a 'first_names' column with word, do an anti_join, and then select the original columns:

library(dplyr)
library(stringr)  # provides word()
df1N  %>% 
   mutate(first_names = word(names, 1)) %>%
   anti_join(df2N, by = "first_names") %>% 
   select(names(df1N))
#               names
#1     Sally Williams
#2        John Murray
#3 Marry Kelly Parker

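Yet another option is to use word from stringr to extract the first name, use %in% to find the elements that match in the second dataset, negate (!) the result, and subset the rows of the first dataset: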

library(stringr)
df1N[!word(df1N$names, 1) %in% df2N$first_names,, drop = FALSE]
#               names
#1     Sally Williams
#4        John Murray
#5 Marry Kelly Parker
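The df2 printed in the question appears to show trailing spaces after some first names ("Tom ", "Jane "). If the real data is padded like that, an exact %in% comparison would fail to match those entries. A base-R sketch that trims whitespace on both sides first; the padded df2 below is an assumption made for illustration, and first, known and df_new are hypothetical names:

```r
# Full names as in the question
df1 <- data.frame(
  names = c("Sally Williams", "Tom Hacker", "Jane Turner",
            "John Murray", "Marry Kelly Parker", "David Carlson Smith"),
  stringsAsFactors = FALSE
)
# Assumed: some first names carry trailing whitespace
df2 <- data.frame(
  first_names = c("Kendall", "Tom ", "Jane ", "Sarah", "David"),
  stringsAsFactors = FALSE
)

# First word of each full name: drop everything from the first whitespace on
first <- sub("\\s.*$", "", trimws(df1$names))
# Known first names with the padding removed
known <- trimws(df2$first_names)

df_new <- data.frame(
  unique_names = df1$names[!first %in% known],
  stringsAsFactors = FALSE
)

df_new
```

Without the trimws calls, "Tom" would not equal "Tom " and the rows for Tom Hacker and Jane Turner would be kept by mistake.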

Data

df1 <- structure(list(names = c("Sally Williams", "Tom Hacker", "Jane Turner", 
"John Murray")), class = "data.frame", row.names = c("1", "2", 
"3", "4"))

df2 <- structure(list(first_names = c("Kendall", "Tom", "Jane", "Sarah"
)), class = "data.frame", row.names = c("1", "2", "3", "4"))

df1N <- structure(list(names = c("Sally Williams", "Tom Hacker", 
 "Jane Turner", 
"John Murray", "Marry Kelly Parker", "David Carlson Smith")), 
 class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6"))

df2N <- structure(list(first_names = c("Kendall", "Tom", "Jane", "Sarah", 
"David")), class = "data.frame", row.names = c("1", "2", "3", 
"4", "5"))

Answer 2 (score: -1)

library(dplyr)

You can use the following:

setdiff(data_frame_name1, data_frame_name2)

semi_join(data_frame_name1, data_frame_name2)

anti_join(data_frame_name1, data_frame_name2)
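These functions compare the two data frames on shared columns, and df1 and df2 in the question have none, so calling them directly would not work. One way to apply them is to add a key column first. A minimal sketch, assuming dplyr is installed; df1_keyed, not_in_df2 and in_df2 are hypothetical names:

```r
library(dplyr)

df1 <- data.frame(
  names = c("Sally Williams", "Tom Hacker", "Jane Turner",
            "John Murray", "Marry Kelly Parker", "David Carlson Smith"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  first_names = c("Kendall", "Tom", "Jane", "Sarah", "David"),
  stringsAsFactors = FALSE
)

# Add a key column holding the first word of each full name
df1_keyed <- df1 %>%
  mutate(first_names = sub("\\s.*$", "", names))

# anti_join: rows of df1_keyed with NO match in df2 (what the question asks for)
not_in_df2 <- anti_join(df1_keyed, df2, by = "first_names")

# semi_join: rows of df1_keyed that DO have a match in df2 (the complement)
in_df2 <- semi_join(df1_keyed, df2, by = "first_names")

not_in_df2$names
in_df2$names
```

dplyr's setdiff compares whole rows and needs both data frames to have the same columns, so for this task anti_join on the key column is the closer fit.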