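I am using R to match names across two datasets, and I want to compare strings. I have two data frames of strings; each contains a location ID (not unique) and people's full names. For some people, the first data frame has the full name, which may include two last names. The second data frame has the same location codes (again not unique), but its last name can be only one of the two (which of the two is always random).
What I want to do is run grep() row by row over the first data frame and get the matching search results from the second data frame. My approach is as follows: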
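Use the paste() function to paste together the location ID and the first name. This helps with matching, but I still need to match the last name (which can be either of the last names). Let's call this new vector location_first.
Apply strsplit() to the last name column. Some elements of the resulting list will contain only one item, while others (people with two last names) will contain two. Let's call this list strsplit_ln.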
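Then do a second paste inside a loop: paste the first element of strsplit_ln onto location_first and grep for it, then move on to the next element of strsplit_ln and grep for that. I would like the complete grep search results that appear in the console to be written to a text file.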
Here is the process, step by step, that I would like to turn into a loop (or nested loop):
# prepare the test data
names_df1 = data.frame(location = c(1530, 6801, 1530, 6801, 1967),
                       first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
                       last_name = c("Williams", "Johnson Clarke", "Lopez Gutierrez", "Mar", "Yoon"),
                       stringsAsFactors = FALSE)
names_df2 = data.frame(location = c(1530, 6801, 1530, 6801, 1967),
                       first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
                       last_name = c("Williams", "Clarke", "Lopez", "Mar", "Yoon"),
                       stringsAsFactors = FALSE)
# Step 1: paste id and first name. Location ID and First Name are identical in both data frames. I will paste the last name in the second step.
location_name_df1 = paste(names_df1$location, names_df1$first_name)
location_name_df2 = paste(names_df2$location, names_df2$first_name, names_df2$last_name)
# Step 2: string split the last names in df1. I want a loop to go through each element and subelement of this list.
last_name_strsplit = strsplit(names_df1$last_name, split = " ")
# these are what I would be searching. Note that in the loop, I go search through each sub element v of the ith element in the list.
# paste(location_name_df1[i], last_name_strsplit[[i]][v])
paste(location_name_df1[1], last_name_strsplit[[1]][1])
paste(location_name_df1[2], last_name_strsplit[[2]][1])
paste(location_name_df1[2], last_name_strsplit[[2]][2])
paste(location_name_df1[3], last_name_strsplit[[3]][1])
paste(location_name_df1[3], last_name_strsplit[[3]][2])
paste(location_name_df1[4], last_name_strsplit[[4]][1])
paste(location_name_df1[5], last_name_strsplit[[5]][1])
# this is the actual search I would like to do. I paste the location_name_df1 with the last names in last_name_strsplit, going through each element (i), as well as each sub element (v)
names_df1[grep(paste(location_name_df1[1], last_name_strsplit[[1]][1]),location_name_df2),] # search result successful
names_df1[grep(paste(location_name_df1[2], last_name_strsplit[[2]][1]),location_name_df2),] # search result NOT successful. Note that this part of the list has two elements. Loop should jump to the second sub element of last_name_strsplit
names_df1[grep(paste(location_name_df1[2], last_name_strsplit[[2]][2]),location_name_df2),] # This search result was successful
names_df1[grep(paste(location_name_df1[3], last_name_strsplit[[3]][1]),location_name_df2),] # search result successful
names_df1[grep(paste(location_name_df1[3], last_name_strsplit[[3]][2]),location_name_df2),] # search result NOT successful. Note that this part of the list has two elements. End of sub elements, move on to the next row
names_df1[grep(paste(location_name_df1[4], last_name_strsplit[[4]][1]),location_name_df2),] # search result successful
names_df1[grep(paste(location_name_df1[5], last_name_strsplit[[5]][1]),location_name_df2),] # search result successful
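I am fairly sure I need a nested loop structure in which I go through each element of the list (i) and then through each sub element (v). However, whenever I write the nested loop, I end up producing a large number of duplicated pastes, and the search itself goes wrong.
Could someone give me some pointers on how to build a loop structure from the steps above? Again, I am using R / RStudio to match the data.
Thanks!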
Answer 0 (score: 1)
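Here is a simpler approach. First we join the two data frames on location and first name (merge() does an inner join by default), then use stringr::str_detect (which, unlike grep, is vectorized over both the string and the pattern) to keep only the rows where the last name from the second data frame is one of the last names from the first: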
full = merge(names_df1, names_df2, by = c("location", "first_name"))
library(stringr)
matches = full[str_detect(string = full$last_name.x, pattern = fixed(full$last_name.y)), ]
matches
#   location first_name     last_name.x last_name.y
# 1     1530       Axel        Williams    Williams
# 2     1530     Carlos Lopez Gutierrez       Lopez
# 3     1967       Jong            Yoon        Yoon
# 4     6801       Bill  Johnson Clarke      Clarke
# 5     6801     Flavio             Mar         Mar
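One caveat worth illustrating: str_detect with fixed() tests for a substring, so a short last name such as "Mar" would also be found inside a longer one such as "Martinez". The following is a minimal base R sketch of a whole-name comparison, assuming the last names in the first data frame are separated by single spaces. The helper name is_match and the "Martinez Lopez" example are illustrative and not part of the original answer.

```r
# Test data copied from the question
names_df1 <- data.frame(location = c(1530, 6801, 1530, 6801, 1967),
                        first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
                        last_name = c("Williams", "Johnson Clarke", "Lopez Gutierrez", "Mar", "Yoon"),
                        stringsAsFactors = FALSE)
names_df2 <- data.frame(location = c(1530, 6801, 1530, 6801, 1967),
                        first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
                        last_name = c("Williams", "Clarke", "Lopez", "Mar", "Yoon"),
                        stringsAsFactors = FALSE)

# Inner join on location and first name (merge() default)
full <- merge(names_df1, names_df2, by = c("location", "first_name"))

# For each row, split last_name.x on spaces and check whether
# last_name.y equals one of the resulting whole names
is_match <- mapply(
  function(x, y) y %in% strsplit(x, " ", fixed = TRUE)[[1]],
  full$last_name.x, full$last_name.y,
  USE.NAMES = FALSE
)
matches <- full[is_match, ]

# Substring matching versus whole-name matching on an illustrative value
substring_hit  <- grepl("Mar", "Martinez Lopez", fixed = TRUE)
whole_name_hit <- "Mar" %in% strsplit("Martinez Lopez", " ", fixed = TRUE)[[1]]
```

On the test data both approaches keep all five rows; they differ only when one last name is contained inside another, as in the last two lines.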
If you prefer dplyr, you can do it like this:
library(dplyr)
library(stringr)
# Rows without a partner get NA last names; filter() drops them
full_join(names_df1, names_df2, by = c("location", "first_name")) %>%
  filter(str_detect(string = last_name.x, pattern = fixed(last_name.y)))
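For readers who still want the explicit nested loop described in the question, here is a rough sketch under a few assumptions. It swaps grep() for an exact string comparison (==) so that regex metacharacters and partial matches cannot interfere, it stops at the first last name that matches, and it writes the result to a temporary file. The names key_df1, key_df2, hits, result and out_file are made up for this illustration.

```r
# Test data copied from the question
names_df1 <- data.frame(location = c(1530, 6801, 1530, 6801, 1967),
                        first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
                        last_name = c("Williams", "Johnson Clarke", "Lopez Gutierrez", "Mar", "Yoon"),
                        stringsAsFactors = FALSE)
names_df2 <- data.frame(location = c(1530, 6801, 1530, 6801, 1967),
                        first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
                        last_name = c("Williams", "Clarke", "Lopez", "Mar", "Yoon"),
                        stringsAsFactors = FALSE)

# Step 1: build the search keys
key_df1 <- paste(names_df1$location, names_df1$first_name)
key_df2 <- paste(names_df2$location, names_df2$first_name, names_df2$last_name)

# Step 2: split the last names of df1 into their parts
last_name_parts <- strsplit(names_df1$last_name, " ", fixed = TRUE)

# Step 3: outer loop over rows of df1, inner loop over the last name parts
hits <- list()
for (i in seq_along(last_name_parts)) {
  for (part in last_name_parts[[i]]) {
    candidate <- paste(key_df1[i], part)
    idx <- which(key_df2 == candidate)  # exact comparison instead of grep()
    if (length(idx) > 0) {
      hits[[length(hits) + 1]] <- data.frame(df1_row = i,
                                             df2_row = idx,
                                             matched_on = part,
                                             stringsAsFactors = FALSE)
      break  # stop at the first matching last name for this row
    }
  }
}
result <- do.call(rbind, hits)

# Write the matches to a text file (temporary path used here as a placeholder)
out_file <- tempfile(fileext = ".txt")
write.table(result, out_file, sep = "\t", row.names = FALSE, quote = FALSE)
```

Collecting the hits in a list and binding them once at the end avoids the repeated pasting and growing of data frames that tends to make nested loops hard to debug. The join-based versions above remain the shorter route when the two data frames share key columns.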