Question

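I am using R to match names across two different data sets, and I want to compare strings. I basically have two data frames of strings, each containing a location ID (not unique) and people's full names. For some people, one data frame has the full name, which may include two last names. The other data frame has the same location codes (not unique), but its last name is only one of the two (which one is always picked at random).

What I want to do is run grep() row by row over the first data frame and get the search results from the second data frame as output. My approach is the following: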

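Use the paste() function to paste together the location ID and the first name. This helps with the matching, but I still need to match the last name (which can be either one of the last names). Let's call this new vector location_first.
Use strsplit() on the last-name column. Some elements of the resulting list will contain only one item, while others (people with two last names) will contain two. Let's call this list strsplit_ln.
Then do a second paste inside a loop: paste the first element of strsplit_ln with location_first and grep for it, then move on to the next element of strsplit_ln and grep for that. I would like to write the full grep search results that appear in the console to a text file.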

Here is the process I would like to carry out step by step in the form of a loop (or nested loop):

# prepare the test data
names_df1 = data.frame(location = c(1530, 6801, 1530, 6801, 1967),
                       first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
                       last_name = c("Williams", "Johnson Clarke", "Lopez Gutierrez",  "Mar", "Yoon"), stringsAsFactors = F)

names_df2 = data.frame(location = c(1530, 6801, 1530, 6801, 1967),
                       first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
                       last_name = c("Williams", "Clarke", "Lopez", "Mar", "Yoon"), stringsAsFactors = F)


# Step 1: paste id and first name. Location ID and First Name are identical in both data frames. I will paste the last name in the second step. 
location_name_df1 = paste(names_df1$location, names_df1$first_name)
location_name_df2 = paste(names_df2$location, names_df2$first_name, names_df2$last_name)


# Step 2: string split the last names in df1. I want a loop to go through each element and subelement of this list. 
last_name_strsplit = strsplit(names_df1$last_name, split = " ")


          # these are what I would be searching. Note that in the loop, I go search through each sub element v of the ith element in the list.
          # paste(location_name_df1[i], last_name_strsplit[[i]][v])
          paste(location_name_df1[1], last_name_strsplit[[1]][1])

          paste(location_name_df1[2], last_name_strsplit[[2]][1])
          paste(location_name_df1[2], last_name_strsplit[[2]][2])

          paste(location_name_df1[3], last_name_strsplit[[3]][1])
          paste(location_name_df1[3], last_name_strsplit[[3]][2])

          paste(location_name_df1[4], last_name_strsplit[[4]][1])

          paste(location_name_df1[5], last_name_strsplit[[5]][1])


    # this is the actual search I would like to do. I paste the location_name_df1 with the last names in last_name_strsplit, going through each element (i), as well as each sub element (v)
    names_df1[grep(paste(location_name_df1[1], last_name_strsplit[[1]][1]),location_name_df2),] # search result successful

    names_df1[grep(paste(location_name_df1[2], last_name_strsplit[[2]][1]),location_name_df2),] # search result NOT successful. Note that this part of the list has two elements. Loop should jump to the second sub element of last_name_strsplit
    names_df1[grep(paste(location_name_df1[2], last_name_strsplit[[2]][2]),location_name_df2),] # This search result was successful

    names_df1[grep(paste(location_name_df1[3], last_name_strsplit[[3]][1]),location_name_df2),] # search result successful
    names_df1[grep(paste(location_name_df1[3], last_name_strsplit[[3]][2]),location_name_df2),] # search result NOT successful. Note that this part of the list has two elements. End of sub elements, move on to the next row

    names_df1[grep(paste(location_name_df1[4], last_name_strsplit[[4]][1]),location_name_df2),] # search result successful

    names_df1[grep(paste(location_name_df1[5], last_name_strsplit[[5]][1]),location_name_df2),] # search result successful

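I am fairly sure I need a nested loop structure in which I iterate over each element of the list (i) and then over each sub-element of that element (v). But when I write the nested loop, I tend to end up with a large number of duplicated pastes, and the search itself goes wrong.

Could someone give me some pointers on how to build a loop structure from the steps above? Again, I am using R / RStudio to match the data.

Thanks!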

Answer 1

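Here is a simpler approach. First, we join the two data frames on both location and first name, then use stringr::str_detect (which, unlike grep, is vectorized over both the string and the pattern) to drop the rows where the single last name is not one of the full last names. Note that merge() performs an inner join unless all = TRUE is passed; for this data the result is the same as a full join because every row has a partner: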

full = merge(names_df1, names_df2, by = c("location", "first_name"))

library(stringr)
matches = full[str_detect(string = full$last_name.x, pattern = fixed(full$last_name.y)), ]
matches           
#   location first_name     last_name.x last_name.y
# 1     1530       Axel        Williams    Williams
# 2     1530     Carlos Lopez Gutierrez       Lopez
# 3     1967       Jong            Yoon        Yoon
# 4     6801       Bill  Johnson Clarke      Clarke
# 5     6801     Flavio             Mar         Mar
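
One caveat: fixed() compares substrings, so a short last name such as "Mar" would also be found inside a longer one such as "Martinez". If that matters, a whole-word comparison could be sketched in base R as follows. The extra "Dana" row is hypothetical data added only to show the pitfall, and whole_word and matches_exact are names made up for this illustration.

```r
# Test data from the question, plus one hypothetical row ("Dana")
# that exists only to demonstrate the substring pitfall.
names_df1 <- data.frame(
  location   = c(1530, 6801, 1530, 6801, 1967, 1967),
  first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong", "Dana"),
  last_name  = c("Williams", "Johnson Clarke", "Lopez Gutierrez",
                 "Mar", "Yoon", "Martinez Ruiz"),
  stringsAsFactors = FALSE
)
names_df2 <- data.frame(
  location   = c(1530, 6801, 1530, 6801, 1967, 1967),
  first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong", "Dana"),
  last_name  = c("Williams", "Clarke", "Lopez", "Mar", "Yoon", "Mar"),
  stringsAsFactors = FALSE
)

full <- merge(names_df1, names_df2, by = c("location", "first_name"))

# Split each full last name into its parts, then ask whether the single
# last name from the second data frame equals one of those parts exactly.
parts <- strsplit(full$last_name.x, split = " ", fixed = TRUE)
whole_word <- mapply(function(p, y) y %in% p, parts, full$last_name.y,
                     USE.NAMES = FALSE)

matches_exact <- full[whole_word, ]
matches_exact
```

With this data, the "Dana" row is kept by a substring test but excluded by the whole-word test, while the five rows from the question still match.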

If you prefer dplyr, you can do it like this:

library(dplyr)
full_join(names_df1, names_df2, by = c("location", "first_name")) %>% 
  filter(str_detect(string = last_name.x, pattern = fixed(last_name.y)))
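
If an explicit nested loop is still wanted, the steps described in the question could be sketched roughly as below. This is only one possible reading of those steps: it compares the pasted keys with %in% (exact equality) instead of grep(), which is an assumption about the intended behavior, and keys_df2, last_name_parts and matched are names introduced here for illustration.

```r
# Test data from the question.
names_df1 <- data.frame(
  location   = c(1530, 6801, 1530, 6801, 1967),
  first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
  last_name  = c("Williams", "Johnson Clarke", "Lopez Gutierrez", "Mar", "Yoon"),
  stringsAsFactors = FALSE
)
names_df2 <- data.frame(
  location   = c(1530, 6801, 1530, 6801, 1967),
  first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
  last_name  = c("Williams", "Clarke", "Lopez", "Mar", "Yoon"),
  stringsAsFactors = FALSE
)

# Search keys for the second data frame: "location first_name last_name".
keys_df2 <- paste(names_df2$location, names_df2$first_name, names_df2$last_name)

# One character vector of last-name parts per row of names_df1.
last_name_parts <- strsplit(names_df1$last_name, split = " ", fixed = TRUE)

matched <- logical(nrow(names_df1))        # becomes TRUE once row i has a match
for (i in seq_len(nrow(names_df1))) {      # outer loop: rows of names_df1
  for (part in last_name_parts[[i]]) {     # inner loop: each last name of row i
    key <- paste(names_df1$location[i], names_df1$first_name[i], part)
    if (key %in% keys_df2) {               # exact comparison, no regex involved
      matched[i] <- TRUE
      break                                # stop at the first last name that matches
    }
  }
}

names_df1[matched, ]
```

Building each key once per sub-element inside the inner loop avoids the duplicated paste() calls mentioned in the question, and break moves on to the next row as soon as one of the last names matches.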
