How to construct a nested for loop in R

Time: 2018-12-18 01:14:36

Tags: r loops grep nested-loops

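I am using R to match names across two different data sets, which means comparing strings. I have two data frames of strings; both contain a location ID (not unique) and each person's full name. In the first data frame, the full name of some people includes two last names. The second data frame has the same location codes (again not unique), but its last name may be only one of the two, and which of the two appears is random.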

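What I want to do is run grep() row by row over the first data frame and get the matching search results from the second data frame. My approach is as follows: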

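  1. Use paste() to join the location ID and the first name. This helps with matching, but I still need to match the last name (which can be either of the two last names). Call this new vector location_first.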

  2. Use strsplit() on the last name column. Some elements of the resulting list contain a single item, while others (people with two last names) contain two. Call this list strsplit_ln.

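  3. Do a second paste inside a loop: paste the first element of strsplit_ln onto location_first and grep for it, then move on to the next element of strsplit_ln and grep for that. I would like the full grep search results printed either to the console or to a text file.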

This is the process I would like to carry out, step by step, as a loop (or nested loop):

# prepare the test data
names_df1 = data.frame(location = c(1530, 6801, 1530, 6801, 1967),
                       first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
                       last_name = c("Williams", "Johnson Clarke", "Lopez Gutierrez",  "Mar", "Yoon"), stringsAsFactors = FALSE)

names_df2 = data.frame(location = c(1530, 6801, 1530, 6801, 1967),
                       first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
                       last_name = c("Williams", "Clarke", "Lopez", "Mar", "Yoon"), stringsAsFactors = FALSE)


# Step 1: paste id and first name. Location ID and First Name are identical in both data frames. I will paste the last name in the second step. 
location_name_df1 = paste(names_df1$location, names_df1$first_name)
location_name_df2 = paste(names_df2$location, names_df2$first_name, names_df2$last_name)


# Step 2: string split the last names in df1. I want a loop to go through each element and subelement of this list. 
last_name_strsplit = strsplit(names_df1$last_name, split = " ")


          # these are what I would be searching. Note that in the loop, I go search through each sub element v of the ith element in the list.
          # paste(location_name_df1[i], last_name_strsplit[[i]][v])
          paste(location_name_df1[1], last_name_strsplit[[1]][1])

          paste(location_name_df1[2], last_name_strsplit[[2]][1])
          paste(location_name_df1[2], last_name_strsplit[[2]][2])

          paste(location_name_df1[3], last_name_strsplit[[3]][1])
          paste(location_name_df1[3], last_name_strsplit[[3]][2])

          paste(location_name_df1[4], last_name_strsplit[[4]][1])

          paste(location_name_df1[5], last_name_strsplit[[5]][1])


    # this is the actual search I would like to do. I paste the location_name_df1 with the last names in last_name_strsplit, going through each element (i), as well as each sub element (v)
    names_df1[grep(paste(location_name_df1[1], last_name_strsplit[[1]][1]),location_name_df2),] # search result successful

    names_df1[grep(paste(location_name_df1[2], last_name_strsplit[[2]][1]),location_name_df2),] # search result NOT successful. Note that this part of the list has two elements. Loop should jump to the second sub element of last_name_strsplit
    names_df1[grep(paste(location_name_df1[2], last_name_strsplit[[2]][2]),location_name_df2),] # This search result was successful

    names_df1[grep(paste(location_name_df1[3], last_name_strsplit[[3]][1]),location_name_df2),] # search result successful
    names_df1[grep(paste(location_name_df1[3], last_name_strsplit[[3]][2]),location_name_df2),] # search result NOT successful. Note that this part of the list has two elements. End of sub elements, move on to the next row

    names_df1[grep(paste(location_name_df1[4], last_name_strsplit[[4]][1]),location_name_df2),] # search result successful

    names_df1[grep(paste(location_name_df1[5], last_name_strsplit[[5]][1]),location_name_df2),] # search result successful

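I am fairly sure I need a nested loop structure in which I go through each element of the list (i) and then each sub-element (v). But whenever I write the nested loop, I end up with a large number of duplicated pasted strings, and the search itself goes wrong.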

Could someone give me some pointers on how to build a loop structure that follows the steps above? Again, I am using R / RStudio to match the data.

Thanks!

1 Answer:

Answer 0 (score: 1)

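Here is a simpler way. First we join the two data frames on both location and first name (merge() performs an inner join by default, which is enough here because every key appears in both data frames). Then we use stringr::str_detect (which, unlike grep, is vectorized over both the string and the pattern) to keep only the rows where the last name from the second data frame is one of the last names from the first: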

full = merge(names_df1, names_df2, by = c("location", "first_name"))

library(stringr)
matches = full[str_detect(string = full$last_name.x, pattern = fixed(full$last_name.y)), ]
matches           
#   location first_name     last_name.x last_name.y
# 1     1530       Axel        Williams    Williams
# 2     1530     Carlos Lopez Gutierrez       Lopez
# 3     1967       Jong            Yoon        Yoon
# 4     6801       Bill  Johnson Clarke      Clarke
# 5     6801     Flavio             Mar         Mar
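If stringr is not available, the same row-wise test can be sketched in base R. grepl() only uses the first element of a pattern vector (and warns about it), so the sketch below pairs each pattern with its own string through mapply(). The test data are repeated from the question so the block runs on its own; the names is_match and matches_base are mine, not part of the original answer.

```r
# Test data repeated from the question so this block is self-contained
names_df1 <- data.frame(location = c(1530, 6801, 1530, 6801, 1967),
                        first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
                        last_name = c("Williams", "Johnson Clarke", "Lopez Gutierrez", "Mar", "Yoon"),
                        stringsAsFactors = FALSE)
names_df2 <- data.frame(location = c(1530, 6801, 1530, 6801, 1967),
                        first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
                        last_name = c("Williams", "Clarke", "Lopez", "Mar", "Yoon"),
                        stringsAsFactors = FALSE)

# Join on location and first name (inner join, the merge() default)
full <- merge(names_df1, names_df2, by = c("location", "first_name"))

# For row i, test whether last_name.y[i] occurs literally inside last_name.x[i].
# mapply() passes the i-th pattern and the i-th string to grepl() together.
is_match <- mapply(grepl, full$last_name.y, full$last_name.x,
                   MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

matches_base <- full[is_match, ]
matches_base
```

Both this sketch and the str_detect version test for a literal substring, so a short surname such as "Mar" would also match a longer one such as "Martinez". If that matters for your data, compare against the split surnames instead of the whole string.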

If you prefer dplyr, you can do it like this:

library(dplyr)
full_join(names_df1, names_df2, by = c("location", "first_name")) %>% 
  filter(str_detect(string = last_name.x, pattern = fixed(last_name.y)))
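If you do want the explicit nested loop described in the question, a minimal sketch might look like the following. It assumes that one hit per person is enough, so the inner loop stops at the first surname that matches, and it records the row of names_df1 rather than the position returned by grep() (which indexes names_df2). The names key, hits, matched_rows and result are illustrative, not from the question.

```r
# Test data repeated from the question so this block is self-contained
names_df1 <- data.frame(location = c(1530, 6801, 1530, 6801, 1967),
                        first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
                        last_name = c("Williams", "Johnson Clarke", "Lopez Gutierrez", "Mar", "Yoon"),
                        stringsAsFactors = FALSE)
names_df2 <- data.frame(location = c(1530, 6801, 1530, 6801, 1967),
                        first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
                        last_name = c("Williams", "Clarke", "Lopez", "Mar", "Yoon"),
                        stringsAsFactors = FALSE)

# Steps 1 and 2 from the question
location_name_df1 <- paste(names_df1$location, names_df1$first_name)
location_name_df2 <- paste(names_df2$location, names_df2$first_name, names_df2$last_name)
last_name_strsplit <- strsplit(names_df1$last_name, split = " ")

matched_rows <- integer(0)  # rows of names_df1 that have a hit in names_df2

for (i in seq_along(last_name_strsplit)) {         # each person in names_df1
  for (v in seq_along(last_name_strsplit[[i]])) {  # each of that person's surnames
    key <- paste(location_name_df1[i], last_name_strsplit[[i]][v])
    hits <- grep(key, location_name_df2, fixed = TRUE)  # positions in names_df2
    if (length(hits) > 0) {
      matched_rows <- c(matched_rows, i)
      break  # one surname matched, so move on to the next person
    }
  }
}

result <- names_df1[matched_rows, ]
result
```

Using seq_along() for both loops means each person contributes exactly as many search strings as they have surnames, which avoids the duplicated pastes mentioned in the question. To send the result to a text file instead of the console, write.table(result, file = "matches.txt") would work, where the file name is only a placeholder.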