Question

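I have two data frames containing information about companies, and I need to match the companies in the first data frame with those in the second. Unfortunately, a 100% automatic algorithm is not possible, because the two tables share no common ID and the company names are not always identical. So I thought of writing an algorithm that suggests similar company names (via the stringdist package) but lets a person do the final check, comparing all variables and choosing the best match.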
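To make the idea of suggesting similar names concrete, here is a minimal sketch of ranking candidates by string distance. The company names are hypothetical and only for illustration; it assumes the stringdist package is installed and uses its default method ("osa").

```r
library(stringdist)  # assumed to be installed

# Hypothetical names, not taken from the real data
target <- "Acme Corp"
candidates <- c("ACME Corporation", "Acme Corp.", "Zenith Ltd")

# Lower-case both sides so that case differences do not count as edits
d <- stringdist(tolower(target), tolower(candidates))

# Smallest distance first: the top rows are the suggestions a person would review
ranked <- data.frame(candidate = candidates, dist = d)[order(d), ]
print(ranked)
```

A smaller distance means a closer name, so the reviewer only has to look at the first few rows rather than the whole table.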

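Below is a small reproducible example. The solution in this example would work for me if the datasets were as simple as the ones shown here. Unfortunately, the real datasets are too large to print the table in the console. So I would like to know how to make R open a dialog box that displays the table I am currently printing to the console, and in which I can enter the best match.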

Thank you very much for your help.

library(dplyr) # provides %>%, mutate() and arrange() used in the loop

# create df 1 to be matched
df_to_be_matched <- data.frame(name = c("name_130", "name_90"), year = c(2011, 2012), val = rnorm(2))

# create df with options to search for the best match
df_options <- data.frame(ID = sample(c("A", "B", "C"), 10, replace = TRUE),
                         year = sample(c(2010, 2011, 2012), 10, replace = TRUE),
                         name = paste0("name_", sample(1:200, 10)),
                         value = rnorm(10, 10, 2))

# loop to choose the best match
for (i in seq_len(nrow(df_to_be_matched))) {
  cat("-------------------------------------------\n")
  mtch <- df_to_be_matched[i, ]

  aux_options <- df_options %>%
    mutate(compar = stringdist::stringdist(mtch$name, name)) %>% # distance to the name being matched
    arrange(compar) %>% # closest candidates first
    mutate(idx = row_number()) # index the user types to pick a row

  print(aux_options) # print table with options on the console
  cat("\n")
  cat(paste0("Type the index of the best match for: Company = ", mtch$name, "; year = ", mtch$year,
             "; value = ", mtch$val, ". Then press [enter]: "))

  # non-numeric input becomes NA and falls through to "No match"
  line <- suppressWarnings(as.numeric(readline()))

  if (line %in% aux_options$idx) {
    cat("\nMatch is:\n\n")
    print(aux_options[line, ])
    cat("\n\n")
  } else {
    cat("No match\n")
  }
}
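One possible direction for the dialog box, offered as a sketch rather than a tested solution: base R's `utils::select.list()` can show a list of choices in a graphical dialog when `graphics = TRUE` (on platforms where a GUI toolkit is available; otherwise it falls back to a text menu). The helpers `make_labels()` and `pick_match()` below are hypothetical names introduced only for this illustration, and they assume `aux_options` has the columns `idx`, `name`, `year` and `compar` built in the loop above.

```r
# Hypothetical helper: one display line per candidate row
make_labels <- function(opts) {
  paste0(opts$idx, " | ", opts$name, " | year = ", opts$year,
         " | dist = ", opts$compar)
}

# Hypothetical helper: show the candidates in a list dialog and
# return the chosen row, or NULL if nothing was chosen
pick_match <- function(opts, title = "Select the best match") {
  labels <- c(make_labels(opts), "No match")
  # select.list() returns "" when the dialog is cancelled
  choice <- utils::select.list(labels, title = title, graphics = TRUE)
  if (choice %in% c("", "No match")) return(NULL)
  opts[match(choice, labels), ]
}
```

Inside the loop, `picked <- pick_match(aux_options)` would replace the `readline()` step; this only works in an interactive session. If the full table needs to be inspected as well, `utils::View(aux_options)` opens it in a spreadsheet-style viewer before the dialog is shown.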

Print a table in a dialog box and read input

0 answers: