Finding the best string match between two tables in R

Time: 2020-08-03 15:34:51

Tags: r dplyr

I have two data frames, df_1 and df_2. For each key in df_1, I want to find the Form_2 value in df_2 that best matches its Form_1.

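The rules are:

  1. If Form_1 exists in df_2, take the exact match, e.g. key = B, Form_1 = tablet, Form_2 = tablet.

  2. Otherwise, take the match with the fewest words, e.g. key = D, Form_1 = patch,ER and Form_2 = patch. That is the shortest match for patch,ER by word count.

  3. If Form_1 has more than one such match, keep all of them, e.g. key = G has two matches in Form_2 of df_2.

  4. Finally, if there is no match, default to NA.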

    df_2=data.frame(Form_2=c("suspension","for suspension","tablet","tablet,tablet","patch","patch,IR","tablet,ER","Injection","Injection,Solution","liquid"))
    
    
    df_1=data.frame(
      key=c("A","B","C","D","E","F","G","H"),
      Form_1=c("suspension","tablet","tablet,ER","patch,ER","tablet","Injection,Solution","liquid Injection",'see attachment'))

My output should be:

    df_out=data.frame(
      key=c("A","B","C","D","E","F","G","G","H"),
      Form_1=c("suspension","tablet","tablet,ER","patch,ER","tablet","Injection,Solution","liquid Injection","liquid Injection",'see attachment'),
      Form_2=c("suspension","tablet","tablet,ER","patch","tablet","Injection,Solution","liquid","Injection",NA)
    )
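
For reference, here is a minimal base R sketch of one possible reading of the rules above. It is not a definitive implementation and rests on several assumptions: a Form_2 value "matches" when every one of its comma- or whitespace-separated words also appears in Form_1, "shortest" means fewest words with ties kept, and the comparison is case-sensitive. The helper names `tokens`, `match_one` and `best_match` are hypothetical.

```r
# Hypothetical helper: split a form string into unique words on commas/whitespace.
tokens <- function(x) unique(strsplit(trimws(x), "[,[:space:]]+")[[1]])

# Hypothetical helper: best Form_2 value(s) for a single Form_1 string.
match_one <- function(form_1, candidates) {
  # Rule 1: an exact match wins outright.
  if (form_1 %in% candidates) return(form_1)
  words <- tokens(form_1)
  cand_tokens <- lapply(candidates, tokens)
  # Assumed meaning of "match": every word of Form_2 occurs in Form_1.
  ok <- vapply(cand_tokens,
               function(tk) length(tk) > 0 && all(tk %in% words),
               logical(1))
  # Rule 4: nothing matched.
  if (!any(ok)) return(NA_character_)
  # Rules 2 and 3: keep the matches with the fewest words, including ties.
  n <- lengths(cand_tokens[ok])
  candidates[ok][n == min(n)]
}

# One output row per (key, matched Form_2) pair; unmatched keys get NA.
best_match <- function(df_1, df_2) {
  candidates <- unique(as.character(df_2$Form_2))
  hits <- lapply(as.character(df_1$Form_1), match_one, candidates = candidates)
  reps <- lengths(hits)
  data.frame(
    key    = rep(as.character(df_1$key), reps),
    Form_1 = rep(as.character(df_1$Form_1), reps),
    Form_2 = unlist(hits),
    stringsAsFactors = FALSE
  )
}
```

Calling `best_match(df_1, df_2)` on the data above should give one row per key, plus a second row for key G. Tied matches come back in the order they appear in df_2, and values are returned exactly as spelled in df_2.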

0 answers:

No answers