R: Fuzzy logic name matching

Asked: 2015-04-10 02:40:46

Tags: r fuzzy-logic

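I have been working with a large data set of customer names. Each name has to be checked against a master file of correct names (300 KB), and if there is a match, the master file name is appended to the customer file as a new column value. The approach from my previous question worked for small data sets.

The customer file and the customer master file have both been cleaned with tm. I tried different logic, but it only works on smaller data sets and breaks down when applied to the large files. Pattern matching does not help here, because in my view hardly any of the names follow an exact pattern.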

Customer file

1           chang chun petrochemical  
2                chang chun plastics  
3                     church  dwight  
4        citrix systems asia pacific  
5          cnh industrial services srl
6                   conoco phillips   
7                    conocophillips   
8                  dfk laurence varnay
9                       dtz worldwide 
10  electro motive maintenance operati
11                enterasys networks  
12                   esso  resources  
13                          expedia   
14                            expedia 
15        exponential interactive aust
16        exxonmobil asia pacific pte 
17    exxonmobil chemical asia pac div
18                     exxonmobil png 
19         formula  world championship
20      fortitech asia pacific sdn bhd

Master file

1                     chang chun group
2                     church  dwight  
3        citrix systems asia pacific  
4                    cnh industrial nv
5                      conoco phillips
6                  dfk laurence varnay
7                  dtz group  zealand 
8                         caterpillar 
9                 enterasys networks  
10                   exxon mobil group
11                       expedia group
12        exponential interactive aust
13         formula  world championship
14      fortitech asia pacific sdn bhd
15                frhi hotels  resorts
16          gardner denver industries 
17  glencore xstrata international plc
18                            grace   
19                       incomm   nz  
20              information resources 
21                    kbr holdings llc
22                       kennametal   
23                            komatsu 
24     leonhard hofstetter pelzdesign 
25          communications corporation
26              manhattan associates  
27                             mattel 
28                        mmg finance 
29                     nokia oyj group
30                           nortek  

I tried this simple loop:

for (i in 1:100){
  result$x[i] = agrep(result$ICIS_Cust_Names[i], result1$Master_Names, value = TRUE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
  #result$Y[i] = agrep(result$ICIS_Cust_Names[i], result1$Master_Names, value = FALSE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
}

*Result*

1           chang chun petrochemical                             <NA> NA
2                chang chun plastics                             <NA> NA
3                     church  dwight                 church  dwight    2
4        citrix systems asia pacific    citrix systems asia pacific    3
5          cnh industrial services srl                           <NA> NA
6                   conoco phillips                  church  dwight    2
7                    conocophillips                              <NA> NA
8                  dfk laurence varnay                           <NA> NA
9                       dtz worldwide                church  dwight    2
10  electro motive maintenance operati                           <NA> NA
11                enterasys networks                             <NA> NA
12                   esso  resources                 church  dwight    2
13                          expedia                              <NA> NA
14                            expedia                            <NA> NA
15        exponential interactive aust               church  dwight    2
16        exxonmobil asia pacific pte                            <NA> NA
17    exxonmobil chemical asia pac div                           <NA> NA
18                     exxonmobil png                church  dwight    2
19         formula  world championship                           <NA> NA
20      fortitech asia pacific sdn bhd 
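
A note on the loop above: with `value = TRUE`, `agrep()` returns every element that matches, so for a single customer name it can return zero, one, or several master names. Assigning that to the single cell `result$x[i]` then fails or warns. The following is only a minimal sketch of one way to guard against that, using base R. The vectors `cust` and `master`, the helper `first_match`, and the `max_dist` value are all hypothetical stand-ins, not taken from the real files.

```r
# Hypothetical sample data standing in for the customer and master files
cust   <- c("church  dwight", "conocophillips", "expedia")
master <- c("church  dwight", "conoco phillips", "expedia group", "caterpillar")

# Hypothetical helper: return the first approximate match,
# or NA when agrep() finds nothing for this pattern
first_match <- function(pattern, candidates, max_dist = 0.2) {
  hits <- agrep(pattern, candidates, max.distance = max_dist, value = TRUE)
  if (length(hits) == 0) NA_character_ else hits[1]
}

# vapply() guarantees exactly one character value per customer name
matched <- vapply(cust, first_match, character(1),
                  candidates = master, USE.NAMES = FALSE)

data.frame(cust, matched)
```

Taking the first hit is an arbitrary choice made for the sketch. `agrep()` does not rank its hits by closeness, so a distance-based method is probably a better fit when several master names match.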

I also tried using lapply, but it did not help. As you can see, my master file is large, and sometimes I get a row length mismatch error.

mm<-dt[lapply(result, function(x) levenshteinDist(x ,lapply(result1, function(x) x)))]

#using looping stat. for checking each cus name with all the master names
for(i in seq(nrow(result)) )
    {
      if((levenshteindist(result[i],lapply(result1, function(x) String(x))))==0)
        sprintf("%s", x)
    }

Which approach is best suited to this? I have looked at several questions on Stack Overflow that are similar to mine, but they were not much help.

My attempt may be naive, and it performs poorly when applied to large data sets. Could anyone familiar with R correct my levenshteinDist code above?

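# Check each customer name against every value in the master file; if the
# similarity is more than 0.90, return the master value.
# jarowinkler() is assumed to come from the RecordLinkage package.
library(RecordLinkage)

gr1$res <- NA_character_   # create the result column before filling it
for (i in seq_len(nrow(gr1)))
{
  for (j in seq_len(nrow(gr2)))
  {
     score <- jarowinkler(gr1$ICIS_Cust_Names[i], gr2$Master_Names[j])
     if (score > .90)
         gr1$res[i] <- gr2$Master_Names[j]
  }
}
# Please let me know if there is any small error in this code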

If anyone has worked with this kind of data in R, please help!

1 answer:

Answer 0: (score: 0)

This gets partial results.

Code:

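# adist() gives edit distances; max.col() on the negated matrix picks the closest master name per row
df$result <- data.frame(df$Cust_Names, df$Master_Names[max.col(-adist(df$Cust_Names, df$Master_Names))])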
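
The one-liner above always returns some master name, even when nothing is close, which may be why the results are only partial. The following is a hedged sketch of the same `adist()` idea with a cutoff added, so that poor matches become `NA`. The vectors `cust` and `master` and the `cutoff` value are hypothetical. Dividing by the longer string's length is one possible normalisation, not something the answer specifies.

```r
# Hypothetical sample data standing in for the customer and master files
cust   <- c("conocophillips", "expedia", "exxonmobil png", "dtz worldwide")
master <- c("conoco phillips", "expedia group", "exxon mobil group", "caterpillar")

# Edit-distance matrix: one row per customer name, one column per master name
d <- adist(cust, master)

# Normalise by the longer of the two strings so every score lies in [0, 1]
len <- outer(nchar(cust), nchar(master), pmax)
nd  <- d / len

# Column of the smallest normalised distance per row; ties.method = "first"
# makes the choice reproducible (the max.col() default is "random")
best  <- max.col(-nd, ties.method = "first")
score <- nd[cbind(seq_along(cust), best)]

cutoff  <- 0.4  # assumed threshold, to be tuned on real data
matched <- ifelse(score <= cutoff, master[best], NA_character_)

data.frame(cust, matched, score)
```

Two caveats apply to this sketch. Whole-string edit distance penalises added suffixes: "expedia" against "expedia group" needs 6 insertions over 13 characters, so the cutoff has to be chosen with such cases in mind. Also, `adist()` builds the full customer-by-master matrix in memory, so for a large customer file it is probably safer to process the customer names in chunks.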