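I have been working with large data sets of customer names. Each name has to be checked against a master file of correct names (300 KB), and if it matches, the master name should be appended to the customer file as a new column value. My previous question worked for small data sets.
The customer and master files have been cleaned using tm. I have tried different logic, but it only works on smaller data sets and is ineffective on large files. In my opinion, pattern matching does not help here, because none of the names follow an exact pattern.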
Customer file
1 chang chun petrochemical
2 chang chun plastics
3 church dwight
4 citrix systems asia pacific
5 cnh industrial services srl
6 conoco phillips
7 conocophillips
8 dfk laurence varnay
9 dtz worldwide
10 electro motive maintenance operati
11 enterasys networks
12 esso resources
13 expedia
14 expedia
15 exponential interactive aust
16 exxonmobil asia pacific pte
17 exxonmobil chemical asia pac div
18 exxonmobil png
19 formula world championship
20 fortitech asia pacific sdn bhd
Master file
1 chang chun group
2 church dwight
3 citrix systems asia pacific
4 cnh industrial nv
5 conoco phillips
6 dfk laurence varnay
7 dtz group zealand
8 caterpillar
9 enterasys networks
10 exxon mobil group
11 expedia group
12 exponential interactive aust
13 formula world championship
14 fortitech asia pacific sdn bhd
15 frhi hotels resorts
16 gardner denver industries
17 glencore xstrata international plc
18 grace
19 incomm nz
20 information resources
21 kbr holdings llc
22 kennametal
23 komatsu
24 leonhard hofstetter pelzdesign
25 communications corporation
26 manhattan associates
27 mattel
28 mmg finance
29 nokia oyj group
30 nortek
I tried this simple loop:
for (i in 1:100){
result$x[i] = agrep(result$ICIS_Cust_Names[i], result1$Master_Names, value = TRUE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
#result$Y[i] = agrep(result$ICIS_Cust_Names[i], result1$Master_Names, value = FALSE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
}
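For reference, `agrep` can return zero matches or several matches for one name, which makes a direct assignment to `result$x[i]` fragile. The following is only a minimal sketch of one way to guard against that; the helper `first_match`, the toy vectors, and the `max_dist` value of 0.1 are assumptions for illustration, not part of the original data or code.

```r
# Toy vectors built from a few names in the listings above (illustrative only).
cust   <- c("church dwight", "conocophillips", "dtz worldwide")
master <- c("chang chun group", "church dwight", "conoco phillips",
            "dtz group zealand")

# Hypothetical helper: return the first approximate match, or NA if none.
first_match <- function(name, candidates, max_dist = 0.1) {
  hits <- agrep(name, candidates, max.distance = max_dist, value = TRUE)
  if (length(hits) == 0) NA_character_ else hits[1]
}

# vapply guarantees exactly one character value per customer name.
matched <- vapply(cust, first_match, character(1),
                  candidates = master, USE.NAMES = FALSE)

print(data.frame(cust, matched, stringsAsFactors = FALSE))
```

Taking only the first hit is a simplification; when several master names match, ranking them by distance is probably a better choice.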
*Result*
1 chang chun petrochemical <NA> NA
2 chang chun plastics <NA> NA
3 church dwight church dwight 2
4 citrix systems asia pacific citrix systems asia pacific 3
5 cnh industrial services srl <NA> NA
6 conoco phillips church dwight 2
7 conocophillips <NA> NA
8 dfk laurence varnay <NA> NA
9 dtz worldwide church dwight 2
10 electro motive maintenance operati <NA> NA
11 enterasys networks <NA> NA
12 esso resources church dwight 2
13 expedia <NA> NA
14 expedia <NA> NA
15 exponential interactive aust church dwight 2
16 exxonmobil asia pacific pte <NA> NA
17 exxonmobil chemical asia pac div <NA> NA
18 exxonmobil png church dwight 2
19 formula world championship <NA> NA
20 fortitech asia pacific sdn bhd
I also tried lapply, but it did not work. As you can see, my master file is large, and I sometimes get a row-length mismatch error!
mm<-dt[lapply(result, function(x) levenshteinDist(x ,lapply(result1, function(x) x)))]
#using looping stat. for checking each cus name with all the master names
for(i in seq(nrow(result)) )
{
if((levenshteindist(result[i],lapply(result1, function(x) String(x))))==0)
sprintf("%s", x)
}
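Which method is best suited for this? I have referred to a few similar questions on Stack Overflow, but they were not much help.
The approach may be naive, and it performs poorly on large data sets. Can anyone familiar with R help with using levenshteinDist as above?
Code: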
#check with each value of master file and if matches more than .90 then return master value.
for(i in seq(1:nrow(gr1))
{
for(j in seq(1:nrow(gr2))
{
gr1$jar[i,j]<-jarowinkler(gr1$ICIS_Cust_Names[i],gr2$Master_Names[j])
if(gr1$jar[i,j]>.90)
gr1$res[i] = gr2$Master_Names[j]
}
}
#Please let know if there is any minute error with this code
If anyone has worked with this kind of data in R, please help!
Answer 0 (score: 0)
I got partial results with the following.
Code:
df$result<-data.frame(df$Cust_Names, df$Master_Names[max.col(-adist(df$Cust_Names,df$Master_Names))])
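The line above always returns the nearest master name, even when that name is a poor match. A minimal sketch of one possible extension follows, using only base R: it normalises the `adist` distances by string length and leaves weak matches as `NA`. The toy vectors, the `cutoff` of 0.4, and the variable names are assumptions for illustration, not values from the original data.

```r
# Toy data taken from the question's listings (a small subset).
cust   <- c("chang chun petrochemical", "church dwight", "conocophillips",
            "expedia", "exxonmobil png", "esso resources")
master <- c("chang chun group", "church dwight", "conoco phillips",
            "exxon mobil group", "expedia group", "caterpillar")

# Customer-by-master matrix of edit distances, computed in one call.
d <- adist(cust, master)

# Normalise by the longer string's length so scores are comparable.
len <- outer(nchar(cust), nchar(master), pmax)
nd  <- d / len

# Index and score of the nearest master name for each customer.
best       <- max.col(-nd, ties.method = "first")
best_score <- nd[cbind(seq_along(cust), best)]

cutoff <- 0.4  # assumed threshold; tune it on real data
matched <- data.frame(
  Cust_Names   = cust,
  Master_Names = ifelse(best_score <= cutoff, master[best], NA_character_),
  score        = round(best_score, 2),
  stringsAsFactors = FALSE
)
print(matched)
```

With this cutoff, "expedia" stays unmatched because its closest candidate, "expedia group", differs by six inserted characters. A looser cutoff or a different similarity measure may suit such cases better. For large files the full distance matrix can get big, so processing the customer names in chunks is one option to consider.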