I have two data frames that were cleaned and merged into a single CSV file. The data looks like this:
**Source** | **Master**
chang chun petrochemical | CHANG CHUN GROUP
chang chun plastics | CHURCH AND DWIGHT CO INC
church dwight | CITRIX SYSTEMS ASIA PACIFIC P L
citrix systems pacific | CNH INDUSTRIAL N.V
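Starting from this, I have to take each Source name, check it against every name in Master, find the relevant match, and write the output to another data frame. The sample above is small, but I am working with 20k values.
My output must look like this: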
**Source** | **Master** | **Result**
chang chun petrochemical | CHANG CHUN GROUP | CHANG CHUN GROUP
chang chun plastics | CHURCH AND DWIGHT CO INC | CHANG CHUN GROUP
church dwight | CITRIX SYSTEMS ASIA PACIFIC P L | CHURCH AND DWIGHT CO INC
citrix systems pacific | CNH INDUSTRIAL N.V | CITRIX SYSTEMS ASIA PACIFIC P L
I tried the approaches from this link, Merging through fuzzy matching of variables in R, but no luck so far!
Thanks in advance!
When I use the code above on a large amount of data, this is the result:
Code used:
Mast <- pmatch(Names$I_sender_O_Receiver_Customer, Master.Names$MOD, nomatch=NA)
Output:
NA NA 2 3 NA NA NA 6 NA NA 9 NA NA NA 12 NA NA NA 13 14 15 16 NA 18 19 20 21 22 NA 24 NA 26 NA 28 NA NA NA 30 NA NA 33 NA 35 36 37 NA 39 40 NA NA 43 NA 45 46 NA 48 49 50 51 52 53 54 55 56 57 58 NA
[68] 60 61 62 NA NA NA NA 64 NA 66 67 68 69 70 71 72 73 NA 75 76 77 78 NA 79 80 81 NA 83 84 85 86 87 88
CODE:
Mast <- sapply(Names$I_sender_O_Receiver_Customer, function(x) {
agrep(x, Master.Names$MOD,value=TRUE) })
Output:
[[1]]
character(0)
[[2]]
character(0)
[[3]]
[1] " CHURCH AND DWIGHT CO INC"
[[4]]
[1] " CITRIX SYSTEMS ASIA PACIFIC P L"
[[5]]
character(0)
Even a for loop does not produce any result.
Code:
for(i in seq_len(nrow(df$ICIS_Cust_Names)))
{
df$reslt[i] <- grep(x = str_split(df$ICIS_Cust_Names[i]," ")[[1]][1], df$Master_Names[i],value=TRUE)
}
print(df$reslt)
Code 2: a loop over only 100 rows
for (i in 100){
gr1$x[i] = agrep(gr1$ICIS_Cust_Names[i], gr2$Master_Names, value = TRUE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
gr2$Y[i] = agrep(gr1$ICIS_Cust_Names[i], gr2$Master_Names, value = FALSE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
}
Result:
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Error:
Error in `$<-.data.frame`(`*tmp*`, "x", value = c(NA, NA, " church dwight " :
replacement has 3 rows, data has 100
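Looking at the results above, the code compares the two data frames row by row. What I want instead is to take the first element of Source, check it against all elements of Master, and come up with a match, and likewise for the rest. I would appreciate it if anyone could correct my code. Thanks in advance!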
Answer 0 (score: 1)
If you only want to check the first word of each name against Master.Names, this does the trick:
Names$Mast <- NA
for (i in seq_len(nrow(Names))) {
  first_word <- toupper(strsplit(Names[i, 1], " ")[[1]][1])
  # [1] keeps the first hit and yields NA when nothing matches
  Names$Mast[i] <- grep(first_word, Master.Names$V1, value = TRUE)[1]
}
Edit
Using sapply instead of a loop may gain some speed:
Names$Mast <- sapply(Names$V1, function(x) {
  # [1] keeps the first hit and yields NA when nothing matches
  grep(toupper(strsplit(x, " ")[[1]][1]), Master.Names$V1, value = TRUE)[1]
})
**Result**
> Names
                        V1                            Mast
1 chang chun petrochemical                CHANG CHUN GROUP
2      chang chun plastics                CHANG CHUN GROUP
3            church dwight        CHURCH AND DWIGHT CO INC
4   citrix systems pacific CITRIX SYSTEMS ASIA PACIFIC P L
**Data**
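# stringsAsFactors=FALSE keeps the names as character vectors (needed by strsplit on R < 4.0)
Master.Names <- read.csv(text="CHANG CHUN GROUP
CHURCH AND DWIGHT CO INC
CITRIX SYSTEMS ASIA PACIFIC P L
CNH INDUSTRIAL N.V", header=FALSE, stringsAsFactors=FALSE)
Names <- read.csv(text="chang chun petrochemical
chang chun plastics
church dwight
citrix systems pacific", header=FALSE, stringsAsFactors=FALSE)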
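If matching on the first word alone turns out to be too coarse, one possible alternative is to compare each source name against every master name and keep the closest one. The following is only a minimal sketch of that idea using base R's `adist` (generalized Levenshtein distance); it is not the approach from the answer above. The vectors `source_names` and `master_names` and the column names `Source`, `Result`, and `Distance` are hypothetical stand-ins for the question's data.

```r
# Hypothetical sample data mirroring the question's Source and Master columns
source_names <- c("chang chun petrochemical", "chang chun plastics",
                  "church dwight", "citrix systems pacific")
master_names <- c("CHANG CHUN GROUP", "CHURCH AND DWIGHT CO INC",
                  "CITRIX SYSTEMS ASIA PACIFIC P L", "CNH INDUSTRIAL N.V")

# Edit-distance matrix: one row per source name, one column per master name
d <- adist(source_names, master_names, ignore.case = TRUE)

# For each source name, index of the master name with the smallest distance
# (which.min returns the first index when there is a tie)
best <- apply(d, 1, which.min)

result <- data.frame(Source   = source_names,
                     Result   = master_names[best],
                     Distance = d[cbind(seq_along(best), best)],
                     stringsAsFactors = FALSE)
print(result)
```

Because every source name always gets some closest master name, it is probably worth inspecting the `Distance` column and treating large values as "no match". With 20k names the full distance matrix can become large, so processing the source names in chunks may be necessary.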