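I have two datasets. Dataset df1 has one column with the names of companies registered in our CRM and another column with the names of the sales managers. Dataset df2 has one column with the names of companies that attended an IT event.
Because df2 was typed in by hand by the participants, it contains typos, abbreviations, and so on. In other words, its names are similar, but not identical, to the company names registered in the CRM.
The goal is therefore to compare the company names from the event in df2 with the company names registered in df1, and to assign each match to its sales manager. Names that are not found, or whose closest match is too distant, should of course get NA for the sales manager.
I am new to R and have tried various approaches, with little success.
Could you help me build this script?
Here is an example: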
df1                                    df2
|----------------|----------------|    |----------------|
| Company        | Sales Manager  |    | Company Event  |
|----------------|----------------|    |----------------|
|Customer 1 SA   |Erik            |    |Customer 1      |
|Customer 2 S\A  |Selma           |    |Customer 1 SA.  |
|Customer 3 Ltda.|Juca            |    |Customer2       |
|Customer 4      |Batista         |    |cUSTOIMER 3     |
|----------------|----------------|    |Customer 10     |
                                       |----------------|
The expected final result is another df with the matched data.
matched df
|----------------|----------------|----------------|
| Company Event  | Company        | Sales Manager  |
|----------------|----------------|----------------|
|Customer 1      |Customer 1 SA   |Erik            |
|Customer 1 SA.  |Customer 1 SA   |Erik            |
|Customer2       |Customer 2 S\A  |Selma           |
|cUSTOIMER 3     |Customer 3 Ltda.|Juca            |
|Customer 10     |NA              |NA              |
|----------------|----------------|----------------|
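Answer 0: (score: -1)
The following should work. It involves cleaning the names, finding the closest CRM name by string distance, and then looking up the sales manager for that name.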
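Fuzzy string matching is, well, fuzzy, so you may get matches that are not what you expect, but after some tuning you should be satisfied (here, for instance, customer10 is matched to customer1, as the output below shows).
library(stringdist)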
# declare data ------------------------------------------------------------
Company <- c("Customer 1 SA", "Customer 2 S/A", "Customer 3 Ltda.", "Customer 4")
SalesManager <- c("Erik", "Selma", "Juca", "Batista")
CompanyEvent <- c("Customer 1", "Customer 1 SA.", "Customer2", "cUSTOIMER 3", "Customer 10")
df1 <- data.frame(Company, SalesManager, stringsAsFactors = FALSE)
df2 <- data.frame(CompanyEvent, stringsAsFactors = FALSE)
# clean 'dirty' names -----------------------------------------------------
# fixed = TRUE matches each pattern literally, so the "." in "Ltda." is not a regex wildcard
df1$cleannames <- gsub("S/A", "", df1$Company, fixed = TRUE)
df1$cleannames <- gsub("SA", "", df1$cleannames, fixed = TRUE)
df1$cleannames <- gsub("Ltda.", "", df1$cleannames, fixed = TRUE)
df1$cleannames <- gsub(" ", "", df1$cleannames, fixed = TRUE)
df1$cleannames <- tolower(df1$cleannames)
df2$cleannames <- gsub("S/A", "", df2$CompanyEvent, fixed = TRUE)
df2$cleannames <- gsub("SA", "", df2$cleannames, fixed = TRUE)
df2$cleannames <- gsub("Ltda.", "", df2$cleannames, fixed = TRUE)
df2$cleannames <- gsub(" ", "", df2$cleannames, fixed = TRUE)
df2$cleannames <- tolower(df2$cleannames)
# Get the closest matches and distances -----------------------------------
# stringdist() defaults to method = "osa" (optimal string alignment),
# a Levenshtein variant that also counts adjacent transpositions
df2$closestentry <- sapply(df2$cleannames, function(x) df1$cleannames[which.min(stringdist(x, df1$cleannames))], USE.NAMES = FALSE)
df2$levdistance <- sapply(df2$cleannames, function(x) min(stringdist(x, df1$cleannames)), USE.NAMES = FALSE)
# Get sales mgr data using closest matches
df2$salesmgr <- df1$SalesManager[match(df2$closestentry, df1$cleannames)]
df2
> df2
    CompanyEvent cleannames closestentry levdistance salesmgr
1     Customer 1  customer1    customer1           0     Erik
2 Customer 1 SA. customer1.    customer1           1     Erik
3      Customer2  customer2    customer2           0    Selma
4    cUSTOIMER 3 custoimer3    customer3           1     Juca
5    Customer 10 customer10    customer1           1     Erik
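The question asks for NA when the closest name is too far away, and the output above always assigns a manager. One possible tweak is a cutoff on the distance. The following is only a sketch: it uses base R's `adist()` (Levenshtein distance) instead of `stringdist()` so that it runs without extra packages, it starts from the already cleaned names, and `max_dist` and `assign_mgr` are hypothetical names that do not appear in the answer.

```r
# Sketch only: base R adist() stands in for stringdist();
# max_dist and assign_mgr are hypothetical, not part of the original answer.
clean  <- c("customer1", "customer2", "customer3", "customer4")  # cleaned df1 names
mgr    <- c("Erik", "Selma", "Juca", "Batista")
events <- c("customer1", "customer1.", "customer2", "custoimer3", "customer10")

d <- adist(events, clean)                        # rows: event names, columns: CRM names
best <- apply(d, 1, which.min)                   # index of the closest CRM name per event
best_dist <- d[cbind(seq_along(events), best)]   # distance to that closest name

# Keep the manager only when the closest name is within max_dist edits
assign_mgr <- function(max_dist) ifelse(best_dist <= max_dist, mgr[best], NA)

assign_mgr(1)
assign_mgr(0)
```

On this sample an absolute cutoff cannot separate the two interesting cases: `custoimer3` (a typo that should match) and `customer10` (a different company that should not) are both at distance 1 from their closest name. A cutoff of 1 keeps both and a cutoff of 0 drops both, so some extra rule, such as treating the digits separately, would probably be needed to get NA for `Customer 10`.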
Add to {{1 }}, for example.
The distance I am referring to here is the string distance; see customer10.
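To make "string distance" concrete, here is a small illustration of how many single-character edits separate a few of the names above. It is a sketch that uses base R's `adist()` rather than the `stringdist` package from the answer, and the variable names are made up for the example.

```r
# adist() ships with base R (utils) and computes the generalized Levenshtein distance.
# It is used here for illustration only; the answer's code uses stringdist().
d_extra_digit <- adist("customer10", "customer1")[1, 1]  # one deleted character
d_typo        <- adist("custoimer3", "customer3")[1, 1]  # one deleted character
d_case        <- adist("Customer", "customer")[1, 1]     # a case difference counts as a substitution
d_nocase      <- adist("Customer", "customer", ignore.case = TRUE)[1, 1]

c(extra_digit = d_extra_digit, typo = d_typo, case = d_case, no_case = d_nocase)
```

The case comparison shows why the answer lowercases the names before comparing them: without that step, every capitalization difference would add to the distance.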