Fuzzy comparison in R

Date: 2018-11-01 12:49:37

Tags: r dataframe data-mining data-science text-mining

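I have two datasets. Dataset df1 has one column with the names of the companies registered in our CRM and another column with the name of the sales manager. Dataset df2 has one column with the names of the companies that attended an IT event.

Because dataset df2 was typed in by hand by the participants, it contains typos, abbreviations and so on. In other words, its names are similar to, but not identical to, the company names registered in the CRM.

So the goal is to compare the names of the companies that attended the event (df2) with the company names registered in df1, and to assign each match its sales manager. Of course, names that are not found, or whose closest match is too far off, should get NA for the sales manager.

I am new to R and have tried various approaches with little success.

Could you help me build this script?

Here is an example: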

                 df1                                 df2  
    |----------------|----------------|       |----------------|
    |    Company     |  Sales Manager |       | Company Event  |
    |----------------|----------------|       |----------------|
    |Customer 1 SA   |Erik            |       |Customer 1      |
    |Customer 2 S\A  |Selma           |       |Customer 1 SA.  |
    |Customer 3 Ltda.|Juca            |       |Customer2       |
    |Customer 4      |Batista         |       |cUSTOIMER 3     |
    |----------------|----------------|       |Customer 10     |
                                              |----------------|

The expected final result is another df with the matched data:

                             matched df  
        |----------------|----------------|----------------|
        | Company Event  |    Company     | Sales Manager  |
        |----------------|----------------|----------------|
        |Customer 1      |Customer 1 SA   |Erik            |
        |Customer 1 SA.  |Customer 1 SA   |Erik            |
        |Customer2       |Customer 2 S\A  |Selma           |
        |cUSTOIMER 3     |Customer 3 Ltda.|Juca            |
        |Customer 10     |NA              |NA              |
        |----------------|----------------|----------------|

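1 Answer:

Answer 0 (score: -1)

The following should work. It involves cleaning the names, getting the shortest distance, and then getting the sales manager information.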

    library(stringdist)

    # declare data ------------------------------------------------------------
    Company      <- c("Customer 1 SA", "Customer 2 S/A", "Customer 3 Ltda.", "Customer 4")
    SalesManager <- c("Erik", "Selma", "Juca", "Batista")
    CompanyEvent <- c("Customer 1", "Customer 1 SA.", "Customer2", "cUSTOIMER 3", "Customer 10")

    df1 <- data.frame(Company, SalesManager, stringsAsFactors = FALSE)
    df2 <- data.frame(CompanyEvent, stringsAsFactors = FALSE)

    # clean 'dirty' names -----------------------------------------------------
    # fixed = TRUE makes "Ltda." a literal string, so "." is not a regex wildcard
    df1$cleannames <- gsub("S/A", "", df1$Company)
    df1$cleannames <- gsub("SA", "", df1$cleannames)
    df1$cleannames <- gsub("Ltda.", "", df1$cleannames, fixed = TRUE)
    df1$cleannames <- gsub(" ", "", df1$cleannames)
    df1$cleannames <- tolower(df1$cleannames)

    df2$cleannames <- gsub("S/A", "", df2$CompanyEvent)
    df2$cleannames <- gsub("SA", "", df2$cleannames)
    df2$cleannames <- gsub("Ltda.", "", df2$cleannames, fixed = TRUE)
    df2$cleannames <- gsub(" ", "", df2$cleannames)
    df2$cleannames <- tolower(df2$cleannames)

    # Get the closest matches and distances -----------------------------------
    df2$closestentry <- apply(df2, 1, function(x)
      df1$cleannames[which.min(stringdist(x["cleannames"], df1$cleannames))])
    df2$levdistance <- apply(df2, 1, function(x)
      min(stringdist(x["cleannames"], df1$cleannames)))

    # Get sales mgr data using closest matches
    df2$salesmgr <- df1$SalesManager[match(df2$closestentry, df1$cleannames)]
    df2

Output:

    > df2
        CompanyEvent cleannames closestentry levdistance salesmgr
    1     Customer 1  customer1    customer1           0     Erik
    2 Customer 1 SA. customer1.    customer1           1     Erik
    3      Customer2  customer2    customer2           0    Selma
    4    cUSTOIMER 3 custoimer3    customer3           1     Juca
    5    Customer 10 customer10    customer1           1     Erik

Fuzzy string matching is... well, fuzzy, so you may hit cases that are not what you expect, but with some tweaking you should get a satisfactory result (here, for example, `customer10` is assigned to `customer1`).
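One possible tweak, sketched under assumptions rather than taken from the answer: treat a match as missing when the distance exceeds a cutoff, and additionally require the digits in both names to agree. The cutoff `max_dist`, the helper `match_or_na`, and the digit rule are hypothetical choices tuned to this toy data. Base R's `adist` (Levenshtein distance) is used in place of `stringdist` only so the snippet runs without extra packages.

```r
# Cleaned names, as the cleaning step would produce them (assumed inputs)
crm_names   <- c("customer1", "customer2", "customer3", "customer4")
managers    <- c("Erik", "Selma", "Juca", "Batista")
event_names <- c("customer1", "customer1.", "customer2", "custoimer3", "customer10")

# Hypothetical helper: index of the closest CRM name, or NA when no candidate
# is within `max_dist` edits AND carries the same digits.
match_or_na <- function(name, candidates, max_dist = 1) {
  d <- drop(adist(name, candidates))                 # Levenshtein distances
  same_digits <- gsub("[^0-9]", "", candidates) == gsub("[^0-9]", "", name)
  ok <- which(d <= max_dist & same_digits)           # acceptable candidates
  if (length(ok) == 0) return(NA_integer_)
  ok[which.min(d[ok])]                               # closest acceptable one
}

idx <- vapply(event_names, match_or_na, integer(1),
              candidates = crm_names, USE.NAMES = FALSE)

result <- data.frame(event    = event_names,
                     company  = crm_names[idx],
                     salesmgr = managers[idx],
                     stringsAsFactors = FALSE)
print(result)
```

On this data the digit rule is what leaves `customer10` unmatched (NA), since its plain edit distance to `customer1` is the same as that of the typo `custoimer3` to `customer3`. Real company names would likely need a different rule.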

By distance I mean string distance here; see the documentation of `stringdist`.
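To make the notion concrete, here is a small illustration using base R's `adist`, which computes the Levenshtein (edit) distance. It is used here only so the snippet needs no extra packages; the variable names are illustrative, and `stringdist` offers several other distance measures through its `method` argument.

```r
# Edit distance: the number of single-character insertions, deletions
# or substitutions needed to turn one string into the other.
d_typo   <- drop(adist("custoimer3", "customer3"))  # one stray "i"
d_suffix <- drop(adist("customer10", "customer1"))  # one extra "0"
d_far    <- drop(adist("customer10", "customer4"))  # one substitution + one deletion

print(c(typo = d_typo, suffix = d_suffix, far = d_far))
```

The typo and the extra digit both come out at distance 1, which is why a distance cutoff on its own cannot tell a misspelling from a genuinely different company.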