Which R function to use for text auto-correction?

Asked: 2015-04-23 05:14:33

Tags: r dplyr levenshtein-distance tm agrep

I have a CSV document with two columns, Commodity Category and Commodity Name.

Ex:

Sl.No. Commodity Category Commodity Name
1      Stationary         Pencil
2      Stationary         Pen
3      Stationary         Marker
4      Office Utensils    Chair
5      Office Utensils    Drawer
6      Hardware           Monitor
7      Hardware           CPU

I also have another CSV file that contains various commodity names.

Ex:

Sl.No. Commodity Name
1      Pancil
2      Pencil-HB 02
3      Pencil-Apsara
4      Pancil-Nataraj
5      Pen-Parker
6      Pen-Reynolds
7      Monitor-X001RL

The output I would like standardises the commodity names and classifies them into their respective Commodity Categories, as shown below:

Sl.No. Commodity Name   Commodity Category
1      Pencil           Stationary
2      Pencil           Stationary
3      Pencil           Stationary
4      Pencil           Stationary
5      Pen              Stationary
6      Pen              Stationary
7      Monitor          Hardware

Step 1) I first have to use NLTK (text mining methods) to clean the data so as to separate "Pencil" from "Pencil-HB 02".

Step 2) After cleaning, I have to use an approximate string matching technique, i.e. agrep(), to match patterns such as "Pencil *" or to correct "Pancil" to "Pencil".

Step 3) Once the names are corrected, I have to categorise them. I have no idea how.

This is what I have thought of so far. I started with step 2 and I am stuck there; I cannot find an exact way to code this. Is there any way to get the required output? If so, please suggest a method I can proceed with.

2 answers:

Answer 0 (score: 3)

You can use the stringdist package. The correct function below corrects each Commodity.Name in file2 based on the distance between that item and the different entries in CName.

A join (left_join, or inner_join as in the transcript below) then connects the two tables.

I also noticed some misclassifications when using the default options of stringdistmatrix. You can try changing the weight argument of stringdistmatrix to get better corrections.

> library(dplyr)
> library(stringdist)
>
> file1 <- read.csv("/Users/Randy/Desktop/file1.csv")
> file2 <- read.csv("/Users/Randy/Desktop/file2.csv")
>
> head(file1)
  Sl.No. Commodity.Category Commodity.Name
1      1         Stationary         Pencil
2      2         Stationary            Pen
3      3         Stationary         Marker
4      4    Office Utensils          Chair
5      5    Office Utensils         Drawer
6      6           Hardware        Monitor
> head(file2)
  Sl.No. Commodity.Name
1      1         Pancil
2      2   Pencil-HB 02
3      3  Pencil-Apsara
4      4 Pancil-Nataraj
5      5     Pen-Parker
6      6   Pen-Reynolds
>
> CName <- levels(file1$Commodity.Name)
> correct <- function(x){
+     factor(sapply(x, function(z) CName[which.min(stringdistmatrix(z, CName, weight=c(1,0.1,1,1)))]), CName)
+ }
>
> correctedfile2 <- file2 %>%
+     transmute(Commodity.Name.Old = Commodity.Name, Commodity.Name = correct(Commodity.Name))
>
> correctedfile2 %>%
+     inner_join(file1[,-1], by="Commodity.Name")
  Commodity.Name.Old Commodity.Name Commodity.Category
1             Pancil         Pencil         Stationary
2       Pencil-HB 02         Pencil         Stationary
3      Pencil-Apsara         Pencil         Stationary
4     Pancil-Nataraj         Pencil         Stationary
5         Pen-Parker            Pen         Stationary
6       Pen-Reynolds            Pen         Stationary
7     Monitor-X001RL        Monitor           Hardware

If you need an "Others" category, you just need to play with the weights. I added a row "Diesel" to file2, then computed the score using stringdist with custom weights (you should try varying the values). If the score is greater than 2 (this value depends on how the weights are assigned), nothing is corrected.

PS: Since we don't know all possible labels in advance, we have to use as.character to convert the factor to character.

PS2: I also use tolower for case-insensitive scoring.
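The code for the "Others" variant is not shown in this copy of the answer, so the following is only a rough sketch of how the description could be implemented, not the answerer's original code. The helper name correct_or_keep and its arguments are hypothetical; the weights and the cut-off of 2 are taken from the text and should be treated as values to tune. It uses a plain character vector for the reference names rather than factor levels, since read.csv only returns factors by default in R versions before 4.0.0.

```r
library(stringdist)

# Reference names (copied from file1 in the question)
CName <- c("Pencil", "Pen", "Marker", "Chair", "Drawer", "Monitor", "CPU")

# Hypothetical helper: correct a name only when its best score is small enough.
# weight and max_score follow the values mentioned in the answer; they are
# starting points to experiment with, not recommendations.
correct_or_keep <- function(x, choices = CName, max_score = 2,
                            weight = c(1, 0.1, 1, 1)) {
  x <- as.character(x)  # factors become plain character strings
  vapply(x, function(z) {
    # Case-insensitive scoring: compare lower-cased versions of both sides
    scores <- stringdist(tolower(z), tolower(choices), weight = weight)
    if (min(scores) > max_score) z else choices[which.min(scores)]
  }, character(1), USE.NAMES = FALSE)
}

correct_or_keep(c("pencil", "Pancil", "Diesel"))
```

With these weights, "Diesel" should score above 2 against every reference name and so be returned unchanged. After a left_join with file1 such rows have no category, and that missing value can then be relabelled as "Others".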

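Answer 1 (score: 0)

There is an "approximate string matching" function, amatch(), in {stringdist} (at least in version 0.9.4.6) that returns the most probable match from a predefined set of words. It has a maxDist parameter that sets the maximum distance allowed for a match, and a nomatch parameter that can be used for an "other" category. Otherwise, the method, weights, etc. can be set in the same way as for stringdistmatrix().

So your original problem can be solved with a tidyverse-compatible solution like this: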
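As a small illustration of maxDist and nomatch, here is a sketch on made-up inputs (the raw_names vector and the "Other" label are assumptions for the example, not data from the question). amatch() returns a position in the lookup table, and nomatch = 0L is used here so that unmatched names can be mapped to a fallback label.

```r
library(stringdist)

commodities <- c("Pencil", "Pen", "Marker", "Chair", "Drawer", "Monitor", "CPU")
raw_names   <- c("Pancil", "Pen", "Markr", "Xylophone")  # illustrative inputs

# Position of the closest commodity within distance 1, or 0 if there is none
idx <- amatch(raw_names, commodities, maxDist = 1, nomatch = 0L)

# Shift the index by one so that position 0 selects the fallback label
matched <- c("Other", commodities)[idx + 1L]
matched
```

A small maxDist keeps unrelated names out of the known categories, at the cost of missing heavier misspellings.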

library(dplyr)
library(stringdist)

# Reading the files
file1 <- readr::read_csv("file1.csv")
file2 <- readr::read_csv("file2.csv")

# Getting the commodity names in a vector    
commodities <- file1 %>% distinct(`Commodity Name`) %>% pull()

# Finding the closest string match of the commodities, and joining the file containing the categories 
file2 %>%
    mutate(`Commodity Name` = commodities[amatch(`Commodity Name`, commodities, maxDist = 5)]) %>%
    left_join(file1, by = "Commodity Name")

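This returns a data frame containing the corrected commodity names and their categories. If an original commodity name is more than 5 characters away (a simplified explanation of string distance) from every possible commodity name, the corrected name will be NA.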
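One thing worth checking before settling on maxDist is how far the raw names actually are from the reference list. With unit weights, "Pencil-HB 02" is six insertions away from "Pencil", so it falls outside maxDist = 5. The sketch below inspects the distances and then strips the suffix before matching, along the lines of Step 1 in the question. It assumes the brand or model suffix always follows the first hyphen, which may not hold for real data.

```r
library(stringdist)

commodities <- c("Pencil", "Pen", "Marker", "Chair", "Drawer", "Monitor", "CPU")
raw_names <- c("Pancil", "Pencil-HB 02", "Pencil-Apsara", "Pancil-Nataraj",
               "Pen-Parker", "Pen-Reynolds", "Monitor-X001RL")

# Unit-weight OSA distance from every raw name to every commodity;
# the row minimum is the distance to the nearest candidate.
d <- stringdistmatrix(raw_names, commodities, method = "osa")
nearest_dist <- apply(d, 1, min)

# Assumption: everything from the first hyphen onwards is a brand/model suffix
stems <- sub("-.*$", "", raw_names)

# With the suffix removed, a tight maxDist is enough to fix the misspellings
corrected <- commodities[amatch(stems, commodities, maxDist = 2)]
data.frame(raw = raw_names, stem = stems, corrected = corrected)
```

Cleaning first keeps maxDist small, which lowers the chance that a short name such as "Pen" or "CPU" absorbs unrelated entries.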