I have a CSV file with two columns, Commodity Category and Commodity Name.
Ex:
Sl.No.  Commodity Category  Commodity Name
1       Stationary          Pencil
2       Stationary          Pen
3       Stationary          Marker
4       Office Utensils     Chair
5       Office Utensils     Drawer
6       Hardware            Monitor
7       Hardware            CPU
I also have another CSV file that contains various commodity names.
Ex:
Sl.No.  Commodity Name
1       Pancil
2       Pencil-HB 02
3       Pencil-Apsara
4       Pancil-Nataraj
5       Pen-Parker
6       Pen-Reynolds
7       Monitor-X001RL
I would like to standardise the commodity names and classify them into their respective Commodity Categories, as shown below:
Sl.No.  Commodity Name  Commodity Category
1       Pencil          Stationary
2       Pencil          Stationary
3       Pencil          Stationary
4       Pencil          Stationary
5       Pen             Stationary
6       Pen             Stationary
7       Monitor         Hardware
Step 1) I first have to use NLTK (text mining methods) to clean the data, so as to separate "Pencil" from "Pencil-HB 02".
Step 2) After cleaning, I have to use an approximate string matching technique, i.e. agrep(), to match patterns such as "Pencil *" or to correct "Pancil" to "Pencil".
Step 3) Once the names are corrected, I have to categorise them. I have no idea how.
This is what I have thought of so far. I started with Step 2 and I am stuck there: I cannot find a concrete way to code it. Is there any way to get the required output? If so, please suggest a method I can proceed with.
Answer 0 (score: 3)
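You can use the stringdist package. The correct function below corrects each Commodity.Name in file2 to the closest entry in CName (the commodity names from file1), based on string distance.
A join then connects the two tables (inner_join in the code below; left_join gives the same result here, because every corrected name exists in file1).
I also noticed some misclassifications when using the default options of stringdistmatrix. You can try changing the weight argument of stringdistmatrix to get better corrections.
> library(dplyr)
> library(stringdist)
>
> # stringsAsFactors = TRUE keeps Commodity.Name a factor so that levels() works
> # (this was the read.csv default before R 4.0)
> file1 <- read.csv("/Users/Randy/Desktop/file1.csv", stringsAsFactors = TRUE)
> file2 <- read.csv("/Users/Randy/Desktop/file2.csv", stringsAsFactors = TRUE)
>
> head(file1)
  Sl.No. Commodity.Category Commodity.Name
1      1         Stationary         Pencil
2      2         Stationary            Pen
3      3         Stationary         Marker
4      4    Office Utensils          Chair
5      5    Office Utensils         Drawer
6      6           Hardware        Monitor
> head(file2)
  Sl.No. Commodity.Name
1      1         Pancil
2      2   Pencil-HB 02
3      3  Pencil-Apsara
4      4 Pancil-Nataraj
5      5     Pen-Parker
6      6   Pen-Reynolds
>
> # Known commodity names, used as the correction targets
> CName <- levels(file1$Commodity.Name)
> # For each input, pick the known name with the smallest weighted distance
> correct <- function(x){
+   factor(sapply(x, function(z) CName[which.min(stringdistmatrix(z, CName, weight=c(1,0.1,1,1)))]), CName)
+ }
>
> correctedfile2 <- file2 %>%
+   transmute(Commodity.Name.Old = Commodity.Name, Commodity.Name = correct(Commodity.Name))
>
> correctedfile2 %>%
+   inner_join(file1[,-1], by="Commodity.Name")
  Commodity.Name.Old Commodity.Name Commodity.Category
1             Pancil         Pencil         Stationary
2       Pencil-HB 02         Pencil         Stationary
3      Pencil-Apsara         Pencil         Stationary
4     Pancil-Nataraj         Pencil         Stationary
5         Pen-Parker            Pen         Stationary
6       Pen-Reynolds            Pen         Stationary
7     Monitor-X001RL        Monitor           Hardware
If you need an "Others" category, you just need to play with the weights.
I added a row "Diesel" to file2. The score is then computed using stringdist with custom weights (you should try varying the values). If the score is larger than 2 (this value depends on how the weights are assigned), nothing is corrected.
PS: Since we don't know all the possible labels, we have to use as.character to convert the factor to character.
PS2: I also use tolower for case-insensitive scoring.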
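As a rough sketch of the "Others" idea described in this answer (not the answerer's original code): the helper name correct_or_keep, the inline sample data and the "Others" label are assumptions made for illustration, while the weights and the cutoff of 2 are the values mentioned above.

```r
library(stringdist)

# Reference table, copied from file1 in the question
file1 <- data.frame(
  Commodity.Category = c("Stationary", "Stationary", "Stationary",
                         "Office Utensils", "Office Utensils",
                         "Hardware", "Hardware"),
  Commodity.Name = c("Pencil", "Pen", "Marker", "Chair", "Drawer",
                     "Monitor", "CPU"),
  stringsAsFactors = FALSE
)

# A few raw names from file2, plus the extra "Diesel" row
raw <- c("Pancil", "Pencil-HB 02", "Pen-Parker", "Monitor-X001RL", "Diesel")

# Hypothetical helper: return the closest known name, or the input unchanged
# when even the best score exceeds the cutoff.
correct_or_keep <- function(x, known, weight = c(1, 0.1, 1, 1), cutoff = 2) {
  x <- as.character(x)  # accept factor or character input
  vapply(x, function(z) {
    # Case-insensitive weighted OSA distance to every known name
    scores <- stringdist(tolower(z), tolower(known), method = "osa", weight = weight)
    if (min(scores) > cutoff) z else known[which.min(scores)]
  }, character(1), USE.NAMES = FALSE)
}

corrected <- correct_or_keep(raw, file1$Commodity.Name)

# Look up the category; names left uncorrected fall into "Others"
category <- file1$Commodity.Category[match(corrected, file1$Commodity.Name)]
category[is.na(category)] <- "Others"

result <- data.frame(Commodity.Name.Old = raw,
                     Commodity.Name = corrected,
                     Commodity.Category = category,
                     stringsAsFactors = FALSE)
print(result)
```

The cutoff is only meaningful relative to the weights, so the two are worth tuning together on real data.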
Answer 1 (score: 0)
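There is an "approximate string matching" function, amatch(), in {stringdist} (at least in 0.9.4.6) that returns the most probable match from a predefined set of words. Its maxDist parameter sets the maximum distance allowed for a match, and its nomatch parameter can be used for an "Others" category. Otherwise, the method, weights, etc. can be set in the same way as for stringdistmatrix().
So your original problem can be solved with a tidyverse-compatible solution like this: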
library(dplyr)
library(stringdist)
# Reading the files
file1 <- readr::read_csv("file1.csv")
file2 <- readr::read_csv("file2.csv")
# Getting the commodity names in a vector
commodities <- file1 %>% distinct(`Commodity Name`) %>% pull()
# Finding the closest string match of the commodities, and joining the file containing the categories
file2 %>%
  mutate(`Commodity Name` = commodities[amatch(`Commodity Name`, commodities, maxDist = 5)]) %>%
  left_join(file1, by = "Commodity Name")
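This returns a data frame containing the corrected commodity names and their categories. If the original Commodity Name is more than 5 characters away (a simplified description of string distance) from every possible commodity name, the corrected name will be NA.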
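A possible variation, sketched under assumptions rather than taken from this answer: first strip everything from the first hyphen onward (the question's Step 1, assuming a hyphen always separates the base name from the variant), then use nomatch to route unmatched names to an "Others" category. The maxDist value of 2, the "Others" label and the inline sample data are illustrative choices.

```r
library(stringdist)

# Known names and their categories, copied from file1 in the question
commodities <- c("Pencil", "Pen", "Marker", "Chair", "Drawer", "Monitor", "CPU")
categories  <- c("Stationary", "Stationary", "Stationary",
                 "Office Utensils", "Office Utensils", "Hardware", "Hardware")

raw <- c("Pancil", "Pencil-HB 02", "Pancil-Nataraj", "Pen-Parker",
         "Monitor-X001RL", "Diesel")

# Step 1: keep only the part before the first hyphen
base_name <- trimws(sub("-.*$", "", raw))

# Step 2: approximate match; unmatched names get an index one past the end
idx <- amatch(base_name, commodities, maxDist = 2,
              nomatch = length(commodities) + 1L)

# Step 3: index into vectors that carry one extra slot for unmatched names
result <- data.frame(
  Commodity.Name.Old = raw,
  Commodity.Name     = c(commodities, NA)[idx],
  Commodity.Category = c(categories, "Others")[idx],
  stringsAsFactors = FALSE
)
print(result)
```

Cleaning before matching keeps the distances small, so a tight maxDist can separate typos such as "Pancil" from genuinely unknown items such as "Diesel".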