Classifying the same pattern in words using R

Time: 2018-09-15 15:41:09

Tags: r dplyr tm fuzzy-search

I want to perform a text mining analysis, but I have run into some trouble. Using dput(), I have included a small part of my data:

text<-structure(list(ID_C_REGCODES_CASH_VOUCHER = c(3941L, 3941L, 3941L, 3945L, 3945L, 3945L, 3945L, 3945L, 3945L, 3945L, 3953L, 3953L, 3953L, 3953L, 3953L, 3953L, 3960L, 3960L, 3960L, 3960L, 3960L, 3960L, 3967L, 3967L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), GOODS_NAME = structure(c(19L, 17L, 15L, 18L, 16L, 23L, 21L, 14L, 22L, 20L, 6L, 2L, 10L, 8L, 7L, 13L, 5L, 11L, 7L, 12L, 4L, 3L, 9L, 9L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L ), .Label = c("", "* 2108609 SLOB.Mayon.OLIVK.67% 400ml", "* 3014084 D.Dym.Spikachki DEREVEN.MINI 1kg", "* 3398012 DD Kolb.SERV.OKHOTN in / to v / y0.35", "* 3426789 WH.The corn rav guava / yagn.d / CAT seed 85g", "197 Onion 1 kg", "2013077 MAKFA Makar.RAKERS 450g", "2030918 MARIA TRADITIONAL Biscuit 180g", "2049750 MAKFA Makar.SHIGHTS 450g", "3420159 LEBED.Mol.past.3,4-4,5% 900g", "3491144 LIP.NAP.ICE TEA green yellow 0.5 liter", "6788 MAKFA Makar.perya 450g", "809 Bananas 1kg", "FetaXa Cheese product 60% 400g (", "Lemons 55+", "MAKFA Macaroni feathers like. in / with", "Napkins paper color 100pcs PL", "Package \"Magnet\" white (Plastiktre)", "Pasta Makfa snail flow-pack 450 g.", "SHEBEKINSKIE Macaroni Butterfly №40", "SOFT Cotton sticks 100 PE (BELL", "TENDER AGE Cottage cheese 10", "TOBUS steering-wheel 0.5kg flow" ), class = "factor")), .Names = c("ID_C_REGCODES_CASH_VOUCHER", "GOODS_NAME"), class = "data.frame", row.names = c(NA, -61L))

(The NAs are accidental.) The text consists of product names from receipts.

I want to group similar names together.

For example, here I manually picked out MAKFA makar (a Ukrainian name). I found that there are 7 rows with the root or key word MAKFA Makar:

Pasta Makfa snail flow-pack 450 g.
MAKFA Macaroni feathers like. in / with
2013077 MAKFA Makar.RAKERS 450g
2013077 MAKFA Makar.RAKERS 450g
6788 MAKFA Makar.perya 450g
2049750 MAKFA Makar.SHIGHTS 450g
2049750 MAKFA Makar.SHIGHTS 450g

All of these product positions share the same root, even though MAKFA Makar is not written the same way in every row. As output, I want to get:

MFAMKR

How can I classify products by their word roots? (Conversely, the same pattern is present in words such as Makar, Makfa, cheese, and so on.)

2 answers:

Answer 0: (score: 2)

I think you can get where you want to go by cleaning the text and then clustering it. Here is a starting point:

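text <- text[1:24,]  # keep only the 24 rows that actually contain product names
library(quanteda)
library(quanteda.textstats)  # textstat_simil() lives here in quanteda >= 3
library(tidyverse)
hc <- text %>% 
  pull(GOODS_NAME) %>% 
  as.character() %>% 
  quanteda::tokens(
    remove_numbers = TRUE,  
    remove_punct = TRUE,
    remove_symbols = TRUE, 
    remove_separators = TRUE
  ) %>% 
  quanteda::tokens_tolower() %>% 
  # drop tokens that start with a digit (product codes, weights such as 450g)
  quanteda::tokens_remove(valuetype = "regex", pattern = "^\\d.*") %>% 
  quanteda::dfm() %>% 
  textstat_simil(method = "jaccard") %>% 
  as.matrix() %>% 
  magrittr::multiply_by(-1) %>%  # negate the similarity so that similar names end up "close"
  as.dist() %>%                  # hclust() expects a dist object
  `attr<-`("Labels", as.character(text$GOODS_NAME)) %>% 
  hclust(method = "average") 

# draw the dendrogram into a temporary PDF and open it
pdf(tf <- tempfile(fileext = ".pdf"), width = 20, height = 10)
plot(hc)
dev.off()
shell.exec(tf)  # Windows only; on other systems open the file at path tf manually

# cut the tree where the (negated) average Jaccard similarity is -0.1
clusters <- cutree(hc, h = -0.1)
split(text, clusters)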
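
The same clean-then-cluster idea can be sketched in base R alone, which may help when quanteda is not installed or its API differs between versions. This is only an illustration on a toy subset of the product names: the helpers tokenize() and jaccard_dist() are hypothetical names introduced here, and the sketch uses a Jaccard distance (1 minus similarity) instead of the negated similarity, so the cut height of 0.9 is an assumption chosen to correspond roughly to h = -0.1.

```r
# Toy subset of the product names from the question (not the full data set).
goods <- c(
  "2013077 MAKFA Makar.RAKERS 450g",
  "6788 MAKFA Makar.perya 450g",
  "2049750 MAKFA Makar.SHIGHTS 450g",
  "809 Bananas 1kg",
  "Lemons 55+"
)

# Hypothetical helper: lower-case, split on anything that is not a letter or
# digit, then drop empty tokens and tokens starting with a digit.
tokenize <- function(x) {
  toks <- strsplit(tolower(x), "[^[:alpha:][:digit:]]+")[[1]]
  unique(toks[nzchar(toks) & !grepl("^[0-9]", toks)])
}
token_sets <- lapply(goods, tokenize)

# Hypothetical helper: Jaccard distance between two token sets.
# Note: two empty sets would give NaN; the toy data has none.
jaccard_dist <- function(a, b) {
  1 - length(intersect(a, b)) / length(union(a, b))
}

# Fill the pairwise distance matrix.
n <- length(token_sets)
d <- matrix(0, n, n, dimnames = list(goods, goods))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- jaccard_dist(token_sets[[i]], token_sets[[j]])
  }
}

# Average-linkage clustering, cut at distance 0.9 (similarity 0.1).
hc_base <- hclust(as.dist(d), method = "average")
groups <- cutree(hc_base, h = 0.9)
split(goods, groups)
```

On this toy input the three MAKFA Makar rows share two of their four distinct tokens, so they merge at distance 0.5 and stay together below the cut, while Bananas and Lemons remain separate groups.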

Answer 1: (score: 2)

Here is an approach in which you search the names for a vector of words:

patt <- c("MAKFA Makar.", "kolb","Spikachki", "Bananas", "Lemons",
"Napkins paper", "Cotton sticks","SHEBEKINSKIE Macaroni","CAT seed","Cheese",
"TEA", "Biscuit", "Onion", "steering-wheel", "Package  (Plastiktre)",
"Mayon", "Cottage", "cheese")

lst <-lapply(patt, function(x) text[grep(x,text$GOODS_NAME), ])
do.call(rbind.data.frame, lst)

   ID_C_REGCODES_CASH_VOUCHER                                              GOODS_NAME
15                       3953                         2013077 MAKFA Makar.RAKERS 450g
19                       3960                         2013077 MAKFA Makar.RAKERS 450g
20                       3960                             6788 MAKFA Makar.perya 450g
23                       3967                        2049750 MAKFA Makar.SHIGHTS 450g
24                       3967                        2049750 MAKFA Makar.SHIGHTS 450g
22                       3960              * 3014084 D.Dym.Spikachki DEREVEN.MINI 1kg
16                       3953                                         809 Bananas 1kg
3                        3941                                              Lemons 55+
2                        3941                           Napkins paper color 100pcs PL
7                        3945                         SOFT Cotton sticks 100 PE (BELL
10                       3945                     SHEBEKINSKIE Macaroni Butterfly №40
17                       3960 * 3426789 WH.The corn rav guava / yagn.d / CAT seed 85g
8                        3945                        FetaXa Cheese product 60% 400g (
18                       3960          3491144 LIP.NAP.ICE TEA green yellow 0.5 liter
14                       3953                  2030918 MARIA TRADITIONAL Biscuit 180g
11                       3953                                          197 Onion 1 kg
6                        3945                         TOBUS steering-wheel 0.5kg flow
12                       3953                    * 2108609 SLOB.Mayon.OLIVK.67% 400ml
9                        3945                            TENDER AGE Cottage cheese 10
91                       3945                            TENDER AGE Cottage cheese 10
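
In the output above, the Kolb and Package rows are missing and the Cottage cheese row appears twice. A likely cause is that grep() is case-sensitive and treats each pattern as a regular expression, and that a row is returned once for every pattern it matches. A minimal sketch of one way around this, assuming a hypothetical helper label_goods() and a hypothetical keyword list (neither is from the original answer), assigns each row a single label using literal, case-insensitive matching:

```r
# Toy subset of the product names from the question (not the full data set).
goods <- c(
  "Pasta Makfa snail flow-pack 450 g.",
  "2013077 MAKFA Makar.RAKERS 450g",
  "* 3398012 DD Kolb.SERV.OKHOTN in / to v / y0.35",
  "Package \"Magnet\" white (Plastiktre)",
  "TENDER AGE Cottage cheese 10",
  "809 Bananas 1kg"
)

# Hypothetical label -> keyword map; when several keywords match a row,
# the one listed first wins.
keywords <- c(
  makfa   = "makfa",
  kolb    = "kolb",
  package = "(plastiktre)",
  cottage = "cottage",
  bananas = "bananas"
)

# Hypothetical helper: return one label per element of x (NA if nothing matches).
label_goods <- function(x, keywords) {
  out <- rep(NA_character_, length(x))
  for (k in names(keywords)) {
    # fixed = TRUE matches the keyword literally, so "(" and "." need no
    # escaping; lower-casing both sides makes the match case-insensitive.
    hit <- grepl(tolower(keywords[[k]]), tolower(x), fixed = TRUE)
    out[hit & is.na(out)] <- k
  }
  out
}

labelled <- data.frame(GOODS_NAME = goods, group = label_goods(goods, keywords))
labelled
```

Because every row gets at most one label, the result has exactly one line per product and can be passed to split() or dplyr::group_by() on the group column.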