R text mining: finding the frequency of character patterns

Time: 2015-11-20 00:34:16

Tags: r data-mining text-mining

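I am trying to find the frequency of character patterns (parts of words) in a large data set.

For example, I have the following list in a csv file: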

  • applestrawberrylime
  • applegrapelime
  • pineapplemangoguava
  • kiwiguava
  • grapeapple
  • mixedberry
  • kiwiguavapineapple
  • limemixedberry

Is there a way to find the frequency of all character combinations? For example:

  • appleberry
  • guava
  • applestrawberry
  • kiwiguava
  • grapeapple
  • straw
  • app
  • ap
  • wig
  • mem

Update: this is what I am using to find the frequency of all character patterns of length 3 in my data:

threecombo  <- do.call(paste0,expand.grid(rep(list(c('a', 'b', 'c', 'd','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z')), 3)))

threecompare<-sapply(threecombo, function(x) length(grep(x, myData)))

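The code works the way I want, and I would like to repeat the steps above for longer patterns (length 4, 5, 6, and so on), but it takes a while to run. Is there a better way to do this?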

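2 Answers:

Answer 0: (score: 2)

Since you are probably looking for combinations of fruit flavors in a set of texts that also contains non-fruit words, I have made up some documents similar to those in your example. I used the quanteda package to build a document-term matrix, and then filtered it to the ngrams that contain the fruit words.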

docs <- c("One flavor is apple strawberry lime.", 
          "Another flavor is apple grape lime.", 
          "Pineapple mango guava is our newest flavor.",
          "There is also kiwi guava and grape apple.", 
          "Mixed berry was introduced last year.", 
          "Did you like kiwi guava pineapple?",
          "Try the lime mixed berry.")
flavorwords <- c("apple", "guava", "berry", "kiwi", "guava", "grape")

require(quanteda)
# form a document-feature matrix of unigrams, bigrams, and trigrams,
# ignoring common stopwords plus "like" and "also"
fruitDfm <- dfm(docs, ngrams = 1:3, ignoredFeatures = c("like", "also", stopwords("english")))
## Creating a dfm from a character vector ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 7 documents
##    ... indexing features: 90 feature types
##    ... removed 47 features, from 176 supplied (glob) feature types
##    ... complete. 
##    ... created a 7 x 40 sparse dfm
## Elapsed time: 0.01 seconds.
# select only those features containing flavorwords as regular expression
fruitDfm <- selectFeatures(fruitDfm, flavorwords, valuetype = "regex")
## kept 22 features, from 5 supplied (regex) feature types
# show the features
topfeatures(fruitDfm, nfeature(fruitDfm))
##                apple                 guava                 grape             pineapple                  kiwi 
##                    3                     3                     2                     2                     2 
##           kiwi_guava                 berry           mixed_berry            strawberry      apple_strawberry 
##                    2                     2                     2                     1                     1 
##      strawberry_lime apple_strawberry_lime           apple_grape            grape_lime      apple_grape_lime 
##                    1                     1                     1                     1                     1 
##      pineapple_mango           mango_guava pineapple_mango_guava           grape_apple       guava_pineapple 
##                    1                     1                     1                     1                     1 
## kiwi_guava_pineapple      lime_mixed_berry 
##                    1                     1 

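Added:

If you want to match terms that are not separated by spaces against the documents, you can form the ngrams with an empty-string concatenator and match them as follows.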

flavorwordsConcat <- c("applestrawberrylime", "applegrapelime", "pineapplemangoguava",
                       "kiwiguava", "grapeapple", "mixedberry", "kiwiguavapineapple",
                       "limemixedberry")

fruitDfm <- dfm(docs, ngrams = 1:3, concatenator = "")
fruitDfm <- fruitDfm[, features(fruitDfm) %in% flavorwordsConcat]
fruitDfm
# Document-feature matrix of: 7 documents, 8 features.
# 7 x 8 sparse Matrix of class "dfmSparse"
#        features
# docs  applestrawberrylime applegrapelime pineapplemangoguava kiwiguava grapeapple mixedberry kiwiguavapineapple limemixedberry
# text1                   1              0                   0         0          0          0                  0              0
# text2                   0              1                   0         0          0          0                  0              0
# text3                   0              0                   1         0          0          0                  0              0
# text4                   0              0                   0         1          1          0                  0              0
# text5                   0              0                   0         0          0          1                  0              0
# text6                   0              0                   0         1          0          0                  1              0
# text7                   0              0                   0         0          0          1                  0              1
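To make it clearer what an empty-string concatenator produces, here is a minimal base R sketch that builds word ngrams for a single document and joins them with "". It is not quanteda's implementation; the helper name word_ngrams and its arguments are hypothetical, and the tokenization (lowercase, letters and spaces only) is a simplifying assumption.

```r
# Hypothetical helper: word ngrams of one text, joined with `concatenator`.
# Assumption: tokens are runs of ASCII letters separated by spaces.
word_ngrams <- function(text, n_values = 1:3, concatenator = "") {
  # lowercase, then drop everything that is not a letter or a space
  cleaned <- gsub("[^a-z ]", "", tolower(text))
  words <- strsplit(cleaned, " +")[[1]]
  words <- words[nzchar(words)]
  unlist(lapply(n_values, function(n) {
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    # paste each window of n consecutive words together
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = concatenator),
           character(1))
  }))
}

grams <- word_ngrams("There is also kiwi guava and grape apple.")
c("kiwiguava", "grapeapple") %in% grams
```

Both concatenated terms are found among the bigrams of this sentence, which is consistent with the text4 row of the matrix above.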

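If your text contains the concatenated flavor words, then you can match a unigram dfm against permutations of the individual fruit words. The call below builds every ordering of the five words; restrict it to three words at a time for trigram permutations.
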
unigramFlavorWords <- c("apple", "guava", "grape", "pineapple", "kiwi")
head(unlist(combinat::permn(unigramFlavorWords, paste, collapse = "")))
[1] "appleguavagrapepineapplekiwi" "appleguavagrapekiwipineapple" "appleguavakiwigrapepineapple" 
[4] "applekiwiguavagrapepineapple" "kiwiappleguavagrapepineapple" "kiwiappleguavapineapplegrape"
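The permn call above orders all five words at once. As a rough sketch of the three-word case, the following base R code builds every ordered choice of three distinct words and counts how many concatenated documents contain each one. The names flavor_words, concat_docs, trigram_perms, and hits are hypothetical, and the data is the list from the question.

```r
# Hypothetical names; data copied from the question.
flavor_words <- c("apple", "guava", "grape", "pineapple", "kiwi")
concat_docs <- c("applestrawberrylime", "applegrapelime", "pineapplemangoguava",
                 "kiwiguava", "grapeapple", "mixedberry", "kiwiguavapineapple",
                 "limemixedberry")

# every ordered selection of three words, repeats included (5^3 rows)
grid <- expand.grid(rep(list(flavor_words), 3), stringsAsFactors = FALSE)
# keep only the rows whose three words are all different (5 * 4 * 3 = 60 rows)
distinct <- apply(grid, 1, function(r) length(unique(r)) == 3)
trigram_perms <- do.call(paste0, grid[distinct, ])

# number of documents containing each permutation as a literal substring
hits <- sapply(trigram_perms,
               function(p) sum(grepl(p, concat_docs, fixed = TRUE)))
hits[hits > 0]
```

With this data only kiwiguavapineapple is found, since no other document strings together three of these five words.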

Answer 1: (score: 1)

Your initial problem is a simple task for grep / grepl, and I see you have already incorporated this part of the answer into your revised question.

docs <- c('applestrawberrylime', 'applegrapelime', 'pineapplemangoguava',
          'kiwiguava', 'grapeapple', 'mixedberry', 'kiwiguavapineapple',
          'limemixedberry')

patterns <- c('appleberry', 'guava', 'applestrawberry', 'kiwiguava',
              'grapeapple', 'grape', 'app', 'ap', 'wig', 'mem', 'go')

# how often does each pattern occur in the set of docs?
sapply(patterns, function(x) sum(grepl(x, docs)))

If you want to check every possible pattern, you could search for every combination of letters (as you have started to do above), but that is obviously the long way around.

One strategy is to count only the frequency of each pattern that actually occurs. Each document of character length n has 1 possible pattern of length n, 2 patterns of length n - 1, and so on. You can extract each of these and then count them.

all_patterns <- lapply(docs, function(x) {
    # individual chars in this doc
    chars <- unlist(strsplit(x, ''))
    # unique possible sequence lengths
    seqs <- sapply(1:nchar(x), seq)
    # each sequence in each position
    sapply(seqs, function(y) {
        start_pos <- 0:(nchar(x) - max(y))
        sapply(start_pos, function(z) paste(chars[z + y], collapse=''))
    })
})

unq_patterns <- unique(unlist(all_patterns))

# how often does each unique pattern occur in the set of docs?
occur <- sapply(unq_patterns, function(x) sum(grepl(x, docs)))

# top 25 most frequent patterns
sort(occur, decreasing = TRUE)[1:25]
#     e     i     a     l     p     r     m    ap    pp    pl    le   app   ppl
#     7     7     6     6     5     5     5     5     5     5     5     5     5
#   ple  appl  pple apple     g     w     b     y    ra    be    er    rr
#     5     5     5     5     5     3     3     3     3     3     3     3

This works and runs quickly on this example, but as the corpus of documents grows you may get bogged down (even in this simple example, there are 625 unique patterns). You could use parallel processing for all of these calls, but still...
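For the fixed-length case in the question (all patterns of length 3, then 4, 5, and so on), one possible shortcut is to skip the per-pattern grepl calls entirely: pull out every substring of length k with substring(), keep each one once per document, and tabulate. This is only a sketch of that idea, not the code from either answer; the function name count_kgrams is hypothetical.

```r
# Hypothetical helper: for each length-k substring that actually occurs,
# count how many documents contain it (one pass over the data, no grepl).
count_kgrams <- function(docs, k) {
  per_doc <- lapply(docs, function(x) {
    n <- nchar(x)
    if (n < k) return(character(0))
    starts <- seq_len(n - k + 1)
    # unique(): a pattern counts once per document, matching sum(grepl(...))
    unique(substring(x, starts, starts + k - 1))
  })
  sort(table(unlist(per_doc)), decreasing = TRUE)
}

docs <- c('applestrawberrylime', 'applegrapelime', 'pineapplemangoguava',
          'kiwiguava', 'grapeapple', 'mixedberry', 'kiwiguavapineapple',
          'limemixedberry')

three <- count_kgrams(docs, 3)
four  <- count_kgrams(docs, 4)
three["app"]
four["appl"]
```

Patterns that never occur do not appear in the result at all, so there is no need to enumerate all 26^k letter combinations. Dropping the unique() call would count total occurrences rather than the number of documents.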