R function to find subsets of words by letter

Asked: 2017-02-18 13:46:02

Tags: r statistics nlp

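I am looking for a way to build subsets of a word list, one subset per letter, where each subset holds the words that contain that letter.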

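I know I can use the gregexpr function to check whether a letter occurs in a word, but I have not managed to create the subset of words that contain a specific letter.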

I have been able to count how many times each letter occurs across the word list:

> letters_table2<-table(unlist(strsplit(newdata2, ""), use.names=FALSE))
> letters_table2

 a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z 
14  9 11  8 11  6  4  7 12  3  3  9 14  7  9  8  6 13 13  6  7  8  4  7  8  3 

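From newdata2 I would like to create one word list with only the words containing a, another with only the words containing b, another for c, and so on.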

    newdata2
  [1] "ae" "aj" "al" "an" "av" "av" "ay" "ba" "bd" "bd" "bk" "bl" "bv" "ca" "cl" "cm" "co"
 [18] "cr" "cy" "dh" "dl" "dm" "ea" "ec" "ef" "er" "ex" "ex" "ez" "fm" "fo" "ft" "gi" "gy"
 [35] "hb" "hm" "hr" "hr" "hs" "id" "in" "io" "iq" "ir" "ir" "it" "iz" "ja" "js" "kn" "lc"
 [52] "ld" "le" "lp" "ls" "me" "mg" "mh" "mi" "mi" "mm" "mo" "ms" "nf" "nw" "ny" "ok" "op"
 [69] "ox" "pa" "pi" "pr" "ps" "ps" "py" "qc" "qf" "qm" "qu" "qy" "rn" "rr" "rs" "rt" "ru"
 [86] "sa" "so" "ss" "ts" "uc" "us" "uu" "ux" "vb" "vc" "vv" "vw" "wb" "wg" "xe" "xo" "xt"
[103] "yd" "yt" "za"

1 Answer:

Answer 0 (score: 1)

I would suggest the following, where x is your character vector of words (newdata2 in the question):

setNames(lapply(letters, function(y) grep(y, x, value = TRUE)), letters)
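As a rough illustration of what that one-liner returns and how one subset might be pulled out of it, here is a minimal sketch. The vector x below is a made-up five-word sample, not the asker's data, and the grepl line is only an assumed alternative for the single-letter case.

```r
# Hypothetical word vector standing in for x / newdata2
x <- c("ae", "ba", "bd", "ca", "cl")

# One list element per letter; each element holds the words containing that letter
subsets <- setNames(lapply(letters, function(y) grep(y, x, value = TRUE)), letters)

# Extract the subset for a single letter by name
subsets[["a"]]
subsets$b

# Letters that occur in no word yield an empty character vector
subsets$z

# For a single letter, logical indexing with grepl() gives the same words
only_a <- x[grepl("a", x, fixed = TRUE)]
only_a
```

Because the result is a named list, every letter keeps its own slot even when nothing matches, which makes it easy to loop over or to filter afterwards.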

Here is a simple example that uses only 5 letters instead of all 26.

set.seed(1)
mydata <- paste0(sample(letters[1:5], 15, TRUE), 
                 sample(letters[1:5], 15, TRUE))
table(unlist(strsplit(mydata, ""), use.names = FALSE))
## 
##  a  b  c  d  e 
##  4 11  2  7  6 
setNames(lapply(letters[1:5], function(y) {
  grep(y, mydata, value = TRUE)
}), letters[1:5])
## $a
## [1] "da" "ab" "aa"
## 
## $b
##  [1] "bc" "bd" "eb" "bd" "eb" "ab" "bb" "db" "be" "db"
## 
## $c
## [1] "bc" "ce"
## 
## $d
## [1] "bd" "bd" "dd" "da" "db" "db"
## 
## $e
## [1] "ce" "eb" "ee" "eb" "be"
##
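Note that the table() counts and the subset sizes measure different things: table() counts letter occurrences, while each subset counts words, so a word such as "aa" adds 2 to the table but appears only once under $a. The sketch below shows the difference on a small made-up vector (words is hypothetical, not the sampled mydata above).

```r
# Hypothetical words; "aa" contains the letter a twice
words <- c("da", "ab", "aa", "bc")
lets  <- letters[1:4]

# Words containing each letter (fixed = TRUE treats the pattern as a literal string)
subsets <- setNames(
  lapply(lets, function(y) grep(y, words, value = TRUE, fixed = TRUE)),
  lets
)

# Number of words per letter
words_per_letter <- lengths(subsets)
words_per_letter

# Number of letter occurrences across all words
occ <- table(unlist(strsplit(words, ""), use.names = FALSE))
occ
```

Here the letter a is found in three words but occurs four times in total, which mirrors the a = 4 count versus the three words listed under $a in the output above.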