Frequency of each word in a data frame and finding the most frequent one

Time: 2019-05-01 15:01:23

Tags: r dataframe text-processing

I have a data frame, and I want to get the weight of each word in each sentence using a TDM or DTM. From these weights, I want to get the maximum weight and the word that carries it, and then apply calculations to the weight of each word.

My data frame looks like this:

       text
 1.   miralisitin manzoorpashteen
 2.   She is best of best.
 3.   Try again and again.
 4.   Beware of this woman. She is bad woman.
 5.   Hold! hold and hold it tight.

I want it to look like:

       text                                      wordweight      maxword  maxcount
 1.   miralisitin manzoorpashteen               1 1             NA       NA
 2.   She is best of best.                      1 1 2 1         best     2
 3.   Try again and again.                      1 2 1           again    2
 4.   Beware of this woman. She is bad woman.   1 1 1 2 1 1 1   woman    2
 5.   Hold! hold and hold it tight.             3 1 1 1         hold     3

How can I do this?

I have tried the quanteda library, but could not get the result because its dfm() function works on a corpus, not on a data frame. It could also be done with tm's DTM, but not in this way.

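1 answer:

Answer 0 (score: 1)

The following solution gives you a frequency table of the words in each sentence. You should be able to post-process it to get what you want.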

library(stringr)

df <- structure(list(text = structure(c(3L, 4L, 5L, 1L, 2L), 
                           .Label = c("Beware of this woman. She is bad woman.", 
                            "Hold! hold and hold it tight.", "miralisitin manzoorpashteen", 
                            "She is best of best.", "Try again and again."), 
                class = "factor")), class = "data.frame", row.names = c(NA, -5L)) 

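word_counts <- lapply(df$text, function(x) {
  # Replace every non-alphanumeric character (punctuation) with a space
  cleaned <- str_replace_all(as.character(x), "[^[:alnum:]]", " ")
  # Collapse repeated whitespace and trim both ends
  cleaned <- gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", cleaned, perl = TRUE)
  # Split on single spaces, lower-case the words, and tabulate them
  table(tolower(unlist(strsplit(cleaned, " "))))
})
word_counts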
#> [[1]] 
#> manzoorpashteen     miralisitin 
#>               1               1 
#> [[2]]
#> best   is   of  she 
#>    2    1    1    1 
#> 
#> [[3]]
#> again   and   try 
#>     2     1     1 
#> [[4]]
#>    bad beware     is     of    she   this  woman 
#>      1      1      1      1      1      1      2 
#> 
#> [[5]]
#>   and  hold    it tight 
#>     1     3     1     1

Created on 2019-05-01 by the reprex package (v0.2.1)
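
One possible way to do the post-processing and arrive at the columns asked for in the question is sketched below in base R. This is only an illustration under a few assumptions: `count_words` is a hypothetical helper (not part of any package), the text column is rebuilt as plain character rather than a factor, weights are listed in order of first appearance, and `maxword`/`maxcount` are set to `NA` when no word occurs more than once, which is inferred from the first row of the desired output.

```r
# Sample data from the question, as a character column
df <- data.frame(
  text = c("miralisitin manzoorpashteen",
           "She is best of best.",
           "Try again and again.",
           "Beware of this woman. She is bad woman.",
           "Hold! hold and hold it tight."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: word counts for one sentence,
# kept in order of first appearance rather than alphabetical order
count_words <- function(sentence) {
  words <- tolower(unlist(strsplit(sentence, "[^[:alnum:]]+")))
  words <- words[nzchar(words)]  # drop empty strings caused by leading punctuation
  table(factor(words, levels = unique(words)))
}

counts <- lapply(df$text, count_words)

# Space-separated weights, one string per row
df$wordweight <- vapply(counts,
                        function(tb) paste(as.vector(tb), collapse = " "),
                        character(1))

# Most frequent word; NA when no word repeats (assumption from the desired output)
df$maxword <- vapply(counts,
                     function(tb) if (max(tb) > 1) names(tb)[which.max(tb)] else NA_character_,
                     character(1))

# Count of the most frequent word, with the same NA rule
df$maxcount <- vapply(counts,
                      function(tb) if (max(tb) > 1) as.integer(max(tb)) else NA_integer_,
                      integer(1))

df
```

Keeping `counts` as a list of tables means further per-word calculations can be applied with another `lapply` before anything is collapsed into strings. Note that `which.max` returns the first maximum, so ties go to the word that appears earliest in the sentence.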