Question

I have a corpus of 39 text files named by year: 1945.txt, 1978.txt, ..., 2013.txt.

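I have imported them into R and created a Document Term Matrix using the tm package. I am trying to investigate how the words associated with the term "fraud" have changed over the years between 1945 and 2013. The desired output would be a 39-by-10 (or 39-by-5) matrix, with the years as row names and the top 10 or 5 terms as columns.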

Any help would be much appreciated.

Thanks in advance.

The structure of my TDM:

> str(ytdm)
List of 6
 $ i       : int [1:6791] 5 7 8 17 32 41 42 55 58 71 ...
 $ j       : int [1:6791] 1 1 1 1 1 1 1 1 1 1 ...
 $ v       : num [1:6791] 2 4 2 2 2 8 4 3 2 2 ...
 $ nrow    : int 193
 $ ncol    : int 39
 $ dimnames:List of 2
  ..$ Terms: chr [1:193] "abus" "access" "account" "accur" ...
  ..$ Docs : chr [1:39] "1947" "1976" "1977" "1978" ...
 - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
 - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"

My ideal output is like this:


1947   account accur gao medicine fed ......
1948   access  .............
.
.
.
.
.
.

Answer 1

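Your example is not reproducible, but findAssocs() may be what you are looking for. Since you want to look at the associated terms within each year separately, you will need one TDM per year.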

> library(tm)
> data(crude)
> # I don't have your data, so pretend this is a corpus of docs for each year
> names(crude) <- rep(c("1999","2000"),10)
> # create a dtm for each year
> dtm.list <- lapply(unique(names(crude)),function(x) TermDocumentMatrix(crude[names(crude)==x]))
> # get associations for each year
> assoc.list <- lapply(dtm.list,findAssocs,term="oil",corlimit=0.7)
> names(assoc.list) <- unique(names(crude))
> assoc.list
$`1999`
 prices barrel. 
   0.79    0.70 

$`2000`
     15.8      opec       and      said   prices,      sell       the  analysts   clearly     fixed 
     0.94      0.94      0.92      0.92      0.91      0.91      0.88      0.85      0.85      0.85 
     late   meeting     never      that    trying       who    winter emergency     above       but 
     0.85      0.85      0.85      0.85      0.85      0.85      0.85      0.84      0.83      0.83 
    world      they       mln    market agreement    before       bpd    buyers    energy    prices 
     0.82      0.80      0.79      0.78      0.75      0.75      0.75      0.75      0.75      0.75 
      set   through     under      will       not       its 
     0.75      0.75      0.75      0.74      0.72      0.70 

> # or if you want the 5 top terms
> assoc.list <- lapply(dtm.list,function(x) names(findAssocs(x,"oil",0)[1:5]))
> names(assoc.list) <- unique(names(crude))
> assoc.list
$`1999`
[1] "prices"   "barrel."  "said."    "minister" "arabian" 

$`2000`
[1] "15.8"    "opec"    "and"     "said"    "prices,"
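
To get from a list like this to the year-by-term matrix asked for in the question, the per-year vectors can be stacked as rows. The following is only a minimal base R sketch: the input is hard-coded from the output printed above rather than computed, and `n_top`, `top_terms` and the `term1`...`term5` column names are made up for illustration. Note also that in more recent versions of tm, findAssocs() appears to return a list with one element per queried term, so the character vectors may need to be extracted with something like `[["oil"]]` before this step.

```r
# Hypothetical input: one character vector of top terms per year,
# copied from the assoc.list output above (not recomputed here).
assoc.list <- list(
  `1999` = c("prices", "barrel.", "said.", "minister", "arabian"),
  `2000` = c("15.8", "opec", "and", "said", "prices,")
)

n_top <- 5  # number of columns wanted (5 or 10 in the question)

# Take the first n_top terms for each year; indexing past the end of a
# shorter vector yields NA, so every year ends up with the same length.
# vapply() returns an n_top-by-years matrix, so transpose it.
top_terms <- t(vapply(
  assoc.list,
  function(terms) terms[seq_len(n_top)],
  character(n_top)
))
colnames(top_terms) <- paste0("term", seq_len(n_top))

top_terms
```

The years end up as row names because vapply() carries over the names of the list, so the same code should work unchanged for 39 years once assoc.list holds one entry per year.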
