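I have a corpus of 39 text files named by year: 1945.txt, 1978.txt .... 2013.txt.
I have imported them into R and created a Document Term Matrix with the tm package. I am trying to investigate how the words associated with the term "fraud" have changed over the years from 1945 to 2013. The desired output would be a 39-by-10 (or 39-by-5) matrix with the years as row labels and the top 10 or 5 terms as columns.
Any help would be greatly appreciated.
Thanks in advance.
The structure of my TDM: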
> str(ytdm)
List of 6
$ i : int [1:6791] 5 7 8 17 32 41 42 55 58 71 ...
$ j : int [1:6791] 1 1 1 1 1 1 1 1 1 1 ...
$ v : num [1:6791] 2 4 2 2 2 8 4 3 2 2 ...
$ nrow : int 193
$ ncol : int 39
$ dimnames:List of 2
..$ Terms: chr [1:193] "abus" "access" "account" "accur" ...
..$ Docs : chr [1:39] "1947" "1976" "1977" "1978" ...
- attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
- attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
My ideal output is like this:
1947 account accur gao medicine fed ......
1948 access .............
.
.
.
.
.
.
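Answer 0 (score: 3)
Your example is not reproducible, but findAssocs() may be what you are looking for. Since you want to look at the associated terms separately for each year, you need one dtm per year.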
> library(tm)
> data(crude)
> # I don't have your data, so pretend this is a corpus of documents for each year
> names(crude) <- rep(c("1999","2000"),10)
> # create a dtm for each year
> dtm.list <- lapply(unique(names(crude)),function(x) TermDocumentMatrix(crude[names(crude)==x]))
> # get associations for each year
> assoc.list <- lapply(dtm.list,findAssocs,term="oil",corlimit=0.7)
> names(assoc.list) <- unique(names(crude))
> assoc.list
$`1999`
 prices barrel.
   0.79    0.70
$`2000`
     15.8      opec       and      said   prices,      sell       the  analysts   clearly     fixed
     0.94      0.94      0.92      0.92      0.91      0.91      0.88      0.85      0.85      0.85
     late   meeting     never      that    trying       who    winter emergency     above       but
     0.85      0.85      0.85      0.85      0.85      0.85      0.85      0.84      0.83      0.83
    world      they       mln    market agreement    before       bpd    buyers    energy    prices
     0.82      0.80      0.79      0.78      0.75      0.75      0.75      0.75      0.75      0.75
      set   through     under      will       not       its
     0.75      0.75      0.75      0.74      0.72      0.70
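> # or, if you want the top 5 terms
> # (recent tm versions return a list keyed by term from findAssocs(), so there you would index with [["oil"]] before taking [1:5])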
> assoc.list <- lapply(dtm.list,function(x) names(findAssocs(x,"oil",0)[1:5]))
> names(assoc.list) <- unique(names(crude))
> assoc.list
$`1999`
[1] "prices" "barrel." "said." "minister" "arabian"
$`2000`
[1] "15.8" "opec" "and" "said" "prices,"
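The question asks for a matrix with years as rows and the top terms as columns, while the code above stops at a list. A minimal sketch of that last step follows, using only base R. The `assoc.list` here is typed in by hand from the output above so the snippet runs on its own; `pad_to`, `top.n`, `top.terms` and the `term1`...`term5` column names are hypothetical names introduced for illustration, not part of tm.

```r
# Top-5 term lists per year, copied by hand from the output above
assoc.list <- list(
  `1999` = c("prices", "barrel.", "said.", "minister", "arabian"),
  `2000` = c("15.8", "opec", "and", "said", "prices,")
)

# Hypothetical helper: pad with NA (or truncate) to exactly n entries,
# so that years with fewer associated terms still give a full row
pad_to <- function(x, n) c(x, rep(NA_character_, n))[seq_len(n)]

top.n <- 5

# Stack one row per year; the row names come from the names of the list
top.terms <- do.call(rbind, lapply(assoc.list, pad_to, n = top.n))
colnames(top.terms) <- paste0("term", seq_len(top.n))

top.terms
```

With 39 years in the list, the same call should give the 39-by-5 matrix described in the question; setting `top.n <- 10` gives the 39-by-10 variant.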