Finding the correlation between texts and specific words extracted from those texts

Time: 2016-12-07 11:59:52

Tags: r correlation text-mining

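I am trying to run a correlation analysis between a set of texts and a few keywords extracted from those same texts (keywords were selected by frequency > 200).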

I am not sure how to do this in R.

Here is a sample of the data. The main data is (by default) stored in R as a factor, so each text is a level:

head(data)
[1] Call of star wars a halos destiny                                                       
[2] I thought of an new call of duty name CALL OF DUTY: The road of ARK GIANT               
[3] Activation must be destroyed for the sake of video games. Boycott those pieces of shits.
[4] Futuristic˜                                                                          
[5] 1:09 is that the XM 53       

A few of the keywords extracted from the cleaned text corpus:

head(label)
[1] "2016"    "action"  "activis" "actual"  "alreadi" "also" 

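I am trying to build a correlation matrix that shows how each word relates to the texts, and eventually use that matrix to build a network graph for community detection.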

For now, though, my goal is to create a table or matrix like the one below:

                                                star  destroyed  duty 
Call of star wars a halo destiny                  1       0       0
Activation must be destroyed for the sake ....    0       1       0
I thought of new call of duty star                1       0       1

The same should apply to all of the data texts [13281 rows in total] and all of the labels [202 words in total].

1 Answer:

Answer 0: (score: 0)

Assuming you only care about whether each label is present (1) or absent (0) in the data texts, the following should work for you:

data <- c('Call of star wars a halo destiny',
         'I thought of an new call of duty star',
         'Activation must be destroyed for the sake .... ',
         'Futuristic˜',
         '1:09 is that the XM 53')
label <- c("2016","action","activis","actual", "alreadi","also", "star", "destroyed", "duty")

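# Vectorize grepl over 'pattern' so that every label is tested against every text
vgrepl <- Vectorize(grepl, 'pattern', SIMPLIFY = TRUE)
# tolower() makes the match case insensitive; unary + turns TRUE/FALSE into 1/0
df <- +(vgrepl(tolower(label), tolower(data)))
rownames(df) <- data  # note: df is a matrix here, not a data.frame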

df

                                                 2016 action activis actual alreadi also star destroyed duty
Call of star wars a halo destiny                   0      0       0      0       0    0    1         0    0
I thought of an new call of duty star              0      0       0      0       0    0    1         0    1
Activation must be destroyed for the sake ....     0      0       0      0       0    0    0         1    0
Futuristic˜                                        0      0       0      0       0    0    0         0    0
1:09 is that the XM 53                             0      0       0      0       0    0    0         0    0
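
One caveat: grepl matches substrings, so a label such as "star" would also be found inside a longer word like "restart". If whole-word matches are wanted, and if the 0/1 matrix is meant to feed a correlation matrix as the question describes, something along the following lines might work. This is only a sketch: the helper name label_matrix is made up for illustration, the labels are assumed to contain no regex metacharacters, and the sample texts are shortened to plain ASCII.

```r
# Hypothetical helper (not part of the answer above): 0/1 matrix of whole-word matches.
label_matrix <- function(texts, labels) {
  txt <- tolower(as.character(texts))  # as.character() also covers factor input
  hits <- vapply(tolower(labels), function(lab) {
    # \\b marks a word boundary; assumes labels contain no regex metacharacters
    as.integer(grepl(paste0("\\b", lab, "\\b"), txt, perl = TRUE))
  }, integer(length(txt)))
  matrix(hits, nrow = length(txt),
         dimnames = list(as.character(texts), labels))
}

data <- c('Call of star wars a halo destiny',
          'I thought of an new call of duty star',
          'Activation must be destroyed for the sake .... ',
          'Futuristic',
          '1:09 is that the XM 53')
label <- c("2016", "star", "destroyed", "duty")

m <- label_matrix(data, label)

# A label that never occurs (or occurs in every text) has zero variance,
# so cor() would return NA for it; drop such columns first.
varies <- apply(m, 2, function(col) length(unique(col)) > 1)
cmat <- cor(m[, varies, drop = FALSE])
cmat
```

Here cmat holds the pairwise Pearson correlations between the label columns, i.e. how often two labels appear in the same texts. With 13281 rows and 202 labels the same calls should still be feasible, though that has not been tested on the real data.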