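I am trying to run a correlation analysis between a text and a few keywords extracted from that same text (keywords were selected by frequency > 200).
I am not sure how to do this in R.
Here is a sample of the data. The main data is (by default) encoded as factor levels in R.
head(data)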
[1] Call of star wars a halos destiny
[2] I thought of an new call of duty name CALL OF DUTY: The road of ARK GIANT
[3] Activation must be destroyed for the sake of video games. Boycott those pieces of shits.
[4] Futuristic˜
[5] 1:09 is that the XM 53
These are a few of the keywords extracted from the cleaned text corpus:
head(label)
[1] "2016" "action" "activis" "actual" "alreadi" "also"
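I am trying to build a correlation matrix that shows how each word is associated with the texts; that matrix will eventually be used to build a network graph for community detection.
For now, though, my goal is to create a table or matrix like the one below: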
                                               star destroyed duty
Call of star wars a halo destiny                  1         0    0
Activation must be destroyed for the sake ....    0         1    0
I thought of new call of duty star                1         0    1
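The same applies to all the data texts [13281 rows in total] and labels [202 words in total].
Answer 0 (score: 0)
Assuming you only care about whether a label is present (1) or absent (0) in each data text, the following should work for you: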
data <- c('Call of star wars a halo destiny',
'I thought of an new call of duty star',
'Activation must be destroyed for the sake .... ',
'Futuristic˜',
'1:09 is that the XM 53')
label <- c("2016","action","activis","actual", "alreadi","also", "star", "destroyed", "duty")
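# Vectorize grepl over 'pattern' so that every label is matched against every text
vgrepl <- Vectorize(grepl, 'pattern', SIMPLIFY = TRUE)
# Unary + turns the logical matrix into 0/1; tolower() makes the match case insensitive
df <- +(vgrepl(tolower(label), tolower(data)))
rownames(df) <- data  # despite its name, df is a matrix, not a data frame
df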
                                               2016 action activis actual alreadi also star destroyed duty
Call of star wars a halo destiny                  0      0       0      0       0    0    1         0    0
I thought of an new call of duty star             0      0       0      0       0    0    1         0    1
Activation must be destroyed for the sake ....    0      0       0      0       0    0    0         1    0
Futuristic˜                                       0      0       0      0       0    0    0         0    0
1:09 is that the XM 53                            0      0       0      0       0    0    0         0    0
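Two caveats about the approach above, with a rough sketch of how they might be handled. First, grepl does substring matching, so "star" would also match "stars" or "started"; if whole-word matches are wanted, word boundaries can be added to the pattern. Second, the question ultimately asks for a correlation matrix between keywords, which can be computed from the 0/1 matrix with cor(). The helper name presence_matrix and the shortened label vector are my own, purely for illustration, and the sketch assumes the labels contain no regex metacharacters.

```r
# Toy data adapted from the answer above (the real data has 13281 rows)
data <- c('Call of star wars a halo destiny',
          'I thought of an new call of duty star',
          'Activation must be destroyed for the sake .... ',
          'Futuristic',
          '1:09 is that the XM 53')
label <- c("2016", "action", "star", "destroyed", "duty")

# Hypothetical helper (not part of the original answer):
# 1 if the word occurs as a whole word in the text, 0 otherwise.
presence_matrix <- function(texts, words) {
  texts <- as.character(texts)  # the question says the text is stored as a factor
  hits <- vapply(words,
                 function(w) grepl(paste0("\\b", w, "\\b"), texts,
                                   ignore.case = TRUE, perl = TRUE),
                 logical(length(texts)))
  # Rebuild explicitly so a single text still yields a 1-row matrix
  matrix(as.integer(hits), nrow = length(texts),
         dimnames = list(texts, words))
}

m <- presence_matrix(data, label)

# Keywords that never (or always) occur have zero variance, so their
# correlation is undefined; drop them before calling cor().
keep <- apply(m, 2, function(col) length(unique(col)) > 1)
keyword_cor <- cor(m[, keep, drop = FALSE])
keyword_cor
```

The result is a keywords-by-keywords matrix of Pearson correlations on the 0/1 indicators. For the network step mentioned in the question, one option would be to threshold keyword_cor into an adjacency matrix and hand it to a graph package such as igraph, but the choice of threshold and community-detection method is left open here.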