我创建了一个data.frame,用于保存我的单词及其频率。现在我想对我的框架的每一行做一个findAssocs,但我无法让我的代码工作。任何帮助表示赞赏。
以下是我的data.frame term.df
的示例
term.df <- data.frame(word = names(v),freq=v)
&#13;
word freq
ounce 8917
pack 6724
count 4992
organic 3696
frozen 2534
free 1728
&#13;
我创建了一个TermDocumentMatrix tdm ,以下代码按预期工作。
findAssocs(tdm, 'frozen', 0.20)
&#13;
我想将findAssocs的输出附加为新列
这是我尝试过的代码:
library(dplyr)
library(tm)
library(pbapply)
#I would like to append all findings in a new column
res <- merge(do.call(rbind.data.frame, pblapply(term.df, findAssocs(tdm, term.df$word , 0.18))),
term.df[, c("word")], by.x="list.q", by.y="word", all.x=TRUE)
&#13;
编辑: 至于输出。上面的单个陈述让我得到这样的东西。
$yogurt
greek ellenos fat chobani dannon fage yoplait nonfat wallaby
0.62 0.36 0.25 0.24 0.24 0.24 0.24 0.22 0.20
&#13;
我希望可以在我的原始表(ASSOC)中添加一个列,并将结果作为逗号分隔的名称:值元组,但我真的很开心。
答案 0 :(得分:1)
我认为最简单的结构是嵌套列表:
lapply(seq_len(nrow(text.df)), function(i) {
list(word=text.df$word[i],
freq=text.df$freq[i],
assoc=findAssocs(tdm, as.character(text.df$word[i]), 0.7)[[1]])
})
# [[1]]
# [[1]]$word
# [1] "oil"
#
# [[1]]$freq
# [1] 3
#
# [[1]]$assoc
# 15.8 opec clearly late trying who winter analysts
# 0.87 0.87 0.80 0.80 0.80 0.80 0.80 0.79
# said meeting above emergency market fixed that prices
# 0.78 0.77 0.76 0.75 0.75 0.73 0.73 0.72
# agreement buyers
# 0.71 0.70
#
#
# [[2]]
# [[2]]$word
# [1] "opec"
#
# [[2]]$freq
# [1] 2
#
# [[2]]$assoc
# meeting emergency oil 15.8 analysts buyers above
# 0.88 0.87 0.87 0.85 0.85 0.83 0.82
# said ability they prices. agreement but clearly
# 0.82 0.80 0.80 0.79 0.76 0.74 0.74
# december. however, late production sell trying who
# 0.74 0.74 0.74 0.74 0.74 0.74 0.74
# winter quota that through bpd market
# 0.74 0.73 0.73 0.73 0.70 0.70
#
#
# [[3]]
# [[3]]$word
# [1] "xyz"
#
# [[3]]$freq
# [1] 1
#
# [[3]]$assoc
# numeric(0)
根据我的经验,这比嵌套字符串更容易处理,因为您仍然可以通过访问输出列表中的相应元素来访问原始text.df
对象的每一行的单词关联。
如果您真的想保留数据框架结构,那么您可以非常轻松地将findAssocs
输出转换为字符串表示形式,例如使用toJSON
:
library(RJSONIO)
text.df$assoc <- sapply(text.df$word, function(x) toJSON(findAssocs(tdm, x, 0.7)[[1]], collapse=""))
text.df
# word freq
# 1 oil 3
# 2 opec 2
# 3 xyz 1
# assoc
# 1 { "15.8": 0.87,"opec": 0.87,"clearly": 0.8,"late": 0.8,"trying": 0.8,"who": 0.8,"winter": 0.8,"analysts": 0.79,"said": 0.78,"meeting": 0.77,"above": 0.76,"emergency": 0.75,"market": 0.75,"fixed": 0.73,"that": 0.73,"prices": 0.72,"agreement": 0.71,"buyers": 0.7 }
# 2 { "meeting": 0.88,"emergency": 0.87,"oil": 0.87,"15.8": 0.85,"analysts": 0.85,"buyers": 0.83,"above": 0.82,"said": 0.82,"ability": 0.8,"they": 0.8,"prices.": 0.79,"agreement": 0.76,"but": 0.74,"clearly": 0.74,"december.": 0.74,"however,": 0.74,"late": 0.74,"production": 0.74,"sell": 0.74,"trying": 0.74,"who": 0.74,"winter": 0.74,"quota": 0.73,"that": 0.73,"through": 0.73,"bpd": 0.7,"market": 0.7 }
# 3 [ ]
数据:
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
(text.df <- data.frame(word=c("oil", "opec", "xyz"), freq=c(3, 2, 1), stringsAsFactors=FALSE))
# word freq
# 1 oil 3
# 2 opec 2
# 3 xyz 1