Question

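I created a data.frame that holds my words and their frequencies. Now I want to run findAssocs for every row of that data frame, but I can't get my code to work. Any help is appreciated.

Here is a sample of my data.frame term.df: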

term.df <- data.frame(word = names(v),freq=v)

word freq
ounce 8917
pack 6724
count 4992
organic 3696
frozen 2534
free 1728

I also created a TermDocumentMatrix tdm, and the following code works as expected:

findAssocs(tdm, 'frozen', 0.20)

I would like to append the output of findAssocs as a new column.

This is the code I tried:

library(dplyr)
library(tm)
library(pbapply)

#I would like to append all findings in a new column

res <- merge(do.call(rbind.data.frame, pblapply(term.df, findAssocs(tdm, term.df$word , 0.18))),
              term.df[, c("word")], by.x="list.q", by.y="word", all.x=TRUE)

Edit: as for the output, a single call like the one above gives me something like this:

$yogurt
  greek ellenos     fat chobani  dannon    fage yoplait  nonfat wallaby 
   0.62    0.36    0.25    0.24    0.24    0.24    0.24    0.22    0.20

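I was hoping to add a column (ASSOC) to my original table that holds the results as comma-separated name:value tuples, but I would be happy with anything that works.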

Answer 1

I think the simplest structure is a nested list:

lapply(seq_len(nrow(text.df)), function(i) {
  list(word=text.df$word[i],
       freq=text.df$freq[i],
       assoc=findAssocs(tdm, as.character(text.df$word[i]), 0.7)[[1]])
})
# [[1]]
# [[1]]$word
# [1] "oil"
# 
# [[1]]$freq
# [1] 3
# 
# [[1]]$assoc
#      15.8      opec   clearly      late    trying       who    winter  analysts 
#      0.87      0.87      0.80      0.80      0.80      0.80      0.80      0.79 
#      said   meeting     above emergency    market     fixed      that    prices 
#      0.78      0.77      0.76      0.75      0.75      0.73      0.73      0.72 
# agreement    buyers 
#      0.71      0.70 
# 
# 
# [[2]]
# [[2]]$word
# [1] "opec"
# 
# [[2]]$freq
# [1] 2
# 
# [[2]]$assoc
#    meeting  emergency        oil       15.8   analysts     buyers      above 
#       0.88       0.87       0.87       0.85       0.85       0.83       0.82 
#       said    ability       they    prices.  agreement        but    clearly 
#       0.82       0.80       0.80       0.79       0.76       0.74       0.74 
#  december.   however,       late production       sell     trying        who 
#       0.74       0.74       0.74       0.74       0.74       0.74       0.74 
#     winter      quota       that    through        bpd     market 
#       0.74       0.73       0.73       0.73       0.70       0.70 
# 
# 
# [[3]]
# [[3]]$word
# [1] "xyz"
# 
# [[3]]$freq
# [1] 1
# 
# [[3]]$assoc
# numeric(0)

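In my experience, this is easier to work with than a nested string, because you can still get at the word associations for each row of the original text.df object by accessing the corresponding element of the output list.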
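
As a rough illustration of that kind of access, here is a minimal sketch. It assumes the result of the lapply() call has been stored in a variable, hypothetically named assoc_list here, and it uses a small hand-made stand-in with the same shape so that it runs without tm or the corpus:

```r
# Hand-made stand-in for the nested list built by the lapply() call
# (assoc_list is a hypothetical name; the values are illustrative only).
assoc_list <- list(
  list(word = "oil", freq = 3, assoc = c(opec = 0.87, clearly = 0.80)),
  list(word = "xyz", freq = 1, assoc = numeric(0))
)

# Associations for the first row, still a named numeric vector.
first_assoc <- assoc_list[[1]]$assoc

# Look up a single association by term name.
oil_opec <- first_assoc[["opec"]]

# Count the associated terms found for each row.
n_assoc <- vapply(assoc_list, function(r) length(r$assoc), integer(1))
```

Because each assoc element stays numeric, filtering or sorting it later needs no string parsing.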

If you really want to keep the data frame structure, you can quite easily convert the findAssocs output to a string representation, for example using toJSON:

library(RJSONIO)
text.df$assoc <- sapply(text.df$word, function(x) toJSON(findAssocs(tdm, x, 0.7)[[1]], collapse=""))
text.df
#   word freq
# 1  oil    3
# 2 opec    2
# 3  xyz    1
#                                                                                                                                                                                                                                                                                                                                                                                                                                                                        assoc
# 1 { "15.8":   0.87,"opec":   0.87,"clearly":    0.8,"late":    0.8,"trying":    0.8,"who":    0.8,"winter":    0.8,"analysts":   0.79,"said":   0.78,"meeting":   0.77,"above":   0.76,"emergency":   0.75,"market":   0.75,"fixed":   0.73,"that":   0.73,"prices":   0.72,"agreement":   0.71,"buyers":    0.7 }
# 2 { "meeting":   0.88,"emergency":   0.87,"oil":   0.87,"15.8":   0.85,"analysts":   0.85,"buyers":   0.83,"above":   0.82,"said":   0.82,"ability":    0.8,"they":    0.8,"prices.":   0.79,"agreement":   0.76,"but":   0.74,"clearly":   0.74,"december.":   0.74,"however,":   0.74,"late":   0.74,"production":   0.74,"sell":   0.74,"trying":   0.74,"who":   0.74,"winter":   0.74,"quota":   0.73,"that":   0.73,"through":   0.73,"bpd":    0.7,"market":    0.7 }
# 3 [  ]
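
If the comma-separated name:value format from the question is preferred over JSON, base R's paste can build it. The sketch below is one possible approach, not part of the original answer; format_assoc is a hypothetical helper name, and the input vector is a hand-made stand-in for one element of a findAssocs result:

```r
# Hypothetical helper: collapse a named numeric vector into "name:value" pairs.
format_assoc <- function(assoc) {
  # A term with no associations above the cutoff yields numeric(0).
  if (length(assoc) == 0) return("")
  paste(names(assoc), assoc, sep = ":", collapse = ", ")
}

# Stand-in for one element of a findAssocs() result (values are illustrative).
yogurt <- c(greek = 0.62, ellenos = 0.36, fat = 0.25)

pairs_string <- format_assoc(yogurt)      # "greek:0.62, ellenos:0.36, fat:0.25"
empty_string <- format_assoc(numeric(0))  # ""
```

With tm loaded and tdm defined as in the data section, the column could then be filled with something like `text.df$assoc <- vapply(text.df$word, function(w) format_assoc(findAssocs(tdm, w, 0.7)[[1]]), character(1))`.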

Data:

library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
(text.df <- data.frame(word=c("oil", "opec", "xyz"), freq=c(3, 2, 1), stringsAsFactors=FALSE))
#   word freq
# 1  oil    3
# 2 opec    2
# 3  xyz    1
