How to run findAssocs on each row of a data.frame

Date: 2015-12-14 14:44:03

Tags: r dataframe tm

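I created a data.frame that holds my words and their frequencies. Now I would like to run findAssocs for every row of that frame, but I cannot get my code to work. Any help is appreciated.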

Here is an example of my data.frame, term.df:



term.df <- data.frame(word = names(v),freq=v)

word freq
ounce 8917
pack 6724
count 4992
organic 3696
frozen 2534
free 1728

I created a TermDocumentMatrix, tdm, and the following code works as expected:

findAssocs(tdm, 'frozen', 0.20) 

I would like to append the output of findAssocs as a new column.

This is the code I have tried:

library(dplyr)
library(tm)
library(pbapply)

#I would like to append all findings in a new column

res <- merge(do.call(rbind.data.frame, pblapply(term.df, findAssocs(tdm, term.df$word , 0.18))),
              term.df[, c("word")], by.x="list.q", by.y="word", all.x=TRUE)

Edit: As for the output, the single findAssocs statement above gives me something like this:

$yogurt
  greek ellenos     fat chobani  dannon    fage yoplait  nonfat wallaby 
   0.62    0.36    0.25    0.24    0.24    0.24    0.24    0.22    0.20 

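I was hoping to add a column (ASSOC) to my original table that holds the results as comma-separated name:value tuples, but I am open to anything, really.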

1 Answer:

Answer 0: (score: 1)

I think the simplest structure would be a nested list:

lapply(seq_len(nrow(text.df)), function(i) {
  list(word=text.df$word[i],
       freq=text.df$freq[i],
       assoc=findAssocs(tdm, as.character(text.df$word[i]), 0.7)[[1]])
})
# [[1]]
# [[1]]$word
# [1] "oil"
# 
# [[1]]$freq
# [1] 3
# 
# [[1]]$assoc
#      15.8      opec   clearly      late    trying       who    winter  analysts 
#      0.87      0.87      0.80      0.80      0.80      0.80      0.80      0.79 
#      said   meeting     above emergency    market     fixed      that    prices 
#      0.78      0.77      0.76      0.75      0.75      0.73      0.73      0.72 
# agreement    buyers 
#      0.71      0.70 
# 
# 
# [[2]]
# [[2]]$word
# [1] "opec"
# 
# [[2]]$freq
# [1] 2
# 
# [[2]]$assoc
#    meeting  emergency        oil       15.8   analysts     buyers      above 
#       0.88       0.87       0.87       0.85       0.85       0.83       0.82 
#       said    ability       they    prices.  agreement        but    clearly 
#       0.82       0.80       0.80       0.79       0.76       0.74       0.74 
#  december.   however,       late production       sell     trying        who 
#       0.74       0.74       0.74       0.74       0.74       0.74       0.74 
#     winter      quota       that    through        bpd     market 
#       0.74       0.73       0.73       0.73       0.70       0.70 
# 
# 
# [[3]]
# [[3]]$word
# [1] "xyz"
# 
# [[3]]$freq
# [1] 1
# 
# [[3]]$assoc
# numeric(0)

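In my experience this is easier to work with than nested strings, because you can still get at the word associations for each row of the original text.df object by accessing the corresponding element of the output list.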
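As an illustration of that kind of access, here is a minimal sketch. It assumes the tm package and its bundled crude corpus (the same data listed at the end of this answer); the variable name assoc and the choice to name the list elements by word are illustrative, not part of the original code.

```r
library(tm)

# Same sample data as in the "Data" part of this answer
data("crude")
tdm <- TermDocumentMatrix(crude)
text.df <- data.frame(word = c("oil", "opec", "xyz"), freq = c(3, 2, 1),
                      stringsAsFactors = FALSE)

# One named numeric vector of associations per row of text.df
assoc <- lapply(text.df$word, function(w) findAssocs(tdm, w, 0.7)[[1]])
names(assoc) <- text.df$word    # allow lookup by word as well as by row position

assoc[["opec"]]                 # all associations for a single word
head(names(assoc[[1]]), 3)      # first three associated terms for row 1
lengths(assoc)                  # number of associated terms per row
```

Because the values stay numeric, they can be filtered or sorted later without parsing a string first.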

If you really want to keep the data frame structure, you can fairly easily convert the findAssocs output to a string representation, for example with toJSON:

library(RJSONIO)
text.df$assoc <- sapply(text.df$word, function(x) toJSON(findAssocs(tdm, x, 0.7)[[1]], collapse=""))
text.df
#   word freq
# 1  oil    3
# 2 opec    2
# 3  xyz    1
#                                                                                                                                                                                                                                                                                                                                                                                                                                                                        assoc
# 1 { "15.8":   0.87,"opec":   0.87,"clearly":    0.8,"late":    0.8,"trying":    0.8,"who":    0.8,"winter":    0.8,"analysts":   0.79,"said":   0.78,"meeting":   0.77,"above":   0.76,"emergency":   0.75,"market":   0.75,"fixed":   0.73,"that":   0.73,"prices":   0.72,"agreement":   0.71,"buyers":    0.7 }
# 2 { "meeting":   0.88,"emergency":   0.87,"oil":   0.87,"15.8":   0.85,"analysts":   0.85,"buyers":   0.83,"above":   0.82,"said":   0.82,"ability":    0.8,"they":    0.8,"prices.":   0.79,"agreement":   0.76,"but":   0.74,"clearly":   0.74,"december.":   0.74,"however,":   0.74,"late":   0.74,"production":   0.74,"sell":   0.74,"trying":   0.74,"who":   0.74,"winter":   0.74,"quota":   0.73,"that":   0.73,"through":   0.73,"bpd":    0.7,"market":    0.7 }
# 3 [  ]
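If the comma-separated name:value format from the question is preferred over JSON, base R's paste0 can build it. This is only a sketch: format_assoc is a hypothetical helper name, and returning an empty string for words without associations is an assumption about the desired behaviour. The sample vector is copied from the findAssocs output shown in the question.

```r
# Hypothetical helper: collapse a named numeric vector into "name:value, name:value"
format_assoc <- function(x) {
  # paste0() would return ":" for zero-length input, so handle that case explicitly
  if (length(x) == 0) return("")
  paste0(names(x), ":", x, collapse = ", ")
}

# First three values from the $yogurt output in the question
yogurt <- c(greek = 0.62, ellenos = 0.36, fat = 0.25)

format_assoc(yogurt)       # "greek:0.62, ellenos:0.36, fat:0.25"
format_assoc(numeric(0))   # ""
```

With tdm and text.df defined as in this answer, something like `text.df$assoc <- vapply(text.df$word, function(w) format_assoc(findAssocs(tdm, w, 0.7)[[1]]), character(1))` should then fill the new column.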

Data:

library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
(text.df <- data.frame(word=c("oil", "opec", "xyz"), freq=c(3, 2, 1), stringsAsFactors=FALSE))
#   word freq
# 1  oil    3
# 2 opec    2
# 3  xyz    1