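I have two data frames in R. The first lists a number of keywords and their frequencies (the number of times each was detected in the text). The second shows keyword co-occurrences (for example, when two keywords appear in the same chapter). I want to create an additional column that I will then use as a weight. This third column ("w") would be based on w_(x1,x2) = co-occurrence / (number of times x1 is listed as a keyword + number of times x2 is listed as a keyword). Any idea how I should do this?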
Key words Frequency
art 5
risk 3
trade 1
X1 X2 w_(x1,x2)
art risk 0.125
art trade 0.1666667
I got this code, but it doesn't work, and I'm still an amateur. Maybe there is something simple I'm missing?
e <- df[,"keywords"]$`keywords`%>%
str_split("\r\r\n") %>%
lapply(function(x){expand.grid(x, x, w = (1 / length(x) + length(x)), stringsAsFactors = FALSE)}) %>%
bind_rows
e <- apply(e[, -3], 1, str_sort) %>%
t %>%
data.frame(stringsAsFactors = FALSE) %>%
mutate(w = e$w)
Answer 0 (score: 0)
You can do the calculation with the popular tidyverse packages. Based on your comment, the problem is straightforward.
word_freq <- read.table(header = TRUE, stringsAsFactors = FALSE,
text = "Key_words Frequency
art 5
risk 3
trade 1")
co_occur <- read.table(header = TRUE, stringsAsFactors = FALSE,
text ="X1 X2 w
art risk 0.1250000
art trade 0.1666667")
library(tidyverse)
#
# Get the frequencies for each of X1 and X2, sum, and then compute the new weight
#
chapt_occur <- co_occur %>% left_join(word_freq, by = c(X1 = "Key_words")) %>%
left_join(word_freq, c(X2 = "Key_words"), suffix = c(".X1", ".X2")) %>%
mutate(comb_freq = Frequency.X1+Frequency.X2,
w_X1X2 = w/comb_freq)
This gives the result:
chapt_occur
X1 X2 w Frequency.X1 Frequency.X2 comb_freq w_X1X2
art risk 0.1250000 5 3 8 0.01562500
art trade 0.1666667 5 1 6 0.02777778
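For comparison, the same lookup can be sketched in base R without any joins. This is only a minimal illustration, assuming the same two data frames as above; match() finds the row of each keyword in word_freq, and the names freq_x1 and freq_x2 are introduced here for illustration only.

```r
# Same input data as in the answer above
word_freq <- data.frame(Key_words = c("art", "risk", "trade"),
                        Frequency = c(5, 3, 1),
                        stringsAsFactors = FALSE)
co_occur <- data.frame(X1 = c("art", "art"),
                       X2 = c("risk", "trade"),
                       w  = c(0.1250000, 0.1666667),
                       stringsAsFactors = FALSE)

# match() returns, for each keyword in X1 (or X2), its row position in word_freq
# (freq_x1 and freq_x2 are illustrative names, not part of the original answer)
freq_x1 <- word_freq$Frequency[match(co_occur$X1, word_freq$Key_words)]
freq_x2 <- word_freq$Frequency[match(co_occur$X2, word_freq$Key_words)]

# Divide w by the combined frequency, mirroring the mutate() step
co_occur$w_X1X2 <- co_occur$w / (freq_x1 + freq_x2)
co_occur
```

A keyword that is missing from word_freq would give NA here, which is the same behaviour as left_join().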
The intermediate columns in chapt_occur can be removed with
chapt_occur <- chapt_occur %>% select(-c(Frequency.X1, Frequency.X2, comb_freq))
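This uses only basic tidyverse functions. You can learn more about the tidyverse in many places, including the online book R for Data Science, written by one of its many developers.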
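If the second data frame held raw co-occurrence counts rather than a weight, the formula from the question could be applied directly with the same join pattern. The sketch below rests on an assumption: the column n_co and its values of 1 are hypothetical, chosen because 1 / (5 + 3) = 0.125 and 1 / (5 + 1) = 0.1666667 match the weights shown in the question's table.

```r
library(dplyr)

word_freq <- data.frame(Key_words = c("art", "risk", "trade"),
                        Frequency = c(5, 3, 1),
                        stringsAsFactors = FALSE)

# Hypothetical raw co-occurrence counts (assumed; not given in the question)
co_count <- data.frame(X1 = c("art", "art"),
                       X2 = c("risk", "trade"),
                       n_co = c(1, 1),
                       stringsAsFactors = FALSE)

weights <- co_count %>%
  # Look up the frequency of X1, then of X2; the second join renames the
  # clashing Frequency columns to Frequency.X1 and Frequency.X2
  left_join(word_freq, by = c("X1" = "Key_words")) %>%
  left_join(word_freq, by = c("X2" = "Key_words"), suffix = c(".X1", ".X2")) %>%
  # w_(x1,x2) = co-occurrence / (frequency of x1 + frequency of x2)
  mutate(w = n_co / (Frequency.X1 + Frequency.X2)) %>%
  select(X1, X2, w)

weights
```

Under these assumed counts, w comes out as 0.125 for art/risk and 0.1666667 for art/trade.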