I have two data tables, as shown below:
bigrams
w1w2           freq  w1          w2
common names   1     common      names
department of  4     department  of
family name    6     family      name
bigrams = setDT(structure(list(w1w2 = c("common names", "department of", "family name"
), freq = c(1L, 4L, 6L), w1 = c("common", "department", "family"
), w2 = c("names", "of", "name")), .Names = c("w1w2", "freq",
"w1", "w2"), row.names = c(NA, -3L), class = "data.frame"))
unigrams
w1          freq
common      2
department  3
family      4
name        5
names       1
of          9
unigrams = setDT(structure(list(w1 = c("common", "department", "family", "name",
"names", "of"), freq = c(2L, 3L, 4L, 5L, 1L, 9L)), .Names = c("w1",
"freq"), row.names = c(NA, -6L), class = "data.frame"))
Desired output
w1w2           freq  w1          w2     w1freq  w2freq
common names   1     common      names  2       1
department of  4     department  of     3       9
family name    6     family      name   4       5
What I have done so far
setkey(bigrams, w1)
setkey(unigrams, w1)
result <- bigrams[unigrams]
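This gives me an i.freq column for w1, but when I try to do the same for w2, the i.freq column is overwritten with the frequencies from that join instead.
How can I get the frequencies of w1 and w2 in separate columns?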
Note: I have already seen the solutions to "data.table Lookup value and translate" and "Modify column of a data.table based on another column and add the new column".

Answer 0: (score: 3)
You can do two joins. As of data.table v1.9.6, the on= argument lets you join on columns whose names differ between the two tables.
library(data.table)
bigrams[unigrams, on=c("w1"), nomatch = 0][unigrams, on=c(w2 = "w1"), nomatch = 0]
            w1w2 freq         w1    w2 i.freq i.freq.1
1:   family name    6     family  name      4        5
2:  common names    1     common names      2        1
3: department of    4 department    of      3        9
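If the column names from the desired output (w1freq and w2freq) are wanted directly, one possible variation is an update join that adds each lookup column by reference. This is a sketch rather than part of the original answer: it rebuilds the sample tables with data.table() so that it runs on its own, and it assumes a data.table version that supports on= (1.9.6 or later).

```r
library(data.table)

# Sample data rebuilt from the question
bigrams <- data.table(
  w1w2 = c("common names", "department of", "family name"),
  freq = c(1L, 4L, 6L),
  w1   = c("common", "department", "family"),
  w2   = c("names", "of", "name")
)
unigrams <- data.table(
  w1   = c("common", "department", "family", "name", "names", "of"),
  freq = c(2L, 3L, 4L, 5L, 1L, 9L)
)

# Look up the frequency of the first word; i.freq is the freq column of unigrams
bigrams[unigrams, on = "w1", w1freq := i.freq]

# Look up the frequency of the second word by matching bigrams$w2 to unigrams$w1
bigrams[unigrams, on = c(w2 = "w1"), w2freq := i.freq]

print(bigrams)
```

Because := modifies bigrams in place, the original row order is kept and no renaming step is needed afterwards.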
Answer 1: (score: 2)
You can do this with some reshaping.
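library(dplyr)
library(tidyr)
bigrams %>%
  # Rename the bigram-level columns so they do not clash with unigram columns
  rename(w1w2_string = w1w2,
         w1w2_freq = freq) %>%
  # Long form: one row per (bigram, word position) pair
  gather(order, string, w1, w2) %>%
  # Look up each word's unigram frequency
  left_join(unigrams %>% rename(string = w1), by = "string") %>%
  # Long form again: one row per piece of unigram information
  gather(type, value, string, freq) %>%
  # Build column names such as w1_freq and w2_string
  unite(order_type, order, type) %>%
  # Wide form: one column per order_type
  spread(order_type, value)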
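Edit: explanation
The first observation is that bigrams actually contains information about three different units of analysis: one bigram and two unigrams. Convert to long form so that the unit of analysis is the unigram. Then we can merge in the other unigram data. Now note that each unigram row carries two different pieces of information: the unigram's frequency and the unigram's text. Convert to long form again so that the unit of analysis is a single piece of information about a unigram. Finally, spread so that each new column holds one type of information about a unigram.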
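The same reshaping idea can be sketched with tidyr's newer pivot_longer() and pivot_wider(), which are swapped in here for gather() and spread(). This is an illustrative variant, not the answer's own code: the column names position, word, and word_freq are made up for the example, and the sample tables are rebuilt as tibbles so the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Sample data rebuilt from the question
bigrams <- tibble(
  w1w2 = c("common names", "department of", "family name"),
  freq = c(1L, 4L, 6L),
  w1   = c("common", "department", "family"),
  w2   = c("names", "of", "name")
)
unigrams <- tibble(
  w1   = c("common", "department", "family", "name", "names", "of"),
  freq = c(2L, 3L, 4L, 5L, 1L, 9L)
)

# Step 1: long form, one row per (bigram, word position) pair,
# then merge in the unigram frequency for each word
long <- bigrams %>%
  rename(w1w2_freq = freq) %>%
  pivot_longer(c(w1, w2), names_to = "position", values_to = "word") %>%
  left_join(rename(unigrams, word = w1, word_freq = freq), by = "word")

# Step 2: back to wide form, one column per (information type, position)
wide <- long %>%
  pivot_wider(names_from = position, values_from = c(word, word_freq))

print(wide)
```

Passing both word and word_freq to values_from keeps the frequencies as integers, so the second long-format step from the explanation is not needed in this variant.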