Question

I have two data tables, shown below:
bigrams

 w1w2           freq   w1          w2      
 common names   1      common      names  
 department of  4      department  of  
 family name    6      family      name  

bigrams = setDT(structure(list(w1w2 = c("common names", "department of", "family name"
), freq = c(1L, 4L, 6L), w1 = c("common", "department", "family"
), w2 = c("names", "of", "name")), .Names = c("w1w2", "freq", 
"w1", "w2"), row.names = c(NA, -3L), class = "data.frame"))

unigrams

w1            freq  
common        2  
department    3  
family        4  
name          5  
names         1  
of            9  

unigrams = setDT(structure(list(w1 = c("common", "department", "family", "name", 
"names", "of"), freq = c(2L, 3L, 4L, 5L, 1L, 9L)), .Names = c("w1", 
"freq"), row.names = c(NA, -6L), class = "data.frame"))

Desired output

 w1w2           freq   w1          w2      w1freq    w2freq  
 common names   1      common      names   2         1
 department of  4      department  of      3         9
 family name    6      family      name    4         5

What I have done so far

setkey(bigrams, w1)
setkey(unigrams, w1)
result <- bigrams[unigrams]

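This gives me the frequency of w1 in the i.freq column, but when I try to do the same for w2, the i.freq column is updated to reflect the frequency of w2.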

How can I get the frequencies of w1 and w2 in separate columns?

Note: I have already seen the solutions to "data.table Lookup value and translate" and "Modify column of a data.table based on another column and add the new column".

Answer 1

You can do two joins. As of data.table v1.9.6, the on= argument lets you join on columns with different names.

library(data.table)

bigrams[unigrams, on=c("w1"), nomatch = 0][unigrams, on=c(w2 = "w1"), nomatch = 0]

            w1w2 freq         w1    w2 i.freq i.freq.1
1:   family name    6     family  name      4        5
2:  common names    1     common names      2        1
3: department of    4 department    of      3        9

Answer 2

You can do this with some reshaping.

library(dplyr)
library(tidyr)

bigrams %>%
  rename(w1w2_string = w1w2,
         w1w2_freq = freq) %>%
  gather(order, string,
         w1, w2) %>%
  left_join(unigrams %>%
              rename(string = w1),
            by = "string") %>%
  gather(type, value,
         string, freq) %>%
  unite(order_type, order, type) %>%
  spread(order_type, value)

Edit: explanation

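The first thing to observe is that bigrams actually holds information about three different units of analysis: one bigram and two unigrams. Convert it to long form so that the unit of analysis is the unigram. Then the other unigram data can be merged in. Now note that each unigram row carries two different pieces of information: the unigram's frequency and the unigram's text. Convert to long form again so that the unit of analysis is a single piece of information about a unigram. Finally, spread so that each new column is one kind of information about a unigram.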
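
The same long, join, wide idea can also be sketched with data.table's melt() and dcast(). This is a variant, not the code from the answer above: it skips the second gather step by casting two value columns at once, which keeps the frequencies as integers. The column names position, word and word_freq are illustrative choices, and the sample tables are rebuilt from the question.

```r
library(data.table)

# Rebuild the sample tables from the question
bigrams <- data.table(
  w1w2 = c("common names", "department of", "family name"),
  freq = c(1L, 4L, 6L),
  w1   = c("common", "department", "family"),
  w2   = c("names", "of", "name")
)
unigrams <- data.table(
  w1   = c("common", "department", "family", "name", "names", "of"),
  freq = c(2L, 3L, 4L, 5L, 1L, 9L)
)

# Step 1: long form, one row per (bigram, word position)
long <- melt(bigrams,
             id.vars       = c("w1w2", "freq"),
             measure.vars  = c("w1", "w2"),
             variable.name = "position",
             value.name    = "word")

# Step 2: look up the unigram frequency of every word
long[unigrams, on = c(word = "w1"), word_freq := i.freq]

# Step 3: back to wide form, one text and one frequency column per position
wide <- dcast(long, w1w2 + freq ~ position,
              value.var = c("word", "word_freq"))

print(wide)
```

The result has one row per bigram with the columns word_w1, word_w2, word_freq_w1 and word_freq_w2, which can be renamed to match the desired output.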

Look up data in a data table and add it to a new column
