Look up data in a data.table and add it to a new column

Time: 2016-04-13 03:29:02

Tags: r data.table

I have two data tables, shown below.

bigrams

 w1w2           freq   w1          w2      
 common names   1      common      names  
 department of  4      department  of  
 family name    6      family      name  

bigrams = setDT(structure(list(w1w2 = c("common names", "department of", "family name"
), freq = c(1L, 4L, 6L), w1 = c("common", "department", "family"
), w2 = c("names", "of", "name")), .Names = c("w1w2", "freq", 
"w1", "w2"), row.names = c(NA, -3L), class = "data.frame"))

unigrams

w1            freq  
common        2  
department    3  
family        4  
name          5  
names         1  
of            9  

unigrams = setDT(structure(list(w1 = c("common", "department", "family", "name", 
"names", "of"), freq = c(2L, 3L, 4L, 5L, 1L, 9L)), .Names = c("w1", 
"freq"), row.names = c(NA, -6L), class = "data.frame"))

Desired output

 w1w2           freq   w1          w2      w1freq    w2freq  
 common names   1      common      names   2         1
 department of  4      department  of      3         9
 family name    6      family      name    4         5

What I have done so far

setkey(bigrams, w1)
setkey(unigrams, w1)
result <- bigrams[unigrams]

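This gives me an i.freq column holding the frequency of w1, but when I try to do the same for w2, the i.freq column is overwritten with the frequency of w2.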

How can I get the frequencies of w1 and w2 in separate columns?

Note: I have already seen the solutions to "data.table Lookup value and translate" and "Modify column of a data.table based on another column and add the new column".

2 answers:

Answer 0: (score: 3)

You can do two joins. As of data.table v1.9.6, the on= argument lets you join on columns that have different names in the two tables.

library(data.table)

bigrams[unigrams, on=c("w1"), nomatch = 0][unigrams, on=c(w2 = "w1"), nomatch = 0]

            w1w2 freq         w1    w2 i.freq i.freq.1
1:   family name    6     family  name      4        5
2:  common names    1     common names      2        1
3: department of    4 department    of      3        9
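If the goal is to get exactly the column names from the desired output and to keep the original row order, an update join is one possible variation on the two joins above. This is a sketch rather than part of the original answer. It assumes data.table 1.9.6 or later, rebuilds the two tables so the block is self-contained, and takes the names w1freq and w2freq from the question's desired output.

```r
library(data.table)

# Same data as in the question, rebuilt here so the example runs on its own
bigrams <- data.table(
  w1w2 = c("common names", "department of", "family name"),
  freq = c(1L, 4L, 6L),
  w1   = c("common", "department", "family"),
  w2   = c("names", "of", "name")
)
unigrams <- data.table(
  w1   = c("common", "department", "family", "name", "names", "of"),
  freq = c(2L, 3L, 4L, 5L, 1L, 9L)
)

# Look up the unigram frequency of the first word; i.freq is unigrams' freq
bigrams[unigrams, on = "w1", w1freq := i.freq]

# Look up the second word: match bigrams$w2 against unigrams$w1
bigrams[unigrams, on = c(w2 = "w1"), w2freq := i.freq]

print(bigrams)
```

Because := adds the columns by reference, bigrams itself is modified, and rows with no match in unigrams would simply be left as NA.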

Answer 1: (score: 2)

You can do this with some reshaping.

library(dplyr)
library(tidyr)

bigrams %>%
  rename(w1w2_string = w1w2,
         w1w2_freq = freq) %>%
  gather(order, string,
         w1, w2) %>%
  left_join(unigrams %>%
              rename(string = w1) ) %>%
  gather(type, value,
         string, freq) %>%
  unite(order_type, order, type) %>%
  spread(order_type, value)

Edit: explanation

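The first observation is that bigrams actually holds information about three different units of analysis: one bigram and two unigrams. Convert it to long form so that the unit of analysis is a single unigram; the other unigram data can then be merged in. Now notice that each unigram row carries two different pieces of information: the frequency of the unigram and its text. Convert to long form again so that the unit of analysis is one piece of information about a unigram. Finally, spread the data so that each new column holds one type of information about one unigram.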
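The same reshape-then-join idea can also be sketched with pivot_longer() and pivot_wider(). This is an illustrative variant, not the answer's own code. It assumes tidyr 1.0.0 or later, and the resulting column names (string_w1, freq_w1, and so on) come from pivot_wider's default naming rather than from the question.

```r
library(dplyr)
library(tidyr)

# Same data as in the question, rebuilt as tibbles so the example runs on its own
bigrams <- tibble(
  w1w2 = c("common names", "department of", "family name"),
  freq = c(1L, 4L, 6L),
  w1   = c("common", "department", "family"),
  w2   = c("names", "of", "name")
)
unigrams <- tibble(
  w1   = c("common", "department", "family", "name", "names", "of"),
  freq = c(2L, 3L, 4L, 5L, 1L, 9L)
)

# Step 1: one row per unigram, remembering whether it was w1 or w2
long <- bigrams %>%
  rename(w1w2_freq = freq) %>%
  pivot_longer(c(w1, w2), names_to = "order", values_to = "string") %>%
  # Step 2: merge in the unigram frequency for each word
  left_join(rename(unigrams, string = w1), by = "string")

# Step 3: back to one row per bigram, one column per (information, position) pair
wide <- long %>%
  pivot_wider(names_from = order, values_from = c(string, freq))

print(wide)
```

Since the text and the counts are never stacked into a single column here, the frequencies should stay integers in the result.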