Return a variable based on matches between groups in a data.table

Asked: 2018-09-04 18:06:03

Tags: r data.table

I am new to data.table and don't fully understand it yet. Suppose I have the following table of ngrams:

require(data.table)
DT<-data.table(
  ngram=c("how","how are","how are you","how are you doing"),
  Freq=c(15000,1500,150,15),
  n=c(1,2,3,4),
  w1=c(37,37,37,37),
  w2=c(NA,13,13,13),
  w3=c(NA,NA,7,7),
  w4=c(NA,NA,NA,95)
)

> DT
               ngram  Freq n w1 w2 w3 w4
1:               how 15000 1 37 NA NA NA
2:           how are  1500 2 37 13 NA NA
3:       how are you   150 3 37 13  7 NA
4: how are you doing    15 4 37 13  7 95

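Here n is the type of ngram (1 = unigram, 2 = bigram, and so on), w1 to w4 are the integer indices of the words in each ngram, and Freq is the number of times the ngram occurs in the data.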

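How can I get the frequency of one ngram based on a match between a word in that ngram and a word in another ngram? For the bigram (n = 2) 'how are', how can I get the frequency of the unigram 'how' by matching w1 of 'how are' with w1 of 'how'? Or, for the trigram 'how are you', how can I get the frequency of the bigram 'how are' by matching w1 + w2 of 'how are you' with w1 + w2 of 'how are'?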

For example, I have tried:

DT[n==2,B:=Freq[match(w1[n==1],w1[n==2])]]

DT[n==2,B:=Freq[which(w1[n==1]==w1[n==2])]]

But I only get NA:

               ngram  Freq n w1 w2 w3 w4  B
1:               how 15000 1 37 NA NA NA NA
2:           how are  1500 2 37 13 NA NA NA
3:       how are you   150 3 37 13  7 NA NA
4: how are you doing    15 4 37 13  7 95 NA

I would like to get:

               ngram  Freq n w1 w2 w3 w4     B
1:               how 15000 1 37 NA NA NA    NA
2:           how are  1500 2 37 13 NA NA 15000
3:       how are you   150 3 37 13  7 NA  1500
4: how are you doing    15 4 37 13  7 95   150

Any help would be much appreciated!

3 answers:

Answer 0: (score: 1)

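You can work group by group (by n), find the 'w' columns to use as join keys, and then join on those columns against the rows whose n is smaller than that of the current group: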

DT[, B := 
    {
        k <- as.integer(.BY) - 1L
        if (k > 0) {
            nm <- head(grep("^w", names(.SD)[!is.na(.SD)], value=TRUE), k)
            DT[n < .BY][.SD, x.Freq, on=nm]
        } else NA_real_
    },
    by=.(n)]

Output:

               ngram  Freq n w1 w2 w3 w4     B
1:               how 15000 1 37 NA NA NA    NA
2:           how are  1500 2 37 13 NA NA 15000
3:       how are you   150 3 37 13  7 NA  1500
4: how are you doing    15 4 37 13  7 95   150
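
To make the join inside the grouped call easier to follow, here is a minimal sketch of what happens for a single group, n == 3. It rebuilds the toy table from the question; the names trigram, nm and prefix_freq are only illustrative and are not part of the answer's code.

library(data.table)

# Same toy table as in the question.
DT <- data.table(
  ngram = c("how", "how are", "how are you", "how are you doing"),
  Freq  = c(15000, 1500, 150, 15),
  n     = c(1, 2, 3, 4),
  w1    = c(37, 37, 37, 37),
  w2    = c(NA, 13, 13, 13),
  w3    = c(NA, NA, 7, 7),
  w4    = c(NA, NA, NA, 95)
)

# The trigram row plays the role of .SD for the group n == 3.
trigram <- DT[n == 3]

# Join keys: the first n - 1 = 2 word columns.
nm <- c("w1", "w2")

# Look the trigram up among the shorter ngrams. x.Freq is Freq taken from
# the table being searched (DT[n < 3]), not from the trigram row itself.
prefix_freq <- DT[n < 3][trigram, x.Freq, on = nm]

Only the bigram 'how are' shares both w1 and w2 with the trigram, so its Freq is the value returned. The unigram 'how' is not matched here because its w2 is NA while the trigram's w2 is 13.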

Code tidied up after Frank's comment:

DT[, B := 
    {
        if (n > 1L) {
            nm <- head(grep("^w", names(.SD)[!is.na(.SD)], value=TRUE), n-1L)
            DT[n==.BY$n-1L][.SD, x.Freq, on=nm]
        }
    },
    by=.(n)]

Answer 1: (score: 1)

A variant of chinsoon's answer that overwrites the nth word with NA before joining:

wcols = paste0("w", 1:4)    
DT[, v := 
  DT[n == .BY$n - 1L][replace(.SD, .BY$n, NA_real_), on=wcols, x.Freq]
, by=n, .SDcols=wcols]

This approach is more concise but possibly less efficient, since I am joining on all the columns rather than only the first n-1.
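
The trick appears to rely on data.table joins treating NA in a join column as matching NA. The following sketch, for the single case n == 3, is an illustration under that assumption; bigrams, probe and matched are hypothetical names introduced here.

library(data.table)

# Same toy table as in the question.
DT <- data.table(
  ngram = c("how", "how are", "how are you", "how are you doing"),
  Freq  = c(15000, 1500, 150, 15),
  n     = c(1, 2, 3, 4),
  w1    = c(37, 37, 37, 37),
  w2    = c(NA, 13, 13, 13),
  w3    = c(NA, NA, 7, 7),
  w4    = c(NA, NA, NA, 95)
)

wcols <- paste0("w", 1:4)

# Candidate rows to look up: the bigrams.
bigrams <- DT[n == 2]

# The trigram's word columns, with its 3rd (last) word blanked out so that
# it looks like a bigram padded with NA.
probe <- DT[n == 3, wcols, with = FALSE]
probe[, w3 := NA_real_]

# Join on all four word columns; the NA entries in w3 and w4 are expected
# to match the NA entries of the bigram row.
matched <- bigrams[probe, x.Freq, on = wcols]

After blanking w3, the probe row is (37, 13, NA, NA), which is the same as the word columns of 'how are', so the join should return that bigram's Freq.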

Answer 2: (score: 0)

I keyed on n, assigned B on the subset of DT with n == 2, and reversed the order of the arguments to match:

setkey(DT,n)
DT[.(2),B:=DT[,Freq[match(w1[n==2L],w1[n==1L],nomatch=NA)]]]

> DT
               ngram  Freq n w1 w2 w3 w4     B
1:               how 15000 1 37 NA NA NA    NA
2:           how are  1500 2 37 13 NA NA 15000
3:       how are you   150 3 37 13  7 NA  1500
4: how are you doing    15 4 37 13  7 95   150

Fast on large datasets.
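
A likely reason the attempts in the question return NA is that, inside DT[n == 2, ...], only the bigram rows are visible, so w1[n == 1] is empty. The sketch below contrasts that with the same lookup evaluated on the whole table. The names visible_unigrams, pos and freq are illustrative, and indexing Freq[n == 1] rather than the full Freq column is a small change from the code above so that the result does not depend on the unigrams being the first rows.

library(data.table)

# Same toy table as in the question.
DT <- data.table(
  ngram = c("how", "how are", "how are you", "how are you doing"),
  Freq  = c(15000, 1500, 150, 15),
  n     = c(1, 2, 3, 4),
  w1    = c(37, 37, 37, 37),
  w2    = c(NA, 13, 13, 13),
  w3    = c(NA, NA, 7, 7),
  w4    = c(NA, NA, NA, 95)
)

# With i = (n == 2), j only sees the bigram rows, so the unigram subset
# is empty and there is nothing to match against.
visible_unigrams <- DT[n == 2, length(w1[n == 1])]

# Evaluated on the whole table, both subsets exist. match(x, table) gives,
# for each bigram, the position of its w1 among the unigrams' w1.
pos <- DT[, match(w1[n == 2], w1[n == 1])]

# Use that position to pick the frequency from the unigram rows.
freq <- DT[, Freq[n == 1][pos]]

This only covers the bigram-to-unigram case on w1; matching on several word columns at once would need a combined key or a join as in the other answers.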