Extracting strings in data.table

Time: 2018-03-11 17:55:52

Tags: r data.table

I have the following data.table named D.

                   ngram
1          in_the_years
2  the_years_thereafter
3 years_thereafter_most
4        he_wasn't_home
5           how_are_you
6    thereafter_most_of

I need to add some variables.

1. queryWord (the requirement is to extract the first 2 words). Here is my code:

D[,queryWord:=strsplit(ngram,"_[^_]+$")[[1]],by=ngram]

                   ngram        queryWord
1          in_the_years           in_the
2  the_years_thereafter        the_years
3 years_thereafter_most years_thereafter
4        he_wasn't_home        he_wasn't
5           how_are_you          how_are
6    thereafter_most_of  thereafter_most

2. predict. The requirement is to extract the last word. Here is the desired output:

                   ngram        queryWord    predict
1          in_the_years           in_the      years
2  the_years_thereafter        the_years thereafter
3 years_thereafter_most years_thereafter       most
4        he_wasn't_home        he_wasn't       home
5           how_are_you          how_are        you
6    thereafter_most_of  thereafter_most         of

For this I wrote the following function:

getLastTerm<-function(x){
  y<-strsplit(x,"_")
  y[[1]][length(y[[1]])]
}

getLastTerm("in_the_years") returns "years", but it does not work inside the data.table object D.

I need help.

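2 Answers:

Answer 0: (score: 0)

Your getLastTerm function only picks the first element of the list returned by strsplit, so every row gets the last word of the first ngram. Try the following instead.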

getLastTerm <- function(x){
  y <- strsplit(x,"_")

  for (i in seq_along(y)) {  # loop over every element instead of a hard-coded 1:6
    x[i] <- y[[i]][length(y[[i]])]
  }
  x
}


D$new <- getLastTerm(D$ngram)
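The loop above can also be written without an explicit index. The following is only a sketch in base R, not part of the original answer: the helper name get_last_term and its sep argument are made up for illustration, and it assumes every input string is non-empty.

```r
# Hypothetical helper (name and `sep` argument are assumptions, not from the thread).
get_last_term <- function(x, sep = "_") {
  # fixed = TRUE treats `sep` as a literal string rather than a regex
  parts <- strsplit(x, sep, fixed = TRUE)
  # take the last piece of each split; vapply enforces one string per input
  vapply(parts, function(p) p[length(p)], character(1))
}

ngram <- c("in_the_years", "the_years_thereafter", "he_wasn't_home")
get_last_term(ngram)
# [1] "years"      "thereafter" "home"
```

Because the result is a plain character vector of the same length as the input, it should be usable as D[, predict := get_last_term(ngram)] as well.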

Answer 1: (score: 0)

Before getting to the actual problem, you can simplify your first step:

# option 1
D[, queryWord := strsplit(ngram,"_[^_]+$")][]
# option 2
D[, queryWord := sub('(.*)_.*$','\\1',ngram)][]
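One difference between the two options is worth seeing on a plain vector: strsplit always returns a list with one element per input string, while sub returns a character vector. This is a small base R sketch with made-up variable names, not code from the original answer.

```r
ngram <- c("in_the_years", "how_are_you")

# option 1 style: split off the trailing "_word"; the result is a list
split_res <- strsplit(ngram, "_[^_]+$")
is.list(split_res)   # TRUE
unlist(split_res)    # "in_the" "how_are"

# option 2 style: the result is already a character vector
sub_res <- sub("(.*)_.*$", "\\1", ngram)
is.character(sub_res)  # TRUE
```

This is presumably why the code in the question needed [[1]] together with by=ngram. If a plain character column is wanted with option 1, wrapping the call in unlist() is one way to get it.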

To get the predict column, you don't need to write a special function. You can combine strsplit, lapply and last:

D[, predict := lapply(strsplit(D$ngram,"_"), last)][]

Or, as a simpler solution, use only sub:

D[, predict := sub('.*_(.*)$','\\1',ngram)][]
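Both sub patterns rely on the leading .* being greedy, so the underscore that gets matched is always the last one in the string. A quick check on a plain vector (the "single" entry is an extra test case that is not in the original data) shows this, and also that a string without any underscore is returned unchanged:

```r
ngram <- c("in_the_years", "he_wasn't_home", "single")

# everything before the last underscore
sub("(.*)_.*$", "\\1", ngram)
# [1] "in_the"    "he_wasn't" "single"

# everything after the last underscore
sub(".*_(.*)$", "\\1", ngram)
# [1] "years"  "home"   "single"
```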

Both approaches give the following final result:

> D
                   ngram        queryWord    predict
1:          in_the_years           in_the      years
2:  the_years_thereafter        the_years thereafter
3: years_thereafter_most years_thereafter       most
4:        he_wasn't_home        he_wasn't       home
5:           how_are_you          how_are        you
6:    thereafter_most_of  thereafter_most         of

Used data:

D <- fread("ngram
in_the_years
the_years_thereafter
years_thereafter_most
he_wasn't_home
how_are_you
thereafter_most_of", header = TRUE)