I have the following data.table named D.
  ngram
1 in_the_years
2 the_years_thereafter
3 years_thereafter_most
4 he_wasn't_home
5 how_are_you
6 thereafter_most_of
I need to add some variables.
1. queryWord (the requirement is to extract the first 2 words)
Here is my code:
D[,queryWord:=strsplit(ngram,"_[^_]+$")[[1]],by=ngram]
which gives:
  ngram                 queryWord
1 in_the_years          in_the
2 the_years_thereafter  the_years
3 years_thereafter_most years_thereafter
4 he_wasn't_home        he_wasn't
5 how_are_you           how_are
6 thereafter_most_of    thereafter_most
2. predict (the requirement is to extract the last word)
This is the desired output:
  ngram                 queryWord        predict
1 in_the_years          in_the           years
2 the_years_thereafter  the_years        thereafter
3 years_thereafter_most years_thereafter most
4 he_wasn't_home        he_wasn't        home
5 how_are_you           how_are          you
6 thereafter_most_of    thereafter_most  of
For this I wrote the following function:
getLastTerm<-function(x){
  y<-strsplit(x,"_")
  y[[1]][length(y[[1]])]
}
getLastTerm("in_the_years") returns "years", but the function does not work inside the data.table object D.
I need help.
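Answer 0 (score: 0)
Your getLastTerm function only looks at the first element of the list returned by strsplit. Try the following instead.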
getLastTerm <- function(x){
  y <- strsplit(x,"_")
  # loop over every element of the list rather than a hard-coded 1:6
  for (i in seq_along(y)) {
    x[i] <- y[[i]][length(y[[i]])]
  }
  x
}
D$new <- getLastTerm(D$ngram)
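To make the diagnosis concrete, here is a minimal base R sketch of why indexing with [[1]] returns a single value for a whole column, together with a loop-free variant. The helper name get_last_term_vec is hypothetical (it does not appear in the question), and the sketch assumes every string is non-empty.

```r
# Example n-grams taken from the question
ngram <- c("in_the_years", "the_years_thereafter", "he_wasn't_home")

# strsplit() returns a list with one element per input string
parts <- strsplit(ngram, "_", fixed = TRUE)

# Indexing with [[1]] only ever inspects the first string,
# no matter how many strings were passed in
first_only <- parts[[1]][length(parts[[1]])]

# Hypothetical vectorised helper: take the last piece of every element.
# Assumes each input string is non-empty.
get_last_term_vec <- function(x) {
  pieces <- strsplit(x, "_", fixed = TRUE)
  vapply(pieces, function(p) p[length(p)], character(1))
}

last_terms <- get_last_term_vec(ngram)
print(first_only)
print(last_terms)
```

Because vapply declares character(1) as the expected result for each element, it returns a plain character vector of the same length as its input, which is the shape a new column needs.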
Answer 1 (score: 0)
Before tackling the actual problem, you can simplify the first step:
# option 1
D[, queryWord := strsplit(ngram,"_[^_]+$")][]
# option 2
D[, queryWord := sub('(.*)_.*$','\\1',ngram)][]
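As a rough illustration of what the two patterns do, the sketch below applies them to a plain character vector outside of data.table. The variable names opt1 and opt2 are only illustrative. One difference worth noticing is that strsplit returns a list, while sub returns a character vector.

```r
# A few n-grams from the question, as a plain character vector
ngram <- c("in_the_years", "he_wasn't_home", "thereafter_most_of")

# Option 1 pattern: split on the trailing "_word"; the result is a list
opt1 <- strsplit(ngram, "_[^_]+$")

# Option 2 pattern: keep only what precedes the last underscore;
# the result is a character vector
opt2 <- sub("(.*)_.*$", "\\1", ngram)

print(unlist(opt1))
print(opt2)
```

Both patterns yield the same first two words for these inputs. The sub variant may be preferable when a plain character column is wanted.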
To get the predict column, you don't need to write a special function. Use a combination of strsplit, lapply and last:
D[, predict := lapply(strsplit(D$ngram,"_"), last)][]
Or, as a simpler solution, use only sub:
D[, predict := sub('.*_(.*)$','\\1',ngram)][]
Both approaches give the following final result:
> D
                   ngram        queryWord    predict
1:          in_the_years           in_the      years
2:  the_years_thereafter        the_years thereafter
3: years_thereafter_most years_thereafter       most
4:        he_wasn't_home        he_wasn't       home
5:           how_are_you          how_are        you
6:    thereafter_most_of  thereafter_most         of
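One way to sanity-check the two derived columns is to glue them back together and compare with the original n-gram. The sketch below does this on a plain character vector in base R. The names query_word, predict_word, rebuilt and all_match are illustrative and not part of the original answer.

```r
# The six n-grams from the question
ngram <- c("in_the_years", "the_years_thereafter", "years_thereafter_most",
           "he_wasn't_home", "how_are_you", "thereafter_most_of")

# Same sub() patterns as in the answer: text before / after the last underscore
query_word   <- sub("(.*)_.*$", "\\1", ngram)
predict_word <- sub(".*_(.*)$", "\\1", ngram)

# Re-joining the two parts should reproduce every n-gram exactly
rebuilt   <- paste(query_word, predict_word, sep = "_")
all_match <- identical(rebuilt, ngram)
print(all_match)
```

A string with no underscore at all is returned unchanged by both sub calls, so it would fail this comparison, which makes the check a simple way to spot malformed n-grams.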
Data used:
D <- fread("ngram
in_the_years
the_years_thereafter
years_thereafter_most
he_wasn't_home
how_are_you
thereafter_most_of", header = TRUE)