I am hoping for some help extracting the last N words from a column in a data.table and assigning them to a new column.
library(data.table)
test <- data.table(original = c('the green shirt totally brings out your eyes',
                                'ford focus hatchback'))
The original data.table looks like this:
                                       original
1: the green shirt totally brings out your eyes
2:                         ford focus hatchback
I would like to put the last 5 words (at most) of each row into a new column, so that the output looks like this:
                                       original                    extracted
1: the green shirt totally brings out your eyes totally brings out your eyes
2:                         ford focus hatchback         ford focus hatchback
I have tried:
test <- test[, extracted := paste0(tail(strsplit(original, ' ')[[1]], 5)
, collapse = ' ')]
It almost works, except that the first value in the 'extracted' column is repeated down the whole new column:
                                       original                    extracted
1: the green shirt totally brings out your eyes totally brings out your eyes
2:                         ford focus hatchback totally brings out your eyes
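For the life of me, I can't figure this out. I tried the 'word' function from 'stringr', which gave me the last word, but I can't seem to count backwards from the end.
Any help would be greatly appreciated!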
Answer 0 (score: 4)
I would probably use
n = 5
patt = sprintf("\\w+( \\w+){0,%d}$", n-1)
library(stringi)
test[, ext := stri_extract(original, regex = patt)]
                                       original                          ext
1: the green shirt totally brings out your eyes totally brings out your eyes
2:                         ford focus hatchback         ford focus hatchback
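To see what the pattern is doing, here is a minimal sketch that builds it and applies it to a plain character vector. It uses base R's regexpr() and regmatches() as a stand-in for stri_extract(), so it is an illustration of the regex rather than the exact call above. The names x, m, and ext_base are made up for this example.

```r
# Same two strings as in the question, as a plain character vector
x <- c("the green shirt totally brings out your eyes",
       "ford focus hatchback")

n <- 5
# One word, followed by 0 to n-1 more " word" groups, anchored at the end
patt <- sprintf("\\w+( \\w+){0,%d}$", n - 1)

# regexpr() finds the leftmost match per string; because of the trailing $,
# that is the longest run of up to n words that ends the string
m <- regexpr(patt, x)
ext_base <- regmatches(x, m)
ext_base
```

The `{0,4}` range is what makes "at most 5" work: a string with only three words still matches, it just uses fewer repetitions.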
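Comments:
If you set n=0, this will break, but there is probably no good reason to do that.
If n differs across rows (e.g. n=3:4), this is vectorized, so that works too.
@eddi provided a base R analogue (for fixed n):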
test[, ext := sub('.*?(\\w+( \\w+){4})$', '\\1', original)]
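The two caveats above can be sketched in base R as well. This assumes the same sprintf() pattern as before. Because regexpr() accepts only one pattern at a time, the sketch pairs each string with its own pattern through mapply(). That is a substitute for the vectorization stringi does on its own, not what stri_extract() does internally. The names x, bad, patts, and ext_rows are hypothetical.

```r
x <- c("the green shirt totally brings out your eyes",
       "ford focus hatchback")

# n = 0 produces the quantifier {0,-1}, which is not a valid range
bad <- sprintf("\\w+( \\w+){0,%d}$", 0L - 1L)

# A different n per row: sprintf() is vectorized, so we get one pattern per row
n <- 3:4
patts <- sprintf("\\w+( \\w+){0,%d}$", n - 1L)

# Pair each string with its own pattern (base R stand-in for stringi's
# built-in vectorization over the regex argument)
ext_rows <- mapply(function(s, p) regmatches(s, regexpr(p, s)),
                   x, patts, USE.NAMES = FALSE)
ext_rows
```

With n=3:4 the first row keeps its last three words and the second row, which only has three words, is returned whole.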
Answer 1 (score: 3)
A base R solution:
test[, extracted := sapply(strsplit(original, '\\s+'), function(v) paste(tail(v, 5L), collapse = ' '))]
##                                        original                    extracted
## 1: the green shirt totally brings out your eyes totally brings out your eyes
## 2:                         ford focus hatchback         ford focus hatchback
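For anyone wondering why the attempt in the question repeated the first row, here is a small sketch on a plain vector. strsplit() returns a list with one element per row, and `[[1]]` keeps only the first of them, so the whole expression collapses to a single string that `:=` then recycles down the column. Applying the same tail/paste step to every list element fixes it. The names x, parts, first_only, and per_row are made up for this example.

```r
x <- c("the green shirt totally brings out your eyes",
       "ford focus hatchback")

# One character vector of words per input string
parts <- strsplit(x, "\\s+")

# What the question's code computes: only the first row's words are used,
# so the result has length 1 and gets recycled across all rows
first_only <- paste0(tail(parts[[1]], 5), collapse = " ")

# Applying the same step to each element gives one result per row
per_row <- sapply(parts, function(v) paste(tail(v, 5L), collapse = " "))
per_row
```

tail() simply returns the whole vector when it has fewer than 5 elements, which is why the short row comes back unchanged.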