Question

I'd like help extracting the last N words from a column in a data.table and assigning them to a new column.

 library(data.table)
 test <- data.table(original = c('the green shirt totally brings out your eyes'
                                , 'ford focus hatchback'))

The original data.table looks like this:

original
1: the green shirt totally brings out your eyes
2: ford focus hatchback

I'd like to put the last 5 words (at most) of each row into a new column, so the output looks like:

original                                        extracted
1: the green shirt totally brings out your eyes totally brings out your eyes
2: ford focus hatchback                         ford focus hatchback

I tried:

  test <- test[, extracted := paste0(tail(strsplit(original, ' ')[[1]], 5)
                                   , collapse = ' ')]

It almost works, except that the first value in the "extracted" column is repeated throughout the new column:

original                                        extracted
1: the green shirt totally brings out your eyes totally brings out your eyes
2: ford focus hatchback                         totally brings out your eyes

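For the life of me, I can't figure this out. I tried the 'word' function from 'stringr', which gave me the last word, but I can't seem to count backwards from the end.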

Any help would be greatly appreciated!

Answer 1

I would probably use

n = 5
patt = sprintf("\\w+( \\w+){0,%d}$", n-1)

library(stringi)
test[, ext := stri_extract(original, regex = patt)]

                                       original                          ext
1: the green shirt totally brings out your eyes totally brings out your eyes
2:                         ford focus hatchback         ford focus hatchback

Comments:

This breaks if you set n=0, but there is probably no good reason to do that.
It is vectorized, in case n differs across rows (e.g. n=3:4).
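
To illustrate the vectorization comment, here is a minimal sketch on a plain character vector rather than the data.table. The names `x` and `res` are only for illustration; it assumes stringi is installed and relies on `stri_extract` pairing each string with its own pattern.

```r
library(stringi)

# Same two strings as in the question
x <- c('the green shirt totally brings out your eyes',
       'ford focus hatchback')

# One n per row: last 3 words for the first string, last 4 for the second
n <- 3:4
patt <- sprintf("\\w+( \\w+){0,%d}$", n - 1)

# str and regex are matched up element by element
res <- stri_extract(x, regex = patt)
res
```

The second string has only three words, so asking for four simply returns all of it.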

@eddi provided a base analogue (for fixed n):

test[, ext := sub('.*?(\\w+( \\w+){4})$', '\\1', original)]
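
If n should not be hard-coded, the same base pattern can probably be built with `sprintf`, as in the stringi version. This is a sketch rather than what @eddi posted: the names `x`, `patt` and `res` are illustrative, and `perl = TRUE` is my addition so that the lazy `.*?` is handled by PCRE.

```r
# Same two strings as in the question
x <- c('the green shirt totally brings out your eyes',
       'ford focus hatchback')

n <- 3
# Capture exactly n words at the end of the string; .*? lazily eats the rest
patt <- sprintf("^.*?(\\w+( \\w+){%d})$", n - 1)

# Strings with fewer than n words do not match and are returned unchanged
res <- sub(patt, "\\1", x, perl = TRUE)
res
```

Because `sub` leaves non-matching input untouched, rows shorter than n words come back whole, which is the "at most n words" behaviour the question asks for.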

Answer 2

Base R solution:

test[,extracted:=sapply(strsplit(original,'\\s+'),function(v) paste(collapse=' ',tail(v,5L)))];
##                                        original                    extracted
## 1: the green shirt totally brings out your eyes totally brings out your eyes
## 2:                         ford focus hatchback         ford focus hatchback
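
For what it's worth, the repeated value in the question comes from the `[[1]]`: `strsplit` returns one vector of words per row, and `[[1]]` keeps only the first row's words, so a single string is built and then recycled down the column. A small sketch on a plain vector (the names `words`, `first_only` and `per_row` are just for illustration):

```r
original <- c('the green shirt totally brings out your eyes',
              'ford focus hatchback')

# strsplit returns a list with one character vector of words per element
words <- strsplit(original, '\\s+')

# What the attempt in the question computes: only the first element's words,
# giving a single string that := then recycles to every row
first_only <- paste(tail(words[[1]], 5L), collapse = ' ')

# Applying tail/paste to each list element gives one result per row
per_row <- vapply(words, function(v) paste(tail(v, 5L), collapse = ' '),
                  character(1))

first_only
per_row
```

`vapply` is used here instead of `sapply` only so that the result is guaranteed to be a character vector; either should work inside `:=`.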
