R: Extracting the last N words from a data.table column

Time: 2016-04-20 18:57:13

Tags: r data.table stringr

I'd like some help extracting the last N words from a column in a data.table and assigning them to a new column.

 library(data.table)

 test <- data.table(original = c('the green shirt totally brings out your eyes',
                                 'ford focus hatchback'))

The original data.table looks like this:

original
1: the green shirt totally brings out your eyes
2: ford focus hatchback

I want to put (at most) the last 5 words of each row into a new column, so the output should look like this:

original                                        extracted
1: the green shirt totally brings out your eyes totally brings out your eyes
2: ford focus hatchback                         ford focus hatchback

I tried:

  test <- test[, extracted := paste0(tail(strsplit(original, ' ')[[1]], 5)
                                   , collapse = ' ')]

It almost works, except that the first value in the "extracted" column is repeated down the entire new column:

original                                        extracted
1: the green shirt totally brings out your eyes totally brings out your eyes
2: ford focus hatchback                         totally brings out your eyes

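For the life of me, I can't figure this out. I tried the `word` function from `stringr`, which gave me the last word, but I can't seem to count backwards from the end.

Any help would be greatly appreciated!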

2 Answers:

Answer 0 (score: 4)

I would probably use

n = 5
patt = sprintf("\\w+( \\w+){0,%d}$", n-1)

library(stringi)
test[, ext := stri_extract(original, regex = patt)]

                                       original                          ext
1: the green shirt totally brings out your eyes totally brings out your eyes
2:                         ford focus hatchback         ford focus hatchback

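Comments:

  • This breaks if you set n = 0, but there is probably no good reason to do that.
  • It is vectorized, so it also works if n differs across rows (e.g. n = 3:4).
  • @eddi provided a base R analogue (for fixed n):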

    test[, ext := sub('.*?(\\w+( \\w+){4})$', '\\1', original)]
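
To make the first two comments concrete, here is a minimal sketch on a plain character vector rather than a data.table column. The helper name `last_n_pattern` is hypothetical (it is not part of the answer), and the `stopifnot()` guard is just one possible way to reject n = 0.

```r
library(stringi)

# Hypothetical helper (not from the answer): build one regex per value of n.
last_n_pattern <- function(n) {
  stopifnot(all(n >= 1))  # n = 0 would produce the quantifier {0,-1}
  sprintf("\\w+( \\w+){0,%d}$", n - 1L)
}

x <- c("the green shirt totally brings out your eyes",
       "ford focus hatchback")

# One n per element: last 3 words of the first string, last 4 of the second.
# The second string has only 3 words, so all of it is returned.
res <- stri_extract(x, regex = last_n_pattern(3:4))
print(res)
```

Inside a data.table the same call would presumably go on the right-hand side of `:=`, as in the answer above.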
    

Answer 1 (score: 3)

Base R solution:

test[, extracted := sapply(strsplit(original, '\\s+'), function(v) paste(tail(v, 5L), collapse = ' '))]
##                                        original                    extracted
## 1: the green shirt totally brings out your eyes totally brings out your eyes
## 2:                         ford focus hatchback         ford focus hatchback
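
For anyone wondering why the attempt in the question repeats the first value: `strsplit()` returns a list with one element per row, and `[[1]]` keeps only the first of them. The sketch below shows the difference on a plain character vector, using base R only. It uses `vapply()` in place of `sapply()` purely to pin the return type, which is my substitution and not what the answer uses.

```r
x <- c("the green shirt totally brings out your eyes",
       "ford focus hatchback")

# strsplit() returns a list: one character vector of words per input string.
words <- strsplit(x, "\\s+")

# What the question's attempt computes: [[1]] keeps only the first string's
# words, so the result is a single string, which := then recycles to every row.
first_only <- paste0(tail(words[[1]], 5), collapse = " ")

# Per-row version: apply tail() and paste() to each list element in turn.
per_row <- vapply(words,
                  function(v) paste(tail(v, 5L), collapse = " "),
                  character(1))

print(first_only)
print(per_row)
```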