Question

I have a tibble like this:

library("tidyverse")
tib <- tibble(x = c("lemon", "yellow, banana", "red, big, apple"))

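I want to create two new columns named description and fruit, using separate to extract the last word after the comma (if there is a comma; otherwise I just want to copy the word in the cell).

So far, I have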

tib %>%
    separate(x, ", ", into = c("description", "fruit"), remove = FALSE)

But that does not do exactly what I want, producing:

# A tibble: 3 x 3
  x               description fruit 
  <chr>           <chr>       <chr> 
1 lemon           lemon       NA    
2 yellow, banana  yellow      banana
3 red, big, apple red         big   
Warning messages:
1: Expected 2 pieces. Additional pieces discarded in 1 rows [3]. 
2: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [1].

My desired output is:

  x               description fruit 
1 lemon           NA          lemon    
2 yellow, banana  yellow      banana
3 red, big, apple red, big    apple

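Can someone point out what I am missing?

EDIT

It is not strictly necessary to use separate to achieve this. mutate could be used as well, and such solutions are equally appreciated!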

Answer 1

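It may be better to use extract. Here we can use capture groups to capture the characters as groups. It is easier to start from the end ($) and work backwards: the word at the end (\\w+) is captured, preceded by an optional , and an optional whitespace character (\\s), and all the other characters go into the first capture group ((.*?)):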

library(tidyr)
library(dplyr)
tib %>%
   extract(x, into = c("description", "fruit"), remove = FALSE, '(.*?),?\\s?(\\w+$)')
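To see what that pattern captures, here is a minimal base-R sketch (not part of the original answer) that applies the same regular expression with regexec and regmatches. The variable names pattern, matches and parts are only illustrative. Note that for a single-word input the first group is an empty string rather than NA, so an na_if step may still be needed if NA is required.

```r
x <- c("lemon", "yellow, banana", "red, big, apple")

# Same pattern as in the extract() call; perl = TRUE enables the lazy quantifier
pattern <- "(.*?),?\\s?(\\w+$)"
matches <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Each list element holds: full match, capture group 1, capture group 2
parts <- do.call(rbind, matches)
colnames(parts) <- c("match", "description", "fruit")
parts
```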

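Or with separate, by specifying the delimiter as a , followed by a space, or the start of the string (^), followed by a word (\\w+) at the end of the string ($):

tib %>%
   separate(x, into = c("description", 'fruit'), remove = FALSE, '(, |^)(?=\\w+$)') %>%
   mutate(description = na_if(description, ""))

Also, another option with separate is to insert a new delimiter before the last word and then use that as sep.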
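The idea of inserting a new delimiter before the last word could be sketched roughly as follows. This is not the answerer's code: the temporary column tmp and the choice of ";" as the delimiter are assumptions made for illustration, and ";" must not occur anywhere in the data.

```r
library(dplyr)
library(tidyr)

tib <- tibble(x = c("lemon", "yellow, banana", "red, big, apple"))

out <- tib %>%
  # Replace the optional ", " before the final word with ";" (assumed absent from the data)
  mutate(tmp = sub("(, )?(\\w+)$", ";\\2", x)) %>%
  # Every row now contains exactly one ";", so separate() always finds two pieces
  separate(tmp, into = c("description", "fruit"), sep = ";") %>%
  # A single-word row leaves an empty description; turn it into NA
  mutate(description = na_if(description, ""))
out
```

Because each row ends up with exactly one delimiter, separate should not emit the "Expected 2 pieces" warnings shown in the question.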

Answer 2

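You can use a regular expression to get the description: remove the last comma and everything after it. ",[^,]+$" matches a comma followed by one or more non-comma characters up to the end of the string.

To get the fruit, use the word function from the stringr package to get the last word.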

tib %>%
    mutate(desc = if_else(grepl(",", x), sub(",[^,]+$", "", x), NA_character_),
           fruit = stringr::word(x, -1))
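As a quick check of what the two pieces return on the sample data, here is a base-R sketch. The sub call with "^.*\\s" is a stand-in for stringr::word(x, -1) chosen so the snippet needs no packages; it is an assumption that it behaves the same on other inputs.

```r
x <- c("lemon", "yellow, banana", "red, big, apple")

# Drop the last comma and everything after it; rows without a comma get NA
description <- ifelse(grepl(",", x, fixed = TRUE),
                      sub(",[^,]+$", "", x),
                      NA_character_)

# Keep only what follows the last whitespace character (whole string if none)
fruit <- sub("^.*\\s", "", x)

data.frame(x, description, fruit)
```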

Answer 3

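A regex-based solution (like the other two here) is probably better. However, if for any reason you would rather work with a list of words, here is another option.

Split the text into a list of strings. The description is everything except the item at position length(words). The fruit is the last item. If a blank string instead of NA is acceptable, you can remove the na_if bit.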

library(dplyr)

tib <- tibble(x = c("lemon", "yellow, banana", "red, big, apple"))
tib %>%
  mutate(words = strsplit(x, ", "),
         description = purrr::map_chr(words, ~paste(.[-length(.)], collapse = ", ")) %>% na_if(""),
         fruit = purrr::map_chr(words, last))
#> # A tibble: 3 x 4
#>   x               words     description fruit 
#>   <chr>           <list>    <chr>       <chr> 
#> 1 lemon           <chr [1]> <NA>        lemon 
#> 2 yellow, banana  <chr [2]> yellow      banana
#> 3 red, big, apple <chr [3]> red, big    apple

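Obviously, you can drop the words column; I left it in just to show its type.

separate (or similar function) with multiple or no occurrences of the split character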
