Question

I have a column named "document" with the following structure:

1994_post_elections_Mandela.txt
1994_pre_elections_deKlerk.txt
1995_Mandela.txt
1996_Mandela.txt
1997_Mandela.txt
1998_Mandela.txt
1999_post_elections_Mandela.txt
1999_pre_elections_Mandela.txt
2000_Mbeki.txt

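What I want to do is extract the president's name, which always comes right before ".txt", and put it into a new column. I don't mind if the remaining characters/digits end up together in another column. For various reasons (which I won't go into here), I need to use the separate function from the tidyr package.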

I tried to follow the answer here, but my attempt failed...

speech_gamma_exp<-speech_gamma %>%
separate(document, into=c("col1", "col2"), sep = "(\\_)(?!_*\\_)")

Answer 1

We can use base R's gsub:

> df1$President <- gsub(".*_(\\w+)\\.txt$", "\\1", df1$V1)
> df1
                               V1 President
1 1994_post_elections_Mandela.txt   Mandela
2  1994_pre_elections_deKlerk.txt   deKlerk
3                1995_Mandela.txt   Mandela
4                1996_Mandela.txt   Mandela
5                1997_Mandela.txt   Mandela
6                1998_Mandela.txt   Mandela
7 1999_post_elections_Mandela.txt   Mandela
8  1999_pre_elections_Mandela.txt   Mandela
9                  2000_Mbeki.txt     Mbeki

This assumes your data.frame is:

df1 <- read.table(text="1994_post_elections_Mandela.txt
1994_pre_elections_deKlerk.txt
1995_Mandela.txt
1996_Mandela.txt
1997_Mandela.txt
1998_Mandela.txt
1999_post_elections_Mandela.txt
1999_pre_elections_Mandela.txt
2000_Mbeki.txt", header=FALSE, stringsAsFactors=FALSE)
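One caveat with the substitution approach: when the pattern does not match, sub/gsub returns the input string unchanged rather than a missing value. The following is a minimal sketch of a guarded variant, not part of the original answer. The `files` vector and the non-matching entry "notes.md" are made up for illustration.

```r
# Hypothetical input: two file names from the question plus one that does not fit the pattern
files <- c("1994_post_elections_Mandela.txt", "2000_Mbeki.txt", "notes.md")

# Greedy ".*_" consumes everything up to the last underscore,
# "(\\w+)" captures the name, and "\\.txt$" anchors on the extension
pattern <- ".*_(\\w+)\\.txt$"

# Return NA where the pattern does not match instead of echoing the input
president <- ifelse(grepl(pattern, files), sub(pattern, "\\1", files), NA_character_)
print(president)
```

Without the grepl guard, the third element would come back as "notes.md", which is easy to mistake for a valid name.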

Answer 2

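Since you said you have to use separate, here is one way to do it. We can use str_count to count the _ separators and find the maximum number of pieces, then use that to set the into argument of separate. Combined with fill = "left", this guarantees that the last piece (President.txt) ends up in the last column. You can then drop the .txt and the other columns as needed.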

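However, I think it is much simpler to mutate the president's name straight into a column with str_extract, as in the second example. It uses lookarounds to match the letters that are preceded by _ and followed by .txt.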

library(tidyverse)
tbl <- tibble(
  document = c(
    "1994_post_elections_Mandela.txt",
    "1994_pre_elections_deKlerk.txt",
    "1995_Mandela.txt",
    "1996_Mandela.txt",
    "1997_Mandela.txt",
    "1998_Mandela.txt",
    "1999_post_elections_Mandela.txt",
    "1999_pre_elections_Mandela.txt",
    "2000_Mbeki.txt"
  )
)

tbl %>%
  separate(
    col = document,
    into = str_c(
      "col",
      1:(as.integer(max(str_count(.$document, "_"))) + 1)
    ),
    sep = "_",
    fill = "left"
  )
#> # A tibble: 9 x 4
#>   col1  col2  col3      col4       
#>   <chr> <chr> <chr>     <chr>      
#> 1 1994  post  elections Mandela.txt
#> 2 1994  pre   elections deKlerk.txt
#> 3 <NA>  <NA>  1995      Mandela.txt
#> 4 <NA>  <NA>  1996      Mandela.txt
#> 5 <NA>  <NA>  1997      Mandela.txt
#> 6 <NA>  <NA>  1998      Mandela.txt
#> 7 1999  post  elections Mandela.txt
#> 8 1999  pre   elections Mandela.txt
#> 9 <NA>  <NA>  2000      Mbeki.txt
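The cleanup step mentioned in the text (dropping .txt and the extra columns) is not shown above. A minimal sketch of it, assuming a shortened version of the example data; the names `n_cols` and `result` are introduced here for illustration only.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Shortened copy of the example data
tbl <- tibble(document = c(
  "1994_post_elections_Mandela.txt",
  "1995_Mandela.txt",
  "2000_Mbeki.txt"
))

# Maximum number of pieces = maximum number of underscores + 1
n_cols <- max(str_count(tbl$document, "_")) + 1

result <- tbl %>%
  separate(
    col = document,
    # Name the last piece "president" so it can be selected directly
    into = c(str_c("col", seq_len(n_cols - 1)), "president"),
    sep = "_",
    fill = "left",
    remove = FALSE
  ) %>%
  # Strip the extension and keep only the columns of interest
  mutate(president = str_remove(president, "\\.txt$")) %>%
  select(document, president)

print(result)
```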

tbl %>%
  mutate(president = str_extract(document, "(?<=_)[:alpha:]*?(?=\\.txt)"))
#> # A tibble: 9 x 2
#>   document                        president
#>   <chr>                           <chr>    
#> 1 1994_post_elections_Mandela.txt Mandela  
#> 2 1994_pre_elections_deKlerk.txt  deKlerk  
#> 3 1995_Mandela.txt                Mandela  
#> 4 1996_Mandela.txt                Mandela  
#> 5 1997_Mandela.txt                Mandela  
#> 6 1998_Mandela.txt                Mandela  
#> 7 1999_post_elections_Mandela.txt Mandela  
#> 8 1999_pre_elections_Mandela.txt  Mandela  
#> 9 2000_Mbeki.txt                  Mbeki

Created on 2018-09-10 by the reprex package (v0.2.0).
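Another option that stays with separate is to split only on the last underscore, so the number of columns does not need to be computed. This is a sketch rather than part of the original answer; it assumes that the regex engine behind `sep` in your tidyr version supports lookaheads, and the column name `prefix` is made up for illustration.

```r
library(tidyr)

tbl <- data.frame(
  document = c(
    "1994_post_elections_Mandela.txt",
    "1994_pre_elections_deKlerk.txt",
    "2000_Mbeki.txt"
  ),
  stringsAsFactors = FALSE
)

# "_(?=[^_]+$)" matches an underscore only if no other underscore follows it,
# so each string is split exactly once, at its last underscore
split_tbl <- separate(
  tbl,
  document,
  into = c("prefix", "president"),
  sep = "_(?=[^_]+$)",
  remove = FALSE
)

# Drop the file extension from the extracted name
split_tbl$president <- sub("\\.txt$", "", split_tbl$president)
print(split_tbl)
```

Everything before the last underscore lands in `prefix`, which matches the question's remark that the remaining characters may go into another column.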

Answer 3

This is very simple with gsub or stringr/stringi. After jumping through some hoops, I managed to come up with a solution that relies on tidyr::separate alone:

#### Create Data ####
pres_vector <- c("1994_post_elections_Mandela.txt", "1994_pre_elections_deKlerk.txt",
     "1995_Mandela.txt", "1996_Mandela.txt", "1997_Mandela.txt", "1998_Mandela.txt",
     "1999_post_elections_Mandela.txt", "1999_pre_elections_Mandela.txt", "2000_Mbeki.txt")

#### Libraries ####
library(stringi)
library(tidyr)

#### Solution ####    
pres_vector %>% stri_reverse %>% data.frame(x = .) %>% 
    separate(x, c("file_ext", "pres")) %>% { .[["pres"]] } %>% stri_reverse -> pres_names

print(pres_names)
[1] "Mandela" "deKlerk" "Mandela" "Mandela" "Mandela" "Mandela" "Mandela" "Mandela" "Mbeki"

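This works because of the pattern in the strings. By default, separate splits on non-alphanumeric characters. The last part of each string is the file extension, and the second-to-last part is the president's name.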

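Reversing the string therefore puts the (reversed) file extension first and the (reversed) president's name second. separate then lets us extract the first two parts, and subsetting keeps only the president's name. Finally, reversing that substring (the president's name spelled backwards) gives us the president's name in its normal order.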
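The steps described above can be traced on a single string. This is a minimal sketch; the variable names are introduced for illustration, and `extra = "drop"` is added here so that the pieces beyond the first two are discarded without the warning that the default setting emits.

```r
library(stringi)
library(tidyr)

x <- "1994_pre_elections_deKlerk.txt"

# Step 1: reverse the string so the extension comes first
rev_x <- stri_reverse(x)

# Step 2: split on non-alphanumeric characters (the default sep) and keep
# only the first two pieces: reversed extension and reversed name
parts <- separate(
  data.frame(x = rev_x, stringsAsFactors = FALSE),
  x,
  into = c("file_ext", "pres"),
  extra = "drop"
)

# Step 3: reverse the name back into its normal order
pres_name <- stri_reverse(parts$pres)
print(pres_name)  # "deKlerk"
```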

Answer 4

I prefer to use stringr for this kind of task (gsub would work too).
