Question

My file names are

Hughson.George_54_4
Ifran.Dean_51_3
Houston.Amanda_49_6

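I want to create a data frame in which each row holds the information extracted from one file name, in the form Author, Volume, Issue.

I can extract the name and the volume, but I can't seem to get the issue number. Using the stringr package I did the following, which gives me _4 instead of just 4.

[^a-z](?:[^_]+_){0}([^_ ]+$)

How can I fix this?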

Answer 1

You are looking for:

read.table(text = string, sep ='_', col.names = c('Author', 'Volume', 'Issue'))

          Author Volume Issue
1 Hughson.George     54     4
2     Ifran.Dean     51     3
3 Houston.Amanda     49     6

where

string <- c("Hughson.George_54_4", "Ifran.Dean_51_3", "Houston.Amanda_49_6")

Edit: alternatively, you may be looking for:

 read.table(text = string, sep ='_', fill=TRUE)
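
The fill = TRUE argument tells read.table to pad rows that have fewer fields than the others with NA instead of stopping with an error. A minimal sketch, assuming hypothetical file names (invented for illustration, not from the question) where one name lacks an issue number:

```r
# Hypothetical file names; the second one has no issue number.
files <- c("Doe.Jane_12_10", "Roe.Rich_7")

# fill = TRUE pads the short row with NA instead of stopping with an error.
df <- read.table(text = files, sep = "_", fill = TRUE,
                 col.names = c("Author", "Volume", "Issue"),
                 stringsAsFactors = FALSE)
print(df)
```

Whether padding with NA is what you want depends on how irregular the real file names are.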

Answer 2

If the issue number is always the single last digit, we can extract it with base R

as.numeric(substring(str1, nchar(str1)))

or with sub

as.numeric(sub(".*_", "", str1))
#[1] 4 3 6
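
Note that the two approaches differ once the issue number has more than one digit: substring(str1, nchar(str1)) keeps only the final character, while sub(".*_", "", str1) removes everything up to the last underscore. A small sketch, using a made-up file name (not from the question) to show the difference:

```r
# Hypothetical file name with a two-digit issue number.
f <- "Doe.Jane_12_10"

last_char  <- as.numeric(substring(f, nchar(f)))  # only the final character
after_last <- as.numeric(sub(".*_", "", f))       # everything after the last "_"

print(c(last_char = last_char, after_last = after_last))
```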

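If we need to split it into separate columns, one option is separate from the tidyverse (tidyr), which splits the column into several columns at the delimiter (_); with convert = TRUE, the new columns are converted to appropriate types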

library(tidyverse)
# tibble() replaces the deprecated data_frame()
tibble(col1 = str1) %>%
    separate(col1, into = c("Author", "Volume", "Issue"), sep = "_", convert = TRUE)
# A tibble: 3 x 3
#  Author         Volume Issue
#  <chr>           <int> <int>
#1 Hughson.George     54     4
#2 Ifran.Dean         51     3
#3 Houston.Amanda     49     6

Data

str1 <- c("Hughson.George_54_4", "Ifran.Dean_51_3", "Houston.Amanda_49_6")

Answer 3

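The [^a-z] part of your regex matches the _ in front of the last digit. Just use something that matches the digits at the end:

library(stringr)  # provides str_extract()
x1 <- c("Hughson.George_54_4", "Ifran.Dean_51_3", "Houston.Amanda_49_6")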

str_extract(x1,"([^_]+$)")
[1] "4" "3" "6"

str_extract(x1,"\\d+$")
[1] "4" "3" "6"
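
To see why the original pattern returns _4, it can help to look at the whole match and the capture group separately. The sketch below uses base R's regexec and regmatches (rather than stringr, so it runs without extra packages); it assumes the question's pattern was applied with a function that returns the whole match, such as str_extract.

```r
x1 <- c("Hughson.George_54_4", "Ifran.Dean_51_3", "Houston.Amanda_49_6")
pattern <- "[^a-z](?:[^_]+_){0}([^_ ]+$)"

# Each list element holds the whole match first, then capture group 1.
m <- regmatches(x1, regexec(pattern, x1, perl = TRUE))

whole_match <- sapply(m, `[`, 1)  # includes the "_" consumed by [^a-z]
group_1     <- sapply(m, `[`, 2)  # only the part inside the parentheses

print(whole_match)
print(group_1)
```

With stringr, str_match(x1, pattern)[, 2] should give the same capture group, since str_match returns the whole match in the first column and the groups after it.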

That said, your overall goal looks like a job for strsplit:

data.frame(do.call("rbind",strsplit(sub("\\."," ",x1),"_")))
              X1 X2 X3
1 Hughson George 54  4
2     Ifran Dean 51  3
3 Houston Amanda 49  6
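
The result above has default column names, and every column holds text rather than numbers. A possible follow-up, sketched in base R with the column names taken from the question (Author, Volume, Issue), names the columns and converts the numeric ones:

```r
x1 <- c("Hughson.George_54_4", "Ifran.Dean_51_3", "Houston.Amanda_49_6")

# Split on the literal "_" and stack the pieces into a character matrix.
parts <- do.call(rbind, strsplit(x1, "_", fixed = TRUE))

df <- data.frame(
  Author = sub(".", " ", parts[, 1], fixed = TRUE),  # "Hughson.George" -> "Hughson George"
  Volume = as.integer(parts[, 2]),
  Issue  = as.integer(parts[, 3]),
  stringsAsFactors = FALSE
)
str(df)
```

This assumes every file name has exactly three underscore-separated parts; names with a different number of parts would need extra handling.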
