Question

I am using R / RStudio. I have a set of files whose names follow a repeating pattern:

"protein_class_Abcd.txt"
"protein_class_Egh.txt"
"protein_class_Bdc.txt"

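I want to extract the "keywords" {Abcd, Egh, Bdc} from the file names and use them later. The keyword always comes right after "protein_class_", is 3 or 4 letters long, and is followed by .txt.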

Answer 1

You can use a regular expression.

gsub("^protein_class_([a-zA-Z]{3,4})\\.txt$","\\1",x)

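where x is the input vector of 3 or more file names.

Here \\1 refers to the first capture group, that is, the part of the pattern enclosed in (). In this case it is [a-zA-Z]{3,4}, which means we match 3 to 4 letters, a-z or A-Z, between protein_class_ and .txt.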
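
As a minimal sketch of the call above, assuming x holds the three file names from the question (the variable name keywords is only illustrative):

```r
# Assumed input: the three file names from the question
x <- c("protein_class_Abcd.txt",
       "protein_class_Egh.txt",
       "protein_class_Bdc.txt")

# Replace each whole file name with the contents of the first capture group
keywords <- gsub("^protein_class_([a-zA-Z]{3,4})\\.txt$", "\\1", x)
print(keywords)
```

Because the pattern is anchored with ^ and $, a name that does not fit it is returned unchanged rather than as NA, so it may be worth checking the input with grepl first.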

Answer 2

You can do this with sub and a regular expression.

FileNames = c("protein_class_Abcd.txt",
"protein_class_Egh.txt",
"protein_class_Bdc.txt")

sub("protein_class_(.*)\\.txt", "\\1", FileNames)
[1] "Abcd" "Egh"  "Bdc"

Answer 3

You could do...

substr(x, 15, nchar(x)-4)

or, programmatically:

prefix  = "protein_class_"
postfix = ".txt"
substr(x, nchar(prefix)+1, nchar(x)-nchar(postfix))
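
A small sketch of the programmatic variant, assuming x holds the file names from the question. The startsWith/endsWith check is an addition for illustration only: substr cuts purely by position and does not verify that the prefix and suffix are actually there.

```r
# Assumed input: the three file names from the question
x <- c("protein_class_Abcd.txt",
       "protein_class_Egh.txt",
       "protein_class_Bdc.txt")

prefix  <- "protein_class_"
postfix <- ".txt"

# Guard (illustrative): stop early if any name lacks the expected prefix or suffix
stopifnot(startsWith(x, prefix), endsWith(x, postfix))

# Keep the characters between the end of the prefix and the start of the suffix
keywords <- substr(x, nchar(prefix) + 1, nchar(x) - nchar(postfix))
print(keywords)
```

Unlike the regex answers, this does not depend on the keyword being 3 or 4 letters long, which may or may not be what you want.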

Answer 4

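If your regex engine supports Perl-style expressions, you can use a positive lookbehind, (?<=pattern), to get the text that follows "protein_class_". The stringi and stringr packages both support this by default and provide easy-to-use extraction functions.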

files <- c("protein_class_Abcd.txt", "protein_class_Egh.txt", "protein_class_Bdc.txt")
stringr::str_extract(files, "(?<=protein_class_)[A-Za-z]{3,4}")
#> [1] "Abcd" "Egh"  "Bdc"

Created on 2019-03-06 by the reprex package (v0.2.1)
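
For comparison, the same lookbehind can be used in base R by passing perl = TRUE. This is a sketch that swaps in regexpr and regmatches for str_extract, not part of the original answer:

```r
# Assumed input: the three file names from the question
files <- c("protein_class_Abcd.txt",
           "protein_class_Egh.txt",
           "protein_class_Bdc.txt")

# regexpr() locates the first match in each string; perl = TRUE enables lookbehind
m <- regexpr("(?<=protein_class_)[A-Za-z]{3,4}", files, perl = TRUE)

# regmatches() pulls out the matched substrings
keywords <- regmatches(files, m)
print(keywords)
```

One difference to keep in mind: regmatches drops elements that have no match, so the result can be shorter than the input, whereas str_extract returns NA in those positions.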

Get keyword from file name in R
