Question

I want to keep the part after the FIRST underscore. Please see the example code.

colnames(df)
"EGAR00001341740_P32_1"    "EGAR00001341741_PN32"

My attempt, which is wrong: it returns only P32 instead of P32_1.

sapply(strsplit(colnames(df), split='_', fixed=TRUE), function(x) (x[2]))

Desired output: P32_1, PN32

Answer 1

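This can be done with sub and a regular expression that matches zero or more characters that are not an underscore ([^_]*) from the start of the string (^), followed by an underscore (_), and replaces the match with an empty string ("").

colnames(df) <- sub("^[^_]*_", "", colnames(df))
colnames(df)
#[1] "P32_1" "PN32"

strsplit splits wherever the split character occurs. One option is str_split from stringr, which has an argument 'n' that specifies the number of pieces to split into. If we choose n = 2, we get 2 substrings, because it splits only at the first underscore.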
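The str_split approach described above is not shown as code, so here is a minimal sketch of how it might look. It assumes the stringr package is installed; the vector x stands in for colnames(df), and the variable name pieces is only illustrative.

```r
library(stringr)

# Stand-in for colnames(df) from the question
x <- c("EGAR00001341740_P32_1", "EGAR00001341741_PN32")

# n = 2 limits the split to two pieces, so only the first "_" is used
pieces <- str_split(x, "_", n = 2)

# Keep the second piece of each split, i.e. everything after the first "_"
res <- sapply(pieces, `[`, 2)
res
```

Assigning res back with colnames(df) <- res would rename the columns in the same way as the sub version.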

Answer 2

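Here are some approaches. The first fixes the code in the question, and the rest are alternatives. All of them use only base R except (6). (4) and (7) assume the first field has a fixed length, which is the case in the question.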

x <- c("EGAR00001341740_P32_1", "EGAR00001341741_PN32")

# 1 - using strsplit
sapply(strsplit(x, "_"), function(x) paste(x[-1], collapse = "_"))
## [1] "P32_1" "PN32"

# 2 - a bit easier using sub.  *? is a non-greedy match
sub(".*?_", "", x)
## [1] "P32_1" "PN32" 

# 3 - locate the first underscore and extract all after that
substring(x, regexpr("_", x) + 1)
## [1] "P32_1" "PN32" 

# 4 - if the first field is fixed length as in the example
substring(x, 17)
## [1] "P32_1" "PN32" 

# 5 - replace first _ with character that does not appear and remove all until it
sub(".*;", "", sub("_", ";", x))
## [1] "P32_1" "PN32" 

# 6 - extract everything after first _
library(gsubfn)
strapplyc(x, "_(.*)", simplify = TRUE)
## [1] "P32_1" "PN32" 

# 7 - like (4) assumes fixed length first field
read.fwf(textConnection(x), widths = c(16, 99), as.is = TRUE)$V2
## [1] "P32_1" "PN32"
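As a quick sanity check, a few of the base R approaches above can be compared against each other. This is only a sketch; the list name results and its element names are illustrative and not part of the original answer.

```r
x <- c("EGAR00001341740_P32_1", "EGAR00001341741_PN32")

# Collect the output of approaches (1), (2) and (3) in one list
results <- list(
  strsplit  = sapply(strsplit(x, "_"), function(p) paste(p[-1], collapse = "_")),
  sub       = sub(".*?_", "", x),
  substring = substring(x, regexpr("_", x) + 1)
)

# TRUE only if every approach returns exactly the same vector as the first
all_agree <- all(sapply(results, identical, results[[1]]))
all_agree
```

The fixed-width approaches (4) and (7) are left out here because they depend on the first field always being 15 characters long.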

strsplit and keep the part after the first underscore
