Question

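I have a data vector in R containing entries such as data = BURR_WK_94_91, and I want to extract the number that sits between two underscores, which in this case is 94. The strings vary in length, so I cannot rely on a fixed start position.

I got most of the way there with this answer: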

library(qdap)
genXtract(data, "_", "_")

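But this returns extra data that I do not need. Is there a way to check whether the text between two underscores is a number, and extract it only then?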

Answer 1

Yes, for example by using a regex with lookbehind and lookahead.

data = "BURR_WK_94_91"
gsub(".*(?<=_)(\\d+)(?=_).*", "\\1", data, perl = TRUE)

[1] "94"
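One thing worth knowing when applying this to a whole vector: gsub() returns a string unchanged if the pattern does not match it. The sketch below, using hypothetical entries (only the first one comes from the question), contrasts that with base R's regexpr() and regmatches(), which make non-matches explicit.

```r
# Hypothetical sample vector; only the first entry is from the question.
entries <- c("BURR_WK_94_91", "LONGER_NAME_WK_7_12", "NO_DIGITS_HERE")

# gsub() leaves non-matching strings untouched, so the third entry
# comes back as-is rather than as NA.
via_gsub <- gsub(".*(?<=_)(\\d+)(?=_).*", "\\1", entries, perl = TRUE)

# regexpr() returns the position of the first match per element (-1 if none);
# regmatches() then pulls out only the matched substrings.
m <- regexpr("(?<=_)\\d+(?=_)", entries, perl = TRUE)

# Keep the result aligned with the input, with NA where nothing matched.
aligned <- rep(NA_character_, length(entries))
aligned[m != -1] <- regmatches(entries, m)
```

Here `aligned` has one element per input entry, which is usually what you want when the result goes back into a data frame column.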

Alternatively, with the stringr package you only need to match the group itself.

stringr::str_extract_all(data, "(?<=_)\\d+(?=_)")

[[1]]
[1] "94"

Answer 2

One approach is to use:

gsub(".*_(\\d+)_.*", "\\1", "BURR_WK_94_91", perl = TRUE)

(\\d+) - a capture group that matches one or more digits
\\1 - a back reference to the first capture group
.*_ - any number of characters, followed by a _
_.* - a _ followed by any number of characters

So basically you are telling the function to replace the entire string with the contents of the capture group.
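Because the leading `.*` is greedy, this pattern picks the last underscore-delimited number when a string contains more than one. The sketch below uses a hypothetical entry with two such numbers to show the difference between the greedy and the lazy form; it is an illustration, not part of the original answer.

```r
# Hypothetical entry containing two underscore-delimited numbers (12 and 94).
y <- "BURR_12_WK_94_91"

# Greedy .* consumes as much as possible, so the capture group
# lands on the last number that is followed by an underscore.
last_num <- sub(".*_(\\d+)_.*", "\\1", y, perl = TRUE)

# Lazy .*? consumes as little as possible, so the capture group
# lands on the first such number.
first_num <- sub(".*?_(\\d+)_.*", "\\1", y, perl = TRUE)
```

sub() is enough here because the pattern spans the whole string, so there is at most one match to replace.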

If there are always exactly 2 digits:

gsub(".*_(\\d{2})_.*", "\\1", "BURR_WK_94_91", perl = TRUE)

Answer 3

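You can use str_match from the stringr package. It returns a matrix whose first column is the full match and whose second column is the capture group:

library(magrittr)  # provides the %>% pipe
stringr::str_match(data, "_([0-9]{2})_") %>%
  magrittr::extract(, 2)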
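If stringr is not available, a roughly equivalent result can be had in base R with regexec() and regmatches(), which return the full match followed by the capture groups for each element. This is a sketch of that alternative, not the approach from the answer above.

```r
data <- "BURR_WK_94_91"

# regexec() records the full match and each capture group;
# regmatches() turns that into a list of character vectors.
m <- regmatches(data, regexec("_([0-9]{2})_", data))

# Element 1 of each vector is the full match, element 2 the capture group.
# Strings without a match yield character(0), mapped to NA here.
two_digits <- vapply(
  m,
  function(g) if (length(g) >= 2) g[2] else NA_character_,
  character(1)
)
```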
