Question

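In R, I want to extract a vector of substrings from character strings that follow a pattern. The first few entries of my character vector x (400 in total) look like this: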

x <- c(
  ">104K_THEPA | FPrate:0.000 | OMEGA:D-904",
  ">2MMP_ARATH | FPrate:0.006 | OMEGA:S-349",
  ">5MMP_ARATH | FPrate:0.018 | OMEGA:S-337",
  ">5NTD_DIPOM | FPrate:0.026 | OMEGA:S-552",
  ">5NTD_HUMAN | FPrate:0.154 | OMEGA:S-549",
  ">5NTD_MOUSE | FPrate:1.000 | OMEGA:S-551"
)

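I want to extract the 4 digits after FPrate: and, eventually, the letter and the last 3 digits after OMEGA:. I am new to regular expressions and have spent hours trying to solve this and searching online for a solution, with no luck.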

The desired output would be:

[1] "0.000"  
[2] "0.006"  
[3] "0.018"  
[4] "0.026"  
[5] "0.154"  
[6] "1.000"

So far I have come up with the following line of code:

gsub("^[^(FPrate:)]*(FPrate:)|(\\s\\|\\sOMEGA:)[^(\\s\\|\\sOMEGA:)]*$", "", x)

It works for some of my entries, but not all of them.

What is the best way to achieve this?

Answer 1

Using the str_match function from stringr to extract only specific parts of the match (the capture groups) makes your problem much easier:

stringr::str_match(x, 'FPrate:([^ ]*).*OMEGA:([^ ]*)')[,c(2,3)]
     [,1]    [,2]   
[1,] "0.000" "D-904"
[2,] "0.006" "S-349"
[3,] "0.018" "S-337"
[4,] "0.026" "S-552"
[5,] "0.154" "S-549"
[6,] "1.000" "S-551"

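str_match matches the regular expression and returns a character matrix: the first column is the whole match, and each subsequent column holds what the corresponding parenthesized group matched, in order. So by taking the second and third columns, we get just the runs of non-space characters after 'FPrate:' and after 'OMEGA:'.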
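
For comparison, the same layout (whole match first, then one entry per capture group) can be reproduced without stringr. The following is a minimal sketch using base R's regexec and regmatches; the names hits and m and the two-entry sample vector are only illustrative.

```r
# Two sample entries from the question
x <- c(
  ">104K_THEPA | FPrate:0.000 | OMEGA:D-904",
  ">2MMP_ARATH | FPrate:0.006 | OMEGA:S-349"
)

# regexec() records the position of the whole match and of every capture group;
# regmatches() turns those positions into substrings, one character vector per input.
hits <- regmatches(x, regexec("FPrate:([^ ]*).*OMEGA:([^ ]*)", x))

# Stack the per-string vectors into a matrix: column 1 is the whole match,
# columns 2 and 3 are the two capture groups.
m <- do.call(rbind, hits)
m[, 2:3]
```

Dropping the first column here plays the same role as the [,c(2,3)] subscript on the str_match result.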

You can add as many capture groups as you need. For example, if you want to split OMEGA into its letter and its number, just use more groups:

stringr::str_match(x, 'FPrate:([^ ]*).*OMEGA:([[:alnum:]])-(\\d*)')[,c(2:4)]
     [,1]    [,2] [,3] 
[1,] "0.000" "D"  "904"
[2,] "0.006" "S"  "349"
[3,] "0.018" "S"  "337"
[4,] "0.026" "S"  "552"
[5,] "0.154" "S"  "549"
[6,] "1.000" "S"  "551"

Answer 2

This uses stringi directly (the package that stringr wraps), together with a readable, commented regular expression:

library(stringi)
library(tidyverse)

Your data:

c(
  ">104K_THEPA | FPrate:0.000 | OMEGA:D-904",
  ">2MMP_ARATH | FPrate:0.006 | OMEGA:S-349",
  ">5MMP_ARATH | FPrate:0.018 | OMEGA:S-337",
  ">5NTD_DIPOM | FPrate:0.026 | OMEGA:S-552",
  ">5NTD_HUMAN | FPrate:0.154 | OMEGA:S-549",
  ">5NTD_MOUSE | FPrate:1.000 | OMEGA:S-551"
) -> xdat

Extraction:

stri_match_first_regex(
  xdat,
  "
  FPrate:([[:digit:]]\\.[[:digit:]]+) # this grabs the FPrate amount
  .*                                  # this skips a bit generically just in case it ever differs
  OMEGA:([[:alnum:]]-[[:digit:]]+)    # this grabs the OMEGA info
  ",
  opts_regex = stri_opts_regex(comments = TRUE)
)[,2:3] %>% 
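  as.data.frame(stringsAsFactors = FALSE) %>% # columns are named V1 and V2
  as_tibble() %>% # as_data_frame() is deprecated in current tibble releases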
  mutate(V1 = as.numeric(V1), V2 = stri_replace_first_fixed(V2, "-", ""))
## # A tibble: 6 x 2
##      V1 V2   
##   <dbl> <chr>
## 1 0     D904 
## 2 0.006 S349 
## 3 0.018 S337 
## 4 0.026 S552 
## 5 0.154 S549 
## 6 1     S551
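
The commented-regex idea is not tied to stringi. As a rough sketch, assuming a PCRE-enabled build of R (the default), the inline flag (?x) combined with perl = TRUE gives the same free-spacing behaviour in base R; the names pat, hits, fprate and omega are only illustrative.

```r
# Two sample entries from the question
x <- c(
  ">104K_THEPA | FPrate:0.000 | OMEGA:D-904",
  ">2MMP_ARATH | FPrate:0.006 | OMEGA:S-349"
)

# (?x) makes PCRE ignore whitespace in the pattern and treat "#" to end of line as a comment
pat <- "(?x)
  FPrate:(\\d\\.\\d+)        # group 1: the FPrate amount
  .*                         # skip whatever sits in between
  OMEGA:([[:alnum:]]-\\d+)   # group 2: the OMEGA info
"

hits <- regmatches(x, regexec(pat, x, perl = TRUE))

# Element 1 of each hit is the whole match; elements 2 and 3 are the groups
fprate <- as.numeric(vapply(hits, `[`, character(1), 2))
omega  <- vapply(hits, `[`, character(1), 3)
```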

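Also: keep at it with the regex attempt from your question. Regular expressions are not pretty, and they often only start to make sense after you have used them for a while.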

Answer 3

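Base R

Here are some base R solutions:

1) If you only need the FPrate field (which is all that the question's desired output shows), this sub will do. No packages are needed.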

as.numeric(sub(".*FPrate:(\\S+) .*", "\\1", x))
## [1] 0.000 0.006 0.018 0.026 0.154 1.000

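2) If you want to parse out all the name:value fields, then, again using only base R, replace the leading run of non-space characters with a newline, and then replace every occurrence of space-character-space with a newline as well. The result is in dcf format, so read it with read.dcf, which gives the character matrix m. That may already be good enough, but if you want a data frame with each column converted to an appropriate type, convert m to a data frame d and apply type.convert. This solution is quite general, since it does not hard-code FPrate or OMEGA.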

s <- gsub(" . ", "\n", sub("\\S+", "\n", x))
m <- read.dcf(textConnection(s))
d <- as.data.frame(m, stringsAsFactors = FALSE)
d[] <- lapply(d, type.convert, as.is = TRUE)  # as.is = TRUE keeps strings as character

giving:

> m
     FPrate  OMEGA  
[1,] "0.000" "D-904"
[2,] "0.006" "S-349"
[3,] "0.018" "S-337"
[4,] "0.026" "S-552"
[5,] "0.154" "S-549"
[6,] "1.000" "S-551"

> d
  FPrate OMEGA
1  0.000 D-904
2  0.006 S-349
3  0.018 S-337
4  0.026 S-552
5  0.154 S-549
6  1.000 S-551
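
To see why read.dcf can parse the result in 2), it may help to look at the intermediate text for a single entry. This is only an illustrative sketch; step1 is a name introduced here, and the two substitutions are the ones from the code above, applied one at a time.

```r
x <- ">104K_THEPA | FPrate:0.000 | OMEGA:D-904"

# Step 1: the leading run of non-space characters (">104K_THEPA") becomes a newline
step1 <- sub("\\S+", "\n", x)

# Step 2: every space-character-space separator (" | ") becomes a newline too
s <- gsub(" . ", "\n", step1)

# Each field now sits on its own line as name:value, which is what read.dcf reads
cat(s, "\n")
```

The blank lines at the start of each entry are what separate one record from the next.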

3) This code uses strcapture and produces a data frame whose column types are converted according to proto:

proto <- data.frame(FPrate = numeric(0), OMEGA = character(0))
strcapture(".*FPrate:(\\S+) . OMEGA:(\\S+)", x, proto)

giving:

  FPrate OMEGA
1  0.000 D-904
2  0.006 S-349
3  0.018 S-337
4  0.026 S-552
5  0.154 S-549
6  1.000 S-551

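4) Here we replace the colons with spaces, read what is left with read.table, extract the desired columns, and then set the column names. No regular expressions are used.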

d <- read.table(text = chartr(":", " ", x), as.is = TRUE)[c(4, 7)]
names(d) <- c("FPrate", "OMEGA")

This gives the following data frame:

  FPrate OMEGA
1  0.000 D-904
2  0.006 S-349
3  0.018 S-337
4  0.026 S-552
5  0.154 S-549
6  1.000 S-551
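
The column numbers 4 and 7 in 4) follow from how the line splits on whitespace once the colons are gone. A small sketch, using scan on one entry purely to list the fields (tokens is an illustrative name):

```r
x <- ">104K_THEPA | FPrate:0.000 | OMEGA:D-904"

# Replace ":" with a space, then split on whitespace the way read.table would
tokens <- scan(text = chartr(":", " ", x), what = "", quiet = TRUE)

tokens           # seven fields: the id, "|", "FPrate", the rate, "|", "OMEGA", the omega value
tokens[c(4, 7)]  # the two fields kept by the [c(4, 7)] subscript
```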

gsubfn

5) This solution uses the gsubfn package.

library(gsubfn)

pat <- ".*FPrate:(\\S+).*OMEGA:(\\S+)"
nms <- c("FPrate", "OMEGA")
read.pattern(text = x, pattern = pat, as.is = TRUE, col.names = nms)

giving:

  FPrate OMEGA
1  0.000 D-904
2  0.006 S-349
3  0.018 S-337
4  0.026 S-552
5  0.154 S-549
6  1.000 S-551

Answer 4

A solution using pure base R

xx            <- strsplit(x, " \\| ")
first.numbers <- sapply(xx, function(x) gsub("FPrate:", "", x[2]))
letters       <- sapply(xx, function(x) gsub("OMEGA:(.?)-\\d+", "\\1", x[[3]]))
last.digits   <- sapply(xx, function(x) gsub("OMEGA:.?-(\\d+)", "\\1", x[[3]]))

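Explanation

If you want to stick to base R, I have come to realize that gsub is very versatile. You can even use it to capture groups.

In this example, to keep things simple, I first strsplit on " | ":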

xx <- strsplit(x, " \\| ", perl=TRUE)

If you now look at xx:

> xx
[[1]]
[1] ">104K_THEPA"  "FPrate:0.000" "OMEGA:D-904" 

[[2]]
[1] ">2MMP_ARATH"  "FPrate:0.006" "OMEGA:S-349" 

[[3]]
[1] ">5MMP_ARATH"  "FPrate:0.018" "OMEGA:S-337" 

[[4]]
[1] ">5NTD_DIPOM"  "FPrate:0.026" "OMEGA:S-552" 

[[5]]
[1] ">5NTD_HUMAN"  "FPrate:0.154" "OMEGA:S-549" 

[[6]]
[1] ">5NTD_MOUSE"  "FPrate:1.000" "OMEGA:S-551"

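So you can pick just the second or the third element of each entry and use sapply to map over the list (in this case it is equivalent to unlist(lapply(...))), which returns a vector at the end.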
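
As a side note, vapply is a stricter variant of the same idea: it also maps over the list, but it checks that every result has the declared type and length. This is a sketch of an alternative, not part of the solution above; the argument name parts is illustrative.

```r
# Two sample entries from the question
x <- c(
  ">104K_THEPA | FPrate:0.000 | OMEGA:D-904",
  ">2MMP_ARATH | FPrate:0.006 | OMEGA:S-349"
)

# fixed = TRUE treats " | " as a literal string, so no escaping is needed
xx <- strsplit(x, " | ", fixed = TRUE)

# character(1) declares that each call must return exactly one string
first.numbers <- vapply(
  xx,
  function(parts) sub("FPrate:", "", parts[2], fixed = TRUE),
  character(1)
)
first.numbers
```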

To capture the first number, I would do this:

first.numbers <- sapply(xx, function(x) gsub("FPrate:", "", x[2]))
first.numbers
## [1] "0.000" "0.006" "0.018" "0.026" "0.154" "1.000"

Here I simply removed "FPrate:". I could also have obtained the number with a capture group; I will use that approach for the next extraction:

letters <- sapply(xx, function(x) gsub("OMEGA:(.?)-\\d+", "\\1", x[[3]]))
letters
## [1] "D" "S" "S" "S" "S" "S"

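Note that here I matched the whole third element with "OMEGA:(.?)-\\d+", but captured a single position with the group () (zero or one character, but because the match is greedy, one is taken). The interesting part is what I supply as the replacement for the whole match: "\\1", the content captured by the first group. So in gsub you can use references to groups, \\1, \\2 and so on, depending on how many groups you added.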
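
A small sketch of backreferences with two groups at once; the pattern below is a variant written for this illustration (it uses . rather than .? and adds a second group), and the variable names are hypothetical.

```r
omega <- c("OMEGA:D-904", "OMEGA:S-349")

# Group 1 is the single letter, group 2 is the run of digits
pat <- "OMEGA:(.)-(\\d+)"

letter.only <- sub(pat, "\\1", omega)     # keep only what group 1 captured
digits.only <- sub(pat, "\\2", omega)     # keep only what group 2 captured
swapped     <- sub(pat, "\\2\\1", omega)  # references can also be combined or reordered
```

Because the pattern matches the whole string, the replacement decides entirely what is left over.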

So we can capture the final digits in the same way:

last.digits <- sapply(xx, function(x) gsub("OMEGA:.?-(\\d+)", "\\1", x[[3]]))
last.digits
## [1] "904" "349" "337" "552" "549" "551"

gsub() is not so bad after all, is it?
