How to extract a specific string and its corresponding value?

Time: 2018-05-23 11:35:00

Tags: r regex stringr

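My data frame "B" has the following column "Checks", which holds free-text entries, one per row. Each entry contains a variable 'abc' together with a corresponding value. The entries were typed in manually and are not consistent from one row to the next. I need to extract 'abc' and then its value.

> B$Checks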

    rows    Checks
    [1] there was no problem  reported measures abc-96 xyz 450 327bbb11869 xyz 113 aaa 4 poc 470 b 3 surveyor issue
    [2] abc(107 to 109) xyz 115 jbo xyz 104 optim
    [3] problemm with caller abc 95 19468 4g xyz 103 91960 1 Remarks new loc reqd is problem
    [4] abc_107 xyz 116 dor problem 
    [5] surevy done , no approximation issues abc 103 xyz 109 crux xyz 104 
    [6] ping test ok abc(86 rxlevel 84
    [7] field is clean , can be used to buiild the required set up abc-86 xyz 94 Digital DSL  No Building class Residential Building Type Multi
    [8] abc 89 xyz 99 so as the user has no problem , check ping test

Expected output

    rows    Variable    Value
    [1]     abc         96
    [2]     abc         107
    [3]     abc         95
    [4]     abc         107
    [5]     abc         103
    [6]     abc         86
    [7]     abc         86
    [8]     abc         89

Following the references given under similar questions, I tried the following.

Using str_match:

library(stringr)
m1 <- str_match(B$Checks, "abc.*?([0-200.]{1,})")  # the value is expected to be between 0 and 200

This produced something like the following:

    row var value
1   abc-96 xyz 450  0
2   abc(10  10
3   abc 95 1    1
4   abc_10  10
5   abc 10  10
6   NA  NA
7   NA  NA
8   NA  NA

Then I tried the following:

B$Checks <- gsub("-", " ", B$Checks)
B$Checks <- gsub("/", " ", B$Checks)
B$Checks <- gsub("_", " ", B$Checks)
B$Checks <- gsub(":", " ", B$Checks)
B$Checks <- gsub(")", " ", B$Checks, fixed = TRUE)  # fixed = TRUE treats the parenthesis literally
B$Checks <- gsub("(", " ", B$Checks, fixed = TRUE)
B$Checks <- gsub(".*abc", "abc", B$Checks) 
B$Checks <- gsub("[[:punct:]]", " ", B$Checks)
regexp <- "[[:digit:]]+"   
m <- str_extract(B$Checks, regexp) 
m <- as.data.frame(m)

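With this I was able to get the expected output.

But now I am looking for the following:

1) A simpler set of commands, or a simpler way, to extract the expected output.

2) A way to get values that are expressed as a range. For example, I want the input row below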

rows    Checks
[2] abc(107 to 109) xyz 115 jbo xyz 104 optim

to be returned as the following output:

rows    Variable    Value1 Value2
 [2]     abc        107   109

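I need solutions for both 1) and 2), because I am working with a large data set that follows the same pattern and has many mixed variable-value combinations.

Thanks in advance.

3 answers:

Answer 0: (score: 3)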

You need to capture the digits, using a lookbehind to specify that abc must come before them:

Value <- sub(".*(?<=abc)(\\D+)?(\\d*)\\D?.*", "\\2", str, perl=TRUE)
# Value
#[1] "96"  "107" "95"  "107" "103" "86"  "86"  "89"

Then you can put the values in a data.frame:

B <- data.frame(Variable="abc", Value=as.numeric(Value))
head(B, 3)
#  Variable Value
#1      abc    96
#2      abc   107
#3      abc    95
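
One thing to keep in mind with the sub() approach is that sub() returns the input unchanged when the pattern does not match, so a row without abc would come back as the full text rather than as a missing value. A minimal sketch of a variant that yields NA in that case, using base R's regexec() and regmatches(); the vector x and its last element are made up for illustration:

```r
# Hypothetical sample input; the last element deliberately contains no "abc"
x <- c("measures abc-96 xyz 450",
       "abc(107 to 109) xyz 115",
       "ping test ok, nothing to report")

# regexec() records the full match plus each capture group per element;
# regmatches() returns character(0) for elements that did not match
m <- regmatches(x, regexec("abc\\D*(\\d+)", x))

# Take the first capture group, or NA when there was no match
Value <- vapply(
  m,
  function(g) if (length(g) >= 2) as.numeric(g[2]) else NA_real_,
  numeric(1)
)
Value
# [1]  96 107  NA
```

The pattern abc\\D*(\\d+) is a simplification chosen for this sketch: it skips any run of non-digits after abc and captures the first run of digits.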

Data

str <- c("there was no problem  reported measures abc-96 xyz 450 327bbb11869 xyz 113 aaa 4 poc 470 b 3 surveyor issue", 
"abc(107 to 109) xyz 115 jio xyz 104 optim", "problemm with caller abc 95 19468 4g xyz 103 91960 1 Remarks new loc reqd is problem", 
"abc_107 xyz 116 dor problem", "surevy done , no approximation issues abc 103 xyz 109 crux xyz 104 ", 
"ping test ok abc(86 rxlevel 84", "field is clean , can be used to buiild the required set up abc-86 xyz 94 Digital DSL  No Building class Residential Building Type Multi", 
"abc 89 xyz 99 so as the user has no problem , check ping test")

Answer 1: (score: 0)

Using gsub() twice, with magrittr for better readability:

library(magrittr)

data.frame(
  Variable = "abc",
  Value = data %>%
    gsub(".*(abc.{6}).*", "\\1", .) %>%
    gsub("[^0-9]+(\\d+).*", "\\1", .)
)
  Variable Value
1      abc    96
2      abc   107
3      abc    95
4      abc   107
5      abc   103
6      abc    86
7      abc    86
8      abc    89

First we extract abc together with the next 6 characters, then we extract the first integer that appears in that substring.

Data
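
To make the two steps easier to follow, here is a small sketch that stores the intermediate result instead of piping it. The vector x is a shortened, made-up sample, and step1 and step2 are names introduced only for this illustration:

```r
# Shortened sample rows, for illustration only
x <- c("measures abc-96 xyz 450 327bbb11869",
       "abc(107 to 109) xyz 115")

# Step 1: keep "abc" plus the six characters that follow it
step1 <- gsub(".*(abc.{6}).*", "\\1", x)
step1
# [1] "abc-96 xy" "abc(107 t"

# Step 2: skip the leading non-digits and keep the first run of digits
step2 <- gsub("[^0-9]+(\\d+).*", "\\1", step1)
step2
# [1] "96"  "107"
```

Note that the first pattern requires at least six characters after abc. When fewer follow, step 1 leaves the string unchanged, so it may be worth checking such rows separately.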

data <- c("there was no problem  reported measures abc-96 xyz 450 327bbb11869 xyz 113 aaa 4 poc 470 b 3 surveyor issue", 
"abc(107 to 109) xyz 115 jio xyz 104 optim", "problemm with caller abc 95 19468 4g xyz 103 91960 1 Remarks new loc reqd is problem", 
"abc_107 xyz 116 dor problem ", "surevy done , no approximation issues abc 103 xyz 109 crux xyz 104 ", 
"ping test ok abc(86 rxlevel 84", "field is clean , can be used to buiild the required set up abc-86 xyz 94 Digital DSL  No Building class Residential Building Type Multi", 
"abc 89 xyz 99 so as the user has no problem , check ping test"
)

Answer 2: (score: 0)

Using stringr to manipulate the strings and rebus to write readable regular expressions:

library(stringr)
library(rebus)
str_match(checks, pattern = capture("abc") %R% optional(or1(c(SPC, PUNCT))) %R% capture(one_or_more(DGT)))

Output:

     [,1]      [,2]  [,3] 
[1,] "abc-96"  "abc" "96" 
[2,] "abc(107" "abc" "107"
[3,] "abc 95"  "abc" "95" 
[4,] "abc_107" "abc" "107"
[5,] "abc 103" "abc" "103"
[6,] "abc(86"  "abc" "86" 
[7,] "abc-86"  "abc" "86" 
[8,] "abc 89"  "abc" "89"
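
For part 2) of the question (values given as a range), the same str_match() idea can probably be extended with an optional second capture group. The sketch below writes the regular expression by hand rather than with rebus, and assumes the range is always written as "<number> to <number>"; the vector x and the data frame out are made up for illustration:

```r
library(stringr)

# Made-up sample: one row with a range, one row with a single value
x <- c("abc(107 to 109) xyz 115", "abc-96 xyz 450")

# Group 1: first number after "abc"
# Group 2: optional second number after the word "to"
m <- str_match(x, "abc\\D*(\\d+)(?:\\s*to\\s*(\\d+))?")

out <- data.frame(
  Variable = "abc",
  Value1 = as.numeric(m[, 2]),
  Value2 = as.numeric(m[, 3])  # NA when the row has no range
)
out
#   Variable Value1 Value2
# 1      abc    107    109
# 2      abc     96     NA
```

Rows without a range keep Value2 as NA, so single values and ranges can live in the same data frame.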

Data:

checks <- c("there was no problem  reported measures abc-96 xyz 450 327bbb11869 xyz 113 aaa 4 poc 470 b 3 surveyor issue", 
            "abc(107 to 109) xyz 115 jio xyz 104 optim", "problemm with caller abc 95 19468 4g xyz 103 91960 1 Remarks new loc reqd is problem", 
            "abc_107 xyz 116 dor problem", "surevy done , no approximation issues abc 103 xyz 109 crux xyz 104 ", 
            "ping test ok abc(86 rxlevel 84", "field is clean , can be used to buiild the required set up abc-86 xyz 94 Digital DSL  No Building class Residential Building Type Multi", 
            "abc 89 xyz 99 so as the user has no problem , check ping test")