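My data frame `B` has a column `Checks` that contains free-text entries, one per row. Each entry mentions a variable 'abc' followed by a corresponding value. The entries were typed by hand and are not consistent from row to row. I need to extract 'abc' and then its value.
> B$Checks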
rows Checks
[1] there was no problem reported measures abc-96 xyz 450 327bbb11869 xyz 113 aaa 4 poc 470 b 3 surveyor issue
[2] abc(107 to 109) xyz 115 jbo xyz 104 optim
[3] problemm with caller abc 95 19468 4g xyz 103 91960 1 Remarks new loc reqd is problem
[4] abc_107 xyz 116 dor problem
[5] surevy done , no approximation issues abc 103 xyz 109 crux xyz 104
[6] ping test ok abc(86 rxlevel 84
[7] field is clean , can be used to buiild the required set up abc-86 xyz 94 Digital DSL No Building class Residential Building Type Multi
[8] abc 89 xyz 99 so as the user has no problem , check ping test
Expected output
rows Variable Value
[1] abc 96
[2] abc 107
[3] abc 95
[4] abc 107
[5] abc 103
[6] abc 86
[7] abc 86
[8] abc 89
I tried the following, based on answers to similar questions.
Using str_match:
library(stringr)
m1 <- str_match(B$Checks, "abc.*?([0-200.]{1,})") # intended: value is between 0 and 200
which produced something like this:
row var value
1 abc-96 xyz 450 0
2 abc(10 10
3 abc 95 1 1
4 abc_10 10
5 abc 10 10
6 NA NA
7 NA NA
8 NA NA
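A note on why this attempt misfires: inside square brackets, `0-200.` is not a numeric range. It is a character class made of the range `0-2`, two more literal zeros, and a dot. The sketch below uses base R only, and `x` is a made-up two-element vector, not the full `B$Checks` column.

```r
# Illustrative input only (an assumption, not the asker's data frame).
x <- c("abc-96 xyz 450", "abc(107 to 109)")

# [0-200.] matches single characters from the set {0, 1, 2, .},
# so the first hit is the first run of those characters in each string.
class_hits <- regmatches(x, regexpr("[0-200.]+", x))
class_hits
# "0"  "10"

# Matching a run of digits after abc and at most one separator instead:
first_num <- sub(".*abc[^0-9]?([0-9]+).*", "\\1", x)
first_num
# "96"  "107"
```

The second pattern does not enforce the 0 to 200 limit; if that limit matters, it can be checked after converting the result with `as.numeric()`.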
Then I tried the following:
B$Checks <- gsub("-", " ", B$Checks)
B$Checks <- gsub("/", " ", B$Checks)
B$Checks <- gsub("_", " ", B$Checks)
B$Checks <- gsub(":", " ", B$Checks)
B$Checks <- gsub(")", " ", B$Checks, fixed = TRUE)
B$Checks <- gsub("(", " ", B$Checks, fixed = TRUE)
B$Checks <- gsub(".*abc", "abc", B$Checks)
B$Checks <- gsub("[[:punct:]]", " ", B$Checks)
regexp <- "[[:digit:]]+"
m <- str_extract(B$Checks, regexp)
m <- as.data.frame(m)
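and was able to get the expected output.
But now I am looking for the following:
1) a simpler set of commands, or a simpler way to extract the expected output;
2) a way to get values that are given as a range; for example, I want the input row below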
rows Checks
[2] abc(107 to 109) xyz 115 jbo xyz 104 optim
to be output as:
rows Variable Value1 Value2
[2] abc 107 109
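I need solutions for both 1) and 2), because I am working with a large dataset that follows the same pattern and contains many mixed variable/value combinations.
Thanks in advance.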
Answer 0 (score: 3)
You need to capture the digits, specifying with a lookbehind that abc must come before the digits you are looking for:
Value <- sub(".*(?<=abc)(\\D+)?(\\d*)\\D?.*", "\\2", str, perl=TRUE)
# Value
#[1] "96" "107" "95" "107" "103" "86" "86" "89"
Then you can put the values in a data.frame:
B <- data.frame(Variable="abc", Value=as.numeric(Value))
head(B, 3)
# Variable Value
#1 abc 96
#2 abc 107
#3 abc 95
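The `sub()` call above returns only the first number after abc, so it does not cover the range case from the question (`abc(107 to 109)`). A minimal sketch of one way to get both bounds, assuming ranges are always written as `<number> to <number>` and that only spaces or punctuation sit between abc and the first number. The vector `x` and the names `value1`, `value2`, `has_range` and `out` are illustrative, not part of the original answer.

```r
# Illustrative subset of the entries (base R only).
x <- c("abc(107 to 109) xyz 115 jio xyz 104 optim",
       "abc-96 xyz 450",
       "ping test ok abc(86 rxlevel 84")

# Separator between abc and the number: any run of spaces/punctuation.
sep <- "[[:space:][:punct:]]*"

# First number after abc.
value1 <- sub(paste0(".*abc", sep, "([0-9]+).*"), "\\1", x)

# Upper bound, only where "<number> to <number>" follows abc.
range_core <- paste0("abc", sep, "[0-9]+ +to +")
has_range  <- grepl(paste0(range_core, "[0-9]+"), x)
value2 <- ifelse(has_range,
                 sub(paste0(".*", range_core, "([0-9]+).*"), "\\1", x),
                 NA_character_)

out <- data.frame(Variable = "abc",
                  Value1 = as.numeric(value1),
                  Value2 = as.numeric(value2))
out
```

Rows without a range get `NA` in `Value2`. Entries that do not contain abc at all are returned unchanged by `sub()`, so they would need a separate `grepl("abc", x)` check before conversion.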
Data
str <- c("there was no problem reported measures abc-96 xyz 450 327bbb11869 xyz 113 aaa 4 poc 470 b 3 surveyor issue",
"abc(107 to 109) xyz 115 jio xyz 104 optim", "problemm with caller abc 95 19468 4g xyz 103 91960 1 Remarks new loc reqd is problem",
"abc_107 xyz 116 dor problem", "surevy done , no approximation issues abc 103 xyz 109 crux xyz 104 ",
"ping test ok abc(86 rxlevel 84", "field is clean , can be used to buiild the required set up abc-86 xyz 94 Digital DSL No Building class Residential Building Type Multi",
"abc 89 xyz 99 so as the user has no problem , check ping test")
Answer 1 (score: 0)
Using gsub() twice, with magrittr for better readability:
library(magrittr)
data.frame(
Variable = "abc",
Value = data %>%
gsub(".*(abc.{6}).*", "\\1", .) %>%
gsub("[^0-9]+(\\d+).*", "\\1", .)
)
Variable Value
1 abc 96
2 abc 107
3 abc 95
4 abc 107
5 abc 103
6 abc 86
7 abc 86
8 abc 89
First we extract abc and the following 6 characters, then we extract the first integer that appears in that substring.
Data:
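To make the two steps concrete, here is a small sketch that runs them one at a time on a single entry and shows the intermediate string. It also shows a limitation of the fixed 6-character window; the second input `short` is a made-up string, not one of the original entries.

```r
# Step by step on one of the original entries (base R only).
x <- "ping test ok abc(86 rxlevel 84"

# Step 1: keep abc plus the next 6 characters.
step1 <- gsub(".*(abc.{6}).*", "\\1", x)
step1
# "abc(86 rx"

# Step 2: keep the first run of digits in that substring.
step2 <- gsub("[^0-9]+(\\d+).*", "\\1", step1)
step2
# "86"

# Limitation: with fewer than 6 characters after abc, step 1 does not match
# and returns the input unchanged, so step 2 picks up an earlier number.
short <- "xyz 12 abc 89"   # hypothetical entry
bad <- gsub("[^0-9]+(\\d+).*", "\\1", gsub(".*(abc.{6}).*", "\\1", short))
bad
# "12"
```

If entries can end shortly after the value, a pattern that anchors on abc directly may be safer than a fixed-width window.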
data <- c("there was no problem reported measures abc-96 xyz 450 327bbb11869 xyz 113 aaa 4 poc 470 b 3 surveyor issue",
"abc(107 to 109) xyz 115 jio xyz 104 optim", "problemm with caller abc 95 19468 4g xyz 103 91960 1 Remarks new loc reqd is problem",
"abc_107 xyz 116 dor problem ", "surevy done , no approximation issues abc 103 xyz 109 crux xyz 104 ",
"ping test ok abc(86 rxlevel 84", "field is clean , can be used to buiild the required set up abc-86 xyz 94 Digital DSL No Building class Residential Building Type Multi",
"abc 89 xyz 99 so as the user has no problem , check ping test"
)
Answer 2 (score: 0)
Using stringr to manipulate the strings and rebus to write readable regular expressions:
library(stringr)
library(rebus)
str_match(checks, pattern = capture("abc") %R% optional(or1(c(SPC, PUNCT))) %R% capture(one_or_more(DGT)))
Output:
[,1] [,2] [,3]
[1,] "abc-96" "abc" "96"
[2,] "abc(107" "abc" "107"
[3,] "abc 95" "abc" "95"
[4,] "abc_107" "abc" "107"
[5,] "abc 103" "abc" "103"
[6,] "abc(86" "abc" "86"
[7,] "abc-86" "abc" "86"
[8,] "abc 89" "abc" "89"
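For readers without stringr or rebus installed, roughly the same match can be sketched in base R. The pattern below is a hand-written approximation of the rebus expression (abc, at most one space or punctuation character, then digits), not its exact generated regex, and `x`, `m` and `out` are illustrative names.

```r
# Illustrative subset of the entries (base R only).
x <- c("abc-96 xyz 450",
       "abc(107 to 109)",
       "abc_107 xyz 116",
       "problemm with caller abc 95 19468")

# Group 1: the variable name; group 2: the digits that follow it.
pattern <- "(abc)[[:space:][:punct:]]?([0-9]+)"

# regexec() + regmatches() give one vector per entry: full match, then groups.
m <- do.call(rbind, regmatches(x, regexec(pattern, x)))

out <- data.frame(Variable = m[, 2], Value = as.numeric(m[, 3]))
out
```

Unlike `str_match()`, which returns a row of `NA` for entries with no match, `regmatches()` returns an empty vector for them, so `rbind` would silently drop those rows. Filtering with `grepl(pattern, x)` first keeps the row alignment explicit.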
Data:
checks <- c("there was no problem reported measures abc-96 xyz 450 327bbb11869 xyz 113 aaa 4 poc 470 b 3 surveyor issue",
"abc(107 to 109) xyz 115 jio xyz 104 optim", "problemm with caller abc 95 19468 4g xyz 103 91960 1 Remarks new loc reqd is problem",
"abc_107 xyz 116 dor problem", "surevy done , no approximation issues abc 103 xyz 109 crux xyz 104 ",
"ping test ok abc(86 rxlevel 84", "field is clean , can be used to buiild the required set up abc-86 xyz 94 Digital DSL No Building class Residential Building Type Multi",
"abc 89 xyz 99 so as the user has no problem , check ping test")