Extracting numbers from strings using regular expressions in R

Date: 2018-03-26 21:05:02

Tags: r regex

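Regular expressions are something I've always struggled with and never taken the proper time to learn. In this case, I have an R character vector containing baseball data in this format: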

hit_vector = c("", "Batted ball speed <b>104 mph</b>; distance of <b>381 feet</b>; launch angle of <b>38 degrees</b>.", 
"Ball was hit at <b>67 mph</b>.", "", "Ball left the bat at <b>107 mph</b> and traveled a distance of <b>412 feet</b>.", 
"Batted ball speed <b>71 mph</b>.", "Ball left the bat at <b>94 mph</b> and traveled a distance of <b>287 feet</b>.", 
"", "", "Batted ball speed <b>64 mph</b>.")

> hit_vector
 [1] ""                                                                                                       
 [2] "Batted ball speed <b>104 mph</b>; distance of <b>381 feet</b>; launch angle of <b>38 degrees</b>."
 [3] "Ball was hit at <b>67 mph</b>."                                                                         
 [4] ""                                                                                                       
 [5] "Ball left the bat at <b>107 mph</b> and traveled a distance of <b>412 feet</b>."                        
 [6] "Batted ball speed <b>71 mph</b>."                                                                       
 [7] "Ball left the bat at <b>94 mph</b> and traveled a distance of <b>287 feet</b>."                         
 [8] ""                                                                                                       
 [9] ""                                                                                                       
[10] "Batted ball speed <b>64 mph</b>."  

I'm trying to create a data frame with 10 rows, like this:

hit_dataframe
    speed   distance   degrees
1.     NA         NA        NA
2.    104        381        38
3.     67         NA        NA
4.     NA         NA        NA
5.    107        412        NA
6.     71         NA        NA
7.     94        287        NA
8.     NA         NA        NA
9.     NA         NA        NA
10.    64         NA        NA

The full hit_vector is much longer, but all of its elements seem to follow this same convention.

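Edit: It looks like the following helps identify some of the information, but these lines aren't perfect (the third line returns all FALSE, which isn't correct):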

grepl("[0-9]{1,3} mph", hit_vector)
grepl("[0-9]{1,3} feet", hit_vector)
grepl("[0-9]{1,3} degrees", hit_vector)

Edit 2: I'm not sure how many digits each statistic has. For example, mph can be over 100 (3 digits) or under 10 (1 digit).
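On the digit-count concern: `[0-9]{1,3}` already covers one to three digits, and `[0-9]+` drops the upper bound entirely, so the width does not need to be known in advance. A minimal sketch, assuming made-up sample strings (`samples` and `has_speed` are hypothetical names, not part of the data above):

```r
# Hypothetical sample strings with 1-, 2- and 3-digit speeds, plus an empty one
samples <- c("<b>9 mph</b>", "<b>67 mph</b>", "<b>104 mph</b>", "")

# "[0-9]+" matches one or more digits, so no fixed width is required
has_speed <- grepl("[0-9]+ mph", samples)
has_speed
# [1]  TRUE  TRUE  TRUE FALSE
```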

4 Answers:

Answer 0: (score: 11)

Using base R:

read.table(text=gsub("\\D+"," ",hit_vector),fill=T,blank.lines.skip = F)

    V1  V2 V3
1   NA  NA NA
2  104 381 38
3   67  NA NA
4   NA  NA NA
5  107 412 NA
6   71  NA NA
7   94 287 NA
8   NA  NA NA
9   NA  NA NA
10  64  NA NA

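Here, we simply replace everything that is not a digit (i.e. \\D+) with a space, then read the result with fill=T and without skipping blank lines (blank.lines.skip = F).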
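To make the intermediate step visible, here is a small sketch of what the gsub call produces before read.table parses it. The vector `x` is a hypothetical three-element sample (the first two strings follow hit_vector; the third has a distance but no speed), and `cleaned` is an illustrative name:

```r
# Hypothetical sample: empty string, speed + distance, distance only
x <- c("",
       "Ball left the bat at <b>107 mph</b> and traveled a distance of <b>412 feet</b>.",
       "traveled a distance of <b>412 feet</b>.")

# Replace every run of non-digit characters with a single space
cleaned <- gsub("\\D+", " ", x)
cleaned
# [1] ""          " 107 412 " " 412 "
```

Note that the columns are purely positional: in the third element the distance ends up as the first number on its line, so it would be read into the first column rather than the second.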

Considering the comment you made below, we need to rearrange the data:

hit_vector1=c(hit_vector,"traveled a distance of <b>412 feet</b>.")

#Take the numbers together with their respective measurements.
a=gsub(".*?(\\d+).*?(mph|feet|degree).*?"," \\1 \\2",hit_vector1)

#Remove the </b>
b=sub("<[/]b>.","",a)

## Any element that does not contain the measurements, invoke an NA
fun=function(x){y=-grep(x,b);b<<-replace(b,y,paste(b[y],NA,x))}
invisible(sapply(c("mph","feet","degrees"),fun))

## Break the line after each measurement and read in a table format
e=gsub("([a-z])\\s","\\1\n",b)
unstack(read.table(text=e))
      degrees feet mph
1       NA   NA  NA
2       38  381 104
3       NA   NA  67
4       NA   NA  NA
5       NA  412 107
6       NA   NA  71
7       NA  287  94
8       NA   NA  NA
9       NA   NA  NA
10      NA   NA  64
11      NA  412  NA

Answer 1: (score: 10)

The str_extract function from the stringr package should be useful here:

library(stringr)

data.frame(
    speed=str_extract(hit_vector, "(\\d+)(?=\\s+mph)"),
    distance=str_extract(hit_vector, "(\\d+)(?=\\s+feet)"),
    degrees=str_extract(hit_vector, "(\\d+)(?=\\s+degrees)")
)

#    speed distance degrees
# 1   <NA>     <NA>    <NA>
# 2    104      381      38
# 3     67     <NA>    <NA>
# 4   <NA>     <NA>    <NA>
# 5    107      412    <NA>
# 6     71     <NA>    <NA>
# 7     94      287    <NA>
# 8   <NA>     <NA>    <NA>
# 9   <NA>     <NA>    <NA>
# 10    64     <NA>    <NA>

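\\d is the character class for digits, so \\d+ matches one or more digits. (?=) is the zero-width lookahead operator, so in this case it matches the digits only when they are followed by one or more whitespace characters (\\s+) and mph, feet, or degrees, without capturing those strings.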
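The effect of the lookahead can be seen on a single string. This sketch uses base R's regexpr and regmatches with perl = TRUE instead of stringr, purely so that it runs without extra packages. The names `x`, `m`, `m_no` and `distance` are illustrative:

```r
x <- "Batted ball speed <b>104 mph</b>; distance of <b>381 feet</b>."

# With the lookahead: only digits followed by whitespace and "feet" match,
# and " feet" itself is not part of the match
m <- regmatches(x, regexpr("\\d+(?=\\s+feet)", x, perl = TRUE))
m
# [1] "381"

# Without the lookahead the unit is consumed as part of the match
m_no <- regmatches(x, regexpr("\\d+\\s+feet", x, perl = TRUE))
m_no
# [1] "381 feet"

# The match is a character string; convert it if a numeric value is needed
distance <- as.numeric(m)
```

The same conversion applies to the data frame above, whose columns hold character values rather than numbers.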

Answer 2: (score: 2)

If you don't mind the red ink (the warnings printed in the console):

library(tidyverse)
tibble(x=hit_vector) %>%
  separate(x,c("speed","distance","degrees"),"</b>") %>%
  mutate_all(parse_number)

# # A tibble: 10 x 3
#    speed distance degrees
#    <dbl>    <dbl>   <dbl>
#  1    NA       NA      NA
#  2   104      381      38
#  3    67       NA      NA
#  4    NA       NA      NA
#  5   107      412      NA
#  6    71       NA      NA
#  7    94      287      NA
#  8    NA       NA      NA
#  9    NA       NA      NA
# 10    64       NA      NA

Answer 3: (score: 0)

Another base R approach (using regmatches):

# list of patterns
patterns <- c("(\\d+)(?=\\s*mph)", "(\\d+)(?=\\s*feet)", "(\\d+)(?=\\s*degrees)")

results <- lapply(patterns, function(pattern) {
  unlist(lapply(hit_vector, function(item) {
    result <- as.numeric(regmatches(item, regexpr(pattern, item, perl = TRUE)))
    if (identical(result, numeric(0))) return(NA)
    else return(result)
  }))
})

# build the dataframe from the list
df <- as.data.frame(do.call(cbind, results))
colnames(df) <- c("speed", "distance", "degrees")

Or (the other way around):

result <- lapply(hit_vector, function(string) {
  unlist(lapply(patterns, function(pattern) {
    result <- as.numeric(regmatches(string, regexpr(pattern, string, perl = TRUE)))
    if (identical(result, numeric(0))) return(NA)
    else return(result)
  }))
})

# each list element is one row, so bind by rows
df <- as.data.frame(do.call(rbind, result))
colnames(df) <- c("speed", "distance", "degrees")

Both produce:

   speed distance degrees
1     NA       NA      NA
2    104      381      38
3     67       NA      NA
4     NA       NA      NA
5    107      412      NA
6     71       NA      NA
7     94      287      NA
8     NA       NA      NA
9     NA       NA      NA
10    64       NA      NA