How can I use stringr to extract a number from a string based on the pattern that precedes it?

Asked: 2017-05-05 03:38:41

Tags: r dplyr stringr

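I want to extract the HBA1C values. These values appear after the pattern "HBA1C = " in the text variable X2 of the data frame df. The pattern can occur at the start of the string, as in rows 2, 3 and 6, or in the middle of it, as in row 4.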

df<-data.frame(X1=1:6,X2=c(NA,"HBA1C = 8.9 (09/06/15)","HBA1C = 9.8 (03/08/15)",
                           "JUN 2014, WAS ON LANTUS AND APIDARA HBA1C = 6.2 (21/7/15), 
                           NEHR LOCKED. 18/8/15","SLIDING SCALE FOLLOWED STRICTLY",
                           "HBA1C = 11.7 (17/7/15)"))

# df
#  X1                                                                              X2
#1  1                                                                            <NA>
#2  2                                                          HBA1C = 8.9 (09/06/15)
#3  3                                                          HBA1C = 9.8 (03/08/15)
#4  4 JUN 2014, WAS ON LANTUS AND APIDARA HBA1C = 6.2 (21/7/15), NEHR LOCKED. 18/8/15
#5  5                                                 SLIDING SCALE FOLLOWED STRICTLY
#6  6                                                          HBA1C = 11.7 (17/7/15)

The extracted values should be stored in a new variable X3, like this:

# df
#  X1                                                                              X2   X3
#1  1                                                                            <NA>   NA
#2  2                                                          HBA1C = 8.9 (09/06/15)  8.9
#3  3                                                          HBA1C = 9.8 (03/08/15)  9.8
#4  4 JUN 2014, WAS ON LANTUS AND APIDARA HBA1C = 6.2 (21/7/15), NEHR LOCKED. 18/8/15  6.2
#5  5                                                 SLIDING SCALE FOLLOWED STRICTLY   NA
#6  6                                                          HBA1C = 11.7 (17/7/15) 11.7

I tried the following code, but it does not work.

library(stringr)
df1$X3 <- 
str_extract(str_extract(df$X2,pattern = "HBA1C = [0-9].[0-9]"),pattern = "[0-9].[0-9]")

I got this error:

  

Error in df$X2 : object of type 'closure' is not subsettable

1 Answer:

Answer 0: (score: 4)

We can use a single str_extract with a regex lookaround (here, a lookbehind):

df$X3 <- as.numeric(str_extract(df$X2,pattern = "(?<=HBA1C \\= )[0-9]+\\.[0-9]+"))
df$X3
#[1]   NA  8.9  9.8  6.2   NA 11.7

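The pattern matches one or more digits ([0-9]+), followed by a literal ., followed by one or more digits, but only where that number comes directly after the word 'HBA1C' followed by a space, an = and another space. The lookbehind (?<=...) checks for this prefix without including it in the match.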
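As a small variation on the pattern described above, here is a sketch that also accepts whole-number values. The input vector `x` is hypothetical, and the integer entry "HBA1C = 7" is an assumption that does not occur in the question's data. The only change is that the decimal part is wrapped in an optional group.

```r
library(stringr)

# Hypothetical inputs; the integer-valued third entry is an assumption,
# it is not part of the data in the question.
x <- c(NA,
       "HBA1C = 8.9 (09/06/15)",
       "WAS ON LANTUS HBA1C = 7 (21/7/15), NEHR LOCKED",
       "HBA1C = 11.7 (17/7/15)",
       "SLIDING SCALE FOLLOWED STRICTLY")

# The lookbehind requires "HBA1C = " immediately before the number;
# (\\.[0-9]+)? makes the decimal part optional.
pattern <- "(?<=HBA1C = )[0-9]+(\\.[0-9]+)?"

hba1c <- as.numeric(str_extract(x, pattern))
hba1c
```

Rows without the prefix, and missing rows, come back as NA, which is the same behaviour as the answer's pattern.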

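Note: some characters are metacharacters, i.e. the regex engine gives them a special meaning. For example, . means any character rather than a literal dot. In these cases we have to either escape the character (\\) or put it inside square brackets ([.]).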
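To illustrate the note, here is a minimal sketch comparing the unescaped dot with the two literal forms. The test strings and the variable names are made up for the example. The last line shows one consequence for the pattern tried in the question: with a single [0-9] before the dot, a two-digit value such as 11.7 is not matched.

```r
library(stringr)

loose   <- "[0-9].[0-9]"     # unescaped: . matches any character
strict  <- "[0-9]\\.[0-9]"   # escaped: . matches only a literal dot
bracket <- "[0-9][.][0-9]"   # character class: also a literal dot

s <- c("8.9", "1x2")         # hypothetical test strings

loose_hits   <- str_detect(s, loose)    # TRUE  TRUE  ("1x2" matches too)
strict_hits  <- str_detect(s, strict)   # TRUE  FALSE
bracket_hits <- str_detect(s, bracket)  # TRUE  FALSE

# Pattern from the question applied to a two-digit value:
# [0-9] takes "1", . takes the second "1", and the final [0-9] then meets "."
asker_result <- str_extract("HBA1C = 11.7 (17/7/15)", "HBA1C = [0-9].[0-9]")
asker_result                            # NA
```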