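I want to extract the HBA1C values. These values follow the pattern "HBA1C = " in the text variable X2 of the data frame df. The pattern can appear at the start of the string, as in rows 2, 3 and 6, or in the middle of the string, as in row 4.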
df <- data.frame(X1 = 1:6, X2 = c(NA, "HBA1C = 8.9 (09/06/15)", "HBA1C = 9.8 (03/08/15)",
  "JUN 2014, WAS ON LANTUS AND APIDARA HBA1C = 6.2 (21/7/15), NEHR LOCKED. 18/8/15",
  "SLIDING SCALE FOLLOWED STRICTLY",
  "HBA1C = 11.7 (17/7/15)"))
# df
# X1 X2
#1 1 <NA>
#2 2 HBA1C = 8.9 (09/06/15)
#3 3 HBA1C = 9.8 (03/08/15)
#4 4 JUN 2014, WAS ON LANTUS AND APIDARA HBA1C = 6.2 (21/7/15), NEHR LOCKED. 18/8/15
#5 5 SLIDING SCALE FOLLOWED STRICTLY
#6 6 HBA1C = 11.7 (17/7/15)
The extracted values should be saved in a new variable X3, as shown below:
# df
# X1 X2 X3
#1 1 <NA> NA
#2 2 HBA1C = 8.9 (09/06/15) 8.9
#3 3 HBA1C = 9.8 (03/08/15) 9.8
#4 4 JUN 2014, WAS ON LANTUS AND APIDARA HBA1C = 6.2 (21/7/15), NEHR LOCKED. 18/8/15 6.2
#5 5 SLIDING SCALE FOLLOWED STRICTLY NA
#6 6 HBA1C = 11.7 (17/7/15) 11.7
I tried the following code, but it does not work.
library(stringr)
df1$X3 <-
str_extract(str_extract(df$X2,pattern = "HBA1C = [0-9].[0-9]"),pattern = "[0-9].[0-9]")
I got this error:
Error in df$X2 : object of type 'closure' is not subsettable
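This message usually means that `df` did not refer to a data frame when the line ran. A minimal sketch of one likely cause, assuming a session in which the data frame was never created (or was created under another name), so that `df` resolves to the base R function `stats::df` (the F-distribution density):

```r
# Assumption: no data frame named `df` exists, so `df` is the function stats::df.
# A function is a "closure", and `$` cannot subset a closure.
res <- tryCatch(stats::df$X2, error = function(e) e)

# TRUE when subsetting the function raised an error
is_error <- inherits(res, "error")
print(is_error)

# The text of the error raised by R
print(conditionMessage(res))
```

Running the `data.frame()` call first, so that `df` is a data frame, avoids this particular error.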
Answer 0 (score: 4)
We can use a single str_extract with a regex lookaround (here, a lookbehind):
df$X3 <- as.numeric(str_extract(df$X2,pattern = "(?<=HBA1C \\= )[0-9]+\\.[0-9]+"))
df$X3
#[1] NA 8.9 9.8 6.2 NA 11.7
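The pattern matches one or more digits ([0-9]+), followed by a literal ., followed by one or more digits, all of which must come directly after the word 'HBA1C' followed by a space, an = and another space. Because that prefix sits inside the lookbehind (?<=...), it is required but is not part of the extracted text.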
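To illustrate what the lookbehind contributes, here is a small sketch that compares the same pattern with and without it. The vector `x` and the names `with_prefix`, `value_only` and `result` are made up for this illustration.

```r
library(stringr)

# Hypothetical sample input, shortened from the question's data
x <- c(NA,
       "HBA1C = 8.9 (09/06/15)",
       "WAS ON LANTUS AND APIDARA HBA1C = 6.2 (21/7/15)",
       "SLIDING SCALE FOLLOWED STRICTLY")

# Without the lookbehind, the prefix is part of the match
with_prefix <- str_extract(x, "HBA1C = [0-9]+\\.[0-9]+")

# With the lookbehind, the prefix must be present but is not returned
value_only <- str_extract(x, "(?<=HBA1C = )[0-9]+\\.[0-9]+")

# Strings without a match (and NA inputs) stay NA after conversion
result <- as.numeric(value_only)
print(with_prefix)
print(result)
```

With the lookbehind, the value can be converted with as.numeric directly, so the nested second str_extract from the question is not needed.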
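Note: some characters are metacharacters, i.e. the regex engine gives them a special meaning. For example, an unescaped . means any character rather than a literal dot. In these cases we have to either escape the character (\\. in an R string) or place it inside square brackets ([.]).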
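A short sketch of the difference, using made-up test strings. It also suggests why the question's pattern, with a single [0-9] and an unescaped ., would not have matched a two-digit value such as 11.7 even once the data frame issue was fixed.

```r
library(stringr)

# Unescaped dot: matches any character, so "8x9" is accepted
any_char <- str_extract("8x9", "[0-9].[0-9]")

# Escaped dot: only a literal "." is accepted, so "8x9" gives NA
escaped <- str_extract("8x9", "[0-9]\\.[0-9]")

# Dot inside square brackets: also a literal "."
bracketed <- str_extract("8.9", "[0-9][.][0-9]")

# The question's pattern on a two-digit value: after "HBA1C = " it reads
# digit "1", any character "1", then needs a digit but finds ".", so no match
question_pattern <- str_extract("HBA1C = 11.7 (17/7/15)", "HBA1C = [0-9].[0-9]")

# Allowing one or more digits and a literal dot handles this case
fixed_pattern <- str_extract("HBA1C = 11.7 (17/7/15)", "(?<=HBA1C = )[0-9]+\\.[0-9]+")
print(c(any_char, escaped, bracketed, question_pattern, fixed_pattern))
```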