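I want to extract numeric information from a character vector in R. The elements of the vector follow the same repeating pattern, as shown here: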
[1] "Capturing tweets..."
[2] "Connection to Twitter stream was closed after 1 seconds with up to 1 tweets downloaded."
[3] "Capturing tweets..."
[4] "Connection to Twitter stream was closed after 1 seconds with up to 1 tweets downloaded."
[5] "Capturing tweets..."
[6] "Connection to Twitter stream was closed after 1 seconds with up to 1 tweets downloaded."
[7] "Capturing tweets..."
[8] "Connection to Twitter stream was closed after 1 seconds with up to 1 tweets downloaded."
[9] "Capturing tweets..."
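As you can see, two kinds of numeric information recur in this vector. One gives how long the connection was open, i.e. the number followed by "seconds"; the other gives the number of tweets downloaded. I only need the tweet counts, so I would like to build a new numeric vector containing only the number that is followed by "tweets" on each line.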
Answer 0 (score: 4)
Your regex should be:
as.numeric(sub(".*?(\\d+) tweets.*","\\1",x))
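The .* after tweets is needed so that every character following tweets is matched and removed as well; sub() replaces only the part of the string that the pattern matches. Lines that contain no "<digits> tweets" are left unchanged by sub(), so filter them out with grep() first: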
x <- c("Capturing tweets...", "Connection to Twitter stream was closed after 1 seconds with up to 1 tweets downloaded.")
as.numeric(sub(".*?(\\d+) tweets.*","\\1",grep("\\d+ tweets", x, value=TRUE)))
# [1] 1
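As a rough sketch of how this applies to a longer vector shaped like the one in the question: the input below is hypothetical (the second "Connection" line with 30 seconds and 125 tweets is made up for illustration), and perl = TRUE is an assumption added here so that the lazy quantifier is handled by PCRE.

```r
# Hypothetical input mirroring the structure shown in the question
x <- c(
  "Capturing tweets...",
  "Connection to Twitter stream was closed after 1 seconds with up to 1 tweets downloaded.",
  "Capturing tweets...",
  "Connection to Twitter stream was closed after 30 seconds with up to 125 tweets downloaded."
)

# Keep only the lines that actually contain "<digits> tweets"
hits <- grep("\\d+ tweets", x, value = TRUE, perl = TRUE)

# Replace each whole line with the captured number, then convert to numeric
tweet_counts <- as.numeric(sub(".*?(\\d+) tweets.*", "\\1", hits, perl = TRUE))

# Without the grep() step, non-matching lines pass through sub() unchanged,
# so as.numeric() turns them into NA (and emits a coercion warning)
unfiltered <- suppressWarnings(
  as.numeric(sub(".*?(\\d+) tweets.*", "\\1", x, perl = TRUE))
)
```

Here tweet_counts holds one number per "Connection" line, while unfiltered keeps the original length with NA in the "Capturing tweets..." positions, which may be preferable if the positions matter.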
Why did I use .*? instead of .* ?
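Because .* is greedy: it first matches every character up to the end of the string, then backtracks (walks backwards) to find a match. It backs up only as far as the digit immediately before tweets and stops there, because a single digit is already enough to satisfy \\d+ (one or more digits). It does not back up any further, so when the count has several digits the group captures only the last one. The lazy .*? instead grows from the left one character at a time, so \\d+ is free to capture the whole number.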
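The difference only becomes visible when the count has more than one digit. A minimal sketch, using a made-up line with 125 tweets (not from the original data) and perl = TRUE as an added assumption so both quantifiers follow PCRE backtracking semantics:

```r
# Hypothetical line with a multi-digit tweet count
line <- "Connection to Twitter stream was closed after 30 seconds with up to 125 tweets downloaded."

# Greedy .* swallows as much as it can, leaving only the last digit for \\d+
greedy <- sub(".*(\\d+) tweets.*", "\\1", line, perl = TRUE)

# Lazy .*? takes as little as it can, so \\d+ captures the full number
lazy <- sub(".*?(\\d+) tweets.*", "\\1", line, perl = TRUE)

print(greedy)  # "5"
print(lazy)    # "125"
```

With the single-digit counts in the question both patterns return the same result, which is why the problem is easy to miss.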