R - Extract numeric data from a character vector based on a recurring string that follows the number

Time: 2015-10-01 12:02:37

Tags: regex r twitter vector

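I want to extract numeric information from a character vector in R. The lines of the vector all follow the same repeating structure, as shown below: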

  [1] "Capturing tweets..."                                                                    
  [2] "Connection to Twitter stream was closed after 1 seconds with up to 1 tweets downloaded."
  [3] "Capturing tweets..."                                                                    
  [4] "Connection to Twitter stream was closed after 1 seconds with up to 1 tweets downloaded."
  [5] "Capturing tweets..."                                                                    
  [6] "Connection to Twitter stream was closed after 1 seconds with up to 1 tweets downloaded."
  [7] "Capturing tweets..."                                                                    
  [8] "Connection to Twitter stream was closed after 1 seconds with up to 1 tweets downloaded."
  [9] "Capturing tweets..." 

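As you can see, two kinds of numeric information recur in this vector. One gives how long the connection was open, i.e. the number followed by "seconds"; the other gives the number of tweets downloaded. I only need the number of tweets, so I want to produce a new numeric vector containing only the numbers that are followed by "tweets" in each line.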

1 answer:

Answer 0: (score: 4)

Your regex should be:

as.numeric(sub(".*?(\\d+) tweets.*","\\1",x))
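The .* after tweets is essential: it consumes every character that follows "tweets", so that sub replaces the whole string with just the captured number. Lines that contain no match (such as "Capturing tweets...") are returned unchanged by sub and turn into NA under as.numeric, so it helps to keep only the matching lines with grep first: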

x <- c("Capturing tweets...", "Connection to Twitter stream was closed after 1 seconds with up to 1 tweets downloaded.")
as.numeric(sub(".*?(\\d+) tweets.*","\\1",grep("\\d+ tweets", x, value=TRUE)))
# [1] 1
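
As a side note, here is a minimal sketch of an alternative that skips the grep step. It is not the approach from the answer above: it assumes the Perl-compatible engine (perl = TRUE) so that a lookahead can be used, and the variable names raw, m and counts are made up for illustration. It also shows what the plain sub call returns for lines that do not match.

```r
x <- c("Capturing tweets...",
       "Connection to Twitter stream was closed after 1 seconds with up to 1 tweets downloaded.")

# Without filtering, a non-matching line passes through sub unchanged,
# and as.numeric turns it into NA (with a coercion warning, suppressed here).
raw <- suppressWarnings(as.numeric(sub(".*?(\\d+) tweets.*", "\\1", x)))
print(raw)

# Alternative: match only the digits that are directly followed by " tweets".
# regexpr returns -1 for lines without a match, and regmatches drops those lines.
m <- regexpr("\\d+(?= tweets)", x, perl = TRUE)
counts <- as.numeric(regmatches(x, m))
print(counts)
```

Because regmatches silently drops the non-matching lines, the result is shorter than x. If the positions need to stay aligned with the input, the NA-producing form may be the better fit.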

Why I used .*? rather than .*

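Because .* is greedy: it first matches every character up to the end of the string, and then backtracks (walks backwards) until the rest of the pattern can match. It stops backtracking as soon as a single digit is available in front of " tweets", because \\d+ needs only one digit to succeed. It does not go back any further to hand the remaining digits to \\d+, so for a multi-digit count the group would capture only the last digit. The lazy .*? matches as few characters as possible instead, which leaves the whole number for \\d+ to capture.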
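
The difference only shows up once the count has more than one digit. The following sketch uses a made-up log line (the values 30 and 125 are hypothetical and do not come from the question's data) and sets perl = TRUE, which is my assumption here so that the backtracking behaviour described above applies; the answer itself uses the default engine.

```r
# Hypothetical line with a multi-digit tweet count.
y <- "Connection to Twitter stream was closed after 30 seconds with up to 125 tweets downloaded."

# Lazy prefix: .*? gives up characters as early as possible,
# so \\d+ captures the complete number in front of " tweets".
lazy <- sub(".*?(\\d+) tweets.*", "\\1", y, perl = TRUE)

# Greedy prefix: .* keeps as many characters as it can, including the
# leading digits, and leaves only the final digit for \\d+.
greedy <- sub(".*(\\d+) tweets.*", "\\1", y, perl = TRUE)

print(lazy)    # "125"
print(greedy)  # "5"
```

With the single-digit counts in the question both patterns give the same result, which is why the greedy version can look correct on the sample data.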