Question

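I want to extract numeric information from a character vector in R. Every element of the vector follows the same structure, as shown below: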

  [1] "Capturing tweets..."                                                                    
  [2] "Connection to Twitter stream was closed after 1 seconds with up to 1 tweets downloaded."
  [3] "Capturing tweets..."                                                                    
  [4] "Connection to Twitter stream was closed after 1 seconds with up to 1 tweets downloaded."
  [5] "Capturing tweets..."                                                                    
  [6] "Connection to Twitter stream was closed after 1 seconds with up to 1 tweets downloaded."
  [7] "Capturing tweets..."                                                                    
  [8] "Connection to Twitter stream was closed after 1 seconds with up to 1 tweets downloaded."
  [9] "Capturing tweets..."

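As you can see, two kinds of numeric information recur in this vector. One gives how long the connection was open, i.e. the number followed by "seconds"; the other gives the number of tweets downloaded. I only need the number of tweets, so I want to produce a new numeric vector containing only the number that is followed by "tweets" on each line.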

Answer 1

Your regex should be:

as.numeric(sub(".*?(\\d+) tweets.*","\\1",x))

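The .* after tweets is required: it matches everything that follows "tweets", so that text is removed along with the rest of the line and only the captured number remains.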
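
As a small sketch of that point, the two calls below compare the pattern with and without the trailing .*. The variable names are illustrative, and perl = TRUE is my own addition to pin down the regex engine; the answer's code uses R's default engine.

```r
s <- "Connection to Twitter stream was closed after 1 seconds with up to 1 tweets downloaded."

# Without the trailing .*, the text after "tweets" is left in the result
without_tail <- sub(".*?(\\d+) tweets", "\\1", s, perl = TRUE)

# With the trailing .*, the whole line is replaced by the captured digits
with_tail <- sub(".*?(\\d+) tweets.*", "\\1", s, perl = TRUE)

print(without_tail)  # "1 downloaded."
print(with_tail)     # "1"
```

Passing the first result to as.numeric() would give NA, because "1 downloaded." is not a number.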

x <- c("Capturing tweets...", "Connection to Twitter stream was closed after 1 seconds with up to 1 tweets downloaded.")
as.numeric(sub(".*?(\\d+) tweets.*","\\1",grep("\\d+ tweets", x, value=TRUE)))
# [1] 1
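
An alternative that avoids the separate grep step, offered only as a sketch: regexpr with a lookahead finds the digits directly, and regmatches drops the lines that have no match. The lookahead syntax needs perl = TRUE. The fourth element of x below (12 seconds, 345 tweets) is made-up data, added to show a multi-digit count.

```r
x <- c("Capturing tweets...",
       "Connection to Twitter stream was closed after 1 seconds with up to 1 tweets downloaded.",
       "Capturing tweets...",
       "Connection to Twitter stream was closed after 12 seconds with up to 345 tweets downloaded.")

# \\d+(?= tweets) matches digits only when " tweets" follows them;
# regexpr returns -1 for lines without a match, and regmatches skips those
m <- regexpr("\\d+(?= tweets)", x, perl = TRUE)
tweet_counts <- as.numeric(regmatches(x, m))

print(tweet_counts)  # 1 345
```

The result has one element per matching line, so it is shorter than x. If the positions need to line up with x, applying sub to the unfiltered vector and accepting NA for the non-matching lines is the simpler route.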

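Why did I use the lazy .*? rather than the greedy .* here?

Because .* is greedy: it first matches every character up to the end of the string and then backtracks to find a match. It backtracks (walking backwards through the string) only as far as the last digit before " tweets". At that point \\d+ (one or more digits) is already satisfied by that single digit, so a match is found and backtracking stops. The greedy .* therefore keeps any earlier digits of the number for itself: for a count such as 123, only the 3 would be captured. The lazy .*? matches as few characters as possible, so \\d+ captures the whole number.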
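
The difference only shows up with a multi-digit count, so here is a minimal sketch using a made-up line with 345 tweets. Both the sample string and perl = TRUE are assumptions of mine, the latter chosen so that the backtracking behaviour is that of the PCRE engine.

```r
s <- "Connection to Twitter stream was closed after 12 seconds with up to 345 tweets downloaded."

# Greedy .* gives back only as much as needed, leaving one digit for \\d+
greedy <- sub(".*(\\d+) tweets.*", "\\1", s, perl = TRUE)

# Lazy .*? stops as early as possible, so \\d+ gets the whole number
lazy <- sub(".*?(\\d+) tweets.*", "\\1", s, perl = TRUE)

print(greedy)  # "5"
print(lazy)    # "345"
```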
