Question

tl;dr: How can I transform "10/30/2015 09:00:00" into "0900" using sub?

I have the following data frame df (dput at the end of the question):

str(df)
'data.frame':   75 obs. of  2 variables:
 $ V1: chr  "10/30/2015 09:00:00" "10/30/2015 09:01:00" "10/30/2015 09:02:00" "10/30/2015 09:03:00" ...
 $ V2: num  22443 22553 22578 22565 22574 ...

The column I am interested in is the first one, which holds strings in the following format:

df[1,1]
[1] "10/30/2015 09:00:00" # the date is always the same; only the time differs

The goal is to create a vector, or to replace df[,1], with this:

[1] "0900" "0901" "0902" "0903" "0904" ... "1000" "1001" # character format

or (preferably) with this:

[1]  900  901  902  903  904 1000 1001 # numeric format

in the fastest way possible.

What I have now is:

temp<-sapply(strsplit(df[,1],' '), "[", 2)
final<-paste0(substr(temp,1,2),substr(temp,4,5))

which returns:

final
[1] "0900" "0901" "0902"

But this solution is not very efficient. I looked at sub, which lets me do this:

temp2<-sub(".*\\s","",df[,1])
[1] "09:00:00" "09:01:00" "09:02:00"

Then I can use paste0(substr(temp2,1,2),substr(temp2,4,5))

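But I would like to know whether it is possible to create a pattern that lets me use sub and directly returns the expected output, without having to use the not-so-pretty paste0(substr()). I could not come up with something that appends the hours and minutes and removes the rest. I also tried strftime(as.POSIXct(df[,1],format="%m/%d/%Y %H:%M"), format="%H%M"), but it is slower than my first solution.

Here is the dput: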

Answer 1

If you are sure that all inputs are in the known format, you can use

sub("^\\S+\\s+(\\d+):(\\d+).*$","\\1\\2", s)

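The \\S+ subpattern matches 1 or more non-whitespace characters, and .* matches 0 or more characters other than a newline (greedily, but that does not matter here, since we match the rest of the line up to its end; I assume the input contains no newlines).

See the IDEONE demo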
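As a quick sanity check, here is a minimal sketch of that call on a few sample strings. The vector s below is made up for illustration (it only mimics the format of df$V1), and the as.numeric() step is added just to show the numeric form the question prefers.

```r
# Sample input in the question's format (a hypothetical stand-in for df$V1)
s <- c("10/30/2015 09:00:00", "10/30/2015 09:01:00", "10/30/2015 10:00:00")

# Match the whole string, keep only the two captured groups (hour, minute)
hhmm <- sub("^\\S+\\s+(\\d+):(\\d+).*$", "\\1\\2", s)
print(hhmm)              # "0900" "0901" "1000"

# Converting to numeric drops the leading zero
print(as.numeric(hhmm))  # 900 901 1000
```

Note that this loose pattern accepts anything that looks like "token, whitespace, digits:digits", so it is only appropriate when the input format is trusted.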

If you only need to handle strings that match the MM/dd/yyyy hh:mm:ss format (and leave all other strings unchanged), use

sub("^\\d+(?:/\\d+){2}\\s+(\\d+):(\\d+):\\d+$","\\1\\2", s)

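Explanation:

^ - start of string
\\d+ - 1 or more digits
(?:/\\d+){2} - 2 occurrences (due to the limiting quantifier {2}) of a slash followed by 1 or more digits
\\s+ - 1 or more whitespace characters
(\\d+) - (Group 1, which we will backreference with \\1) 1 or more digits
: - a literal colon
(\\d+) - (Group 2, which we will backreference with \\2) 1 or more digits
:\\d+ - a colon followed by 1 or more digits (not captured, since we do not need to keep them)
$ - end of string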

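See this IDEONE demo

Basically, the technique is to match the whole string, capture (using capturing groups (...)) what we need to keep, and use backreferences in the replacement pattern (like \\n, where n is the capture group index) to refer to the captured substrings.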
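To illustrate what "match the whole string" buys you with the stricter pattern, here is a small sketch. The input vector x and the NA handling via grepl()/ifelse() are my own additions for illustration, not part of the original problem: sub() returns non-matching strings untouched, so if you would rather get NA for malformed rows you have to test for a match yourself.

```r
# Strict pattern from above: digits/digits/digits, whitespace, hh:mm:ss
strict <- "^\\d+(?:/\\d+){2}\\s+(\\d+):(\\d+):\\d+$"

# Hypothetical input: one well-formed value and two that do not match
x <- c("10/30/2015 09:00:00", "2015-10-30 09:00:00", "not a date")

# Non-matching strings come back unchanged
res <- sub(strict, "\\1\\2", x)
print(res)

# Optional guard (an assumption about what you want): NA for non-matches
ok  <- grepl(strict, x)
out <- ifelse(ok, res, NA_character_)
print(out)
```

Whether silent pass-through or NA is the better behaviour depends on how much you trust the input; the guard costs a second regex pass.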

Answer 2

If the format is fixed, then we can use substr:

as.numeric(
  paste0(substr(df$V1, 12, 13),
         substr(df$V1, 15, 16)))
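The positions 12-13 and 15-16 only make sense if every string is exactly 19 characters wide, so a cheap width check up front may be worth it. Below is a rough sketch on a made-up vector v (a stand-in for df$V1); the integer arithmetic at the end is an alternative I am adding for illustration and was not benchmarked above.

```r
# Hypothetical stand-in for df$V1
v <- c("10/30/2015 09:00:00", "10/30/2015 09:01:00", "10/30/2015 10:00:00")

# "MM/dd/yyyy hh:mm:ss" is 19 characters: date = 1-10, space = 11,
# hour = 12-13, colon = 14, minute = 15-16
stopifnot(all(nchar(v) == 19))

hh <- substr(v, 12, 13)
mm <- substr(v, 15, 16)

# Same result as the snippet above
num <- as.numeric(paste0(hh, mm))
print(num)

# Alternative without paste0(): combine the two parts arithmetically
num_int <- as.integer(hh) * 100L + as.integer(mm)
print(num_int)
```

If the width check fails, the regex-based approach from the other answer is the safer choice.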

Benchmark:

library(microbenchmark)
microbenchmark(
  substr={
    as.numeric(
      paste0(substr(df$V1, 12, 13),
             substr(df$V1, 15, 16)))
  },
  sub={
    as.numeric(sub("^\\d+(?:/\\d+){2}\\s+(\\d+):(\\d+):\\d+$",
                   "\\1\\2",
                   df$V1))
  },
  strsplit={
    temp <- sapply(strsplit(df[,1],' '), "[", 2)
    as.numeric(paste0(substr(temp,1,2),substr(temp,4,5)))
  },
  times=1000)

Unit: microseconds
     expr     min      lq      mean  median      uq      max neval cld
   substr  46.786  50.711  61.08613  52.220  54.031 6657.496  1000 a  
      sub 127.078 132.813 139.43847 135.831 141.264  251.136  1000  b 
 strsplit 143.679 151.829 162.15411 157.866 166.016  331.426  1000   c
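Going by the medians, sub takes roughly 2.6 times as long as substr, and strsplit roughly 3 times as long. Before trusting a speed comparison it is worth confirming that the three variants actually return the same thing. Here is a sketch of such a check; v1 is synthetic data I generate to resemble the question's column (75 consecutive minutes starting at 09:00), not the original dput.

```r
# Synthetic stand-in for df$V1: 75 timestamps, one per minute from 09:00
m  <- 0:74
v1 <- sprintf("10/30/2015 %02d:%02d:00", 9L + m %/% 60L, m %% 60L)

# The three benchmarked variants
by_substr <- as.numeric(paste0(substr(v1, 12, 13), substr(v1, 15, 16)))

by_sub <- as.numeric(sub("^\\d+(?:/\\d+){2}\\s+(\\d+):(\\d+):\\d+$",
                         "\\1\\2", v1))

temp <- sapply(strsplit(v1, " "), "[", 2)
by_strsplit <- as.numeric(paste0(substr(temp, 1, 2), substr(temp, 4, 5)))

# All three should agree element by element
stopifnot(identical(by_substr, by_sub),
          identical(by_substr, by_strsplit))
print(head(by_substr))
```

Timings on 75 rows are in the microsecond range either way, so the ranking may shift on much larger vectors; rerun microbenchmark on your real data if speed matters.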

Append hours and minutes from a date - the most efficient way
