Question

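I'm working with the NOAA Severe Weather data, which includes a variable EVTYPE (event type) describing the type of weather event. Its values include many synonyms that I would like to collect under a few broader names. For example, TORNADO as well as ROTATING WALL CLOUD, FUNNEL CLOUD and WHIRLWIND describe relatively similar events in some sense. Without going into the subtleties of meteorology, I would like to merge these nearly synonymous values under one name.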

So, let's say I have the dataset loaded in the data frame noaa_clean and apply this to it:

tornado <- sapply(as.character(noaa_clean$EVTYPE), 
                   function(x){grepl("^.*TORNAD.*$", x) |
                               grepl("^.*SPOUT.*$", x) |
                               grepl("^.*WHIRL.*$", x) |
                               grepl("^.*FUNNEL.*$", x) |
                               grepl("^.*ROTATING WALL CLOUD.*$", x) |
                               grepl("^.*DUST DEVIL.*$", x)})
noaa_clean[tornado, "EVCAT"] <- "TORNADO"; rm(tornado)

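It works fine, but I have several of these and they take some time (about 5-10 minutes) to run. My question is: is there a better way to use grepl() or regular expressions to make this more efficient?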

Answer 1

Since you asked specifically about speed, here is a test of the various solutions posted in the comments or as answers:

#Initialize vector
x <- sample(c("TORNA", "SPOUT", "WHIRL", "FUNNEL", "ROTATING WALL CLOUD", "DUST DEVIL",
                LETTERS[1:8]), 1e6, replace = TRUE)

#Using separate grepl's
multi_grepl <- function(x) {grepl("TORNAD", x) |grepl("SPOUT", x) |grepl("WHIRL", x) |grepl("FUNNEL", x) | grepl("ROTATING WALL CLOUD", x) |grepl("DUST DEVIL", x)}

#One grepl
one_grepl <- function(x) grepl("TORNAD|SPOUT|WHIRL|FUNNEL|ROTATING WALL CLOUD|DUST DEVIL", x)

#Using stri_detect_regex
detect_regex <- function(x) stringi::stri_detect_regex(x, "TORNAD|SPOUT|WHIRL|FUNNEL|ROTATING WALL CLOUD|DUST DEVIL")

#Original solution with sapply
orig_sapply <- function(x) sapply(x, function(y){grepl("^.*TORNAD.*$", y) |grepl("^.*SPOUT.*$", y) |grepl("^.*WHIRL.*$", y) |grepl("^.*FUNNEL.*$", y) |grepl("^.*ROTATING WALL CLOUD.*$", y) |grepl("^.*DUST DEVIL.*$", y)})

#Using stri_detect_fixed
stri_fixed <- function(x) {
  stringi::stri_detect_fixed(x, pattern = "TORNAD") |
    stringi::stri_detect_fixed(x, pattern = "SPOUT") |
    stringi::stri_detect_fixed(x, pattern = "WHIRL") |
    stringi::stri_detect_fixed(x, pattern = "FUNNEL") |
    stringi::stri_detect_fixed(x, pattern = "ROTATING WALL CLOUD") |
    stringi::stri_detect_fixed(x, pattern = "DUST DEVIL")
}


#Checking that all these give same answer
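#identical() only compares its first two arguments, so check each result against a reference
#unname() drops the names that sapply attaches to its result
results <- list(one_grepl(x), detect_regex(x), unname(orig_sapply(x)), stri_fixed(x))
all(sapply(results, identical, multi_grepl(x)))
#[1] TRUE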

microbenchmark::microbenchmark(multi_grepl(x),
                               one_grepl(x),
                               detect_regex(x),
                               orig_sapply(x),
                               stri_fixed(x), times = 20L)

#Unit: milliseconds
#            expr        min         lq       mean     median         uq        max neval
#  multi_grepl(x)   724.6716   738.5227   754.2347   747.1441   769.2897   819.9971    20
#    one_grepl(x)   406.7987   410.3197   420.0083   412.1168   426.5932   453.2471    20
# detect_regex(x)   167.4844   170.0834   174.1256   172.7410   177.1546   187.3211    20
#  orig_sapply(x) 47172.3407 47379.8250 47666.7177 47546.2221 47875.9352 48517.2228    20
#   stri_fixed(x)   261.4303   265.9189   270.5816   268.6038   273.2486   288.7071    20

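It seems stri_detect_regex is the fastest. Interestingly, this is a change from the last iteration I tried, when I had the ^.* and .*$ in the regex. Thanks to @Gregor for pointing that out. Note that the original sapply is very slow because it runs the grepl search many times (once per element) rather than just once over the whole vector.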
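As a minimal sketch of the two points above, the following uses a small made-up vector `ev` (an assumption for illustration, not the NOAA data) to show that the `^.*` and `.*$` wrappers do not change what grepl returns, and that one vectorized call gives the same values as a per-element sapply.

```r
# Small hypothetical sample of event types (not taken from the NOAA data)
ev <- c("TORNADO", "WATERSPOUT", "HAIL", "FUNNEL CLOUD", "FLOOD")

# grepl() looks for a match anywhere in each string, so wrapping the
# pattern in "^.*" and ".*$" does not change the result
with_wrappers    <- grepl("^.*SPOUT.*$", ev)
without_wrappers <- grepl("SPOUT", ev)
identical(with_wrappers, without_wrappers)

# One vectorized call over the whole vector returns the same values as
# calling grepl() once per element through sapply()
per_element <- sapply(ev, function(y) grepl("SPOUT", y), USE.NAMES = FALSE)
identical(per_element, without_wrappers)
```

Both comparisons should come back TRUE; the difference between the approaches is only in how much work is done, not in the answer.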

Finally, the results for longer individual strings:

#sample(100:200, 1) draws a single random length for each string
prefixes <- replicate(1e6, paste0(sample(LETTERS, sample(100:200, 1), replace = TRUE), collapse = ""))
suffixes <- replicate(1e6, paste0(sample(LETTERS, sample(200:300, 1), replace = TRUE), collapse = ""))
x_long <- paste0(prefixes, x, suffixes)

microbenchmark::microbenchmark(multi_grepl(x_long),
                               one_grepl(x_long),
                               detect_regex(x_long),
                               stri_fixed(x_long), times = 20L)

#Unit: seconds
#                 expr       min        lq      mean    median        uq       max neval
#  multi_grepl(x_long) 27.654274 27.721042 28.194273 27.962656 28.626697 29.909105    20
#    one_grepl(x_long) 11.478831 11.510868 11.775088 11.583650 11.663479 14.318680    20
# detect_regex(x_long)  8.673534  8.729508  8.808797  8.774432  8.878907  9.028005    20
#   stri_fixed(x_long)  4.502196  4.540850  4.609050  4.591879  4.690035  4.750445    20
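Literal substring matching is also available in base R through the `fixed = TRUE` argument of grepl. This variant was not part of the timings above, so no speed claim is made for it. The sketch below only shows, on a made-up vector `ev`, that it returns the same matches as the single alternation regex when the search terms contain no regex metacharacters.

```r
# Hypothetical example values, not taken from the NOAA data
ev <- c("TORNADO F1", "DUST DEVIL", "HAIL", "WALL CLOUD/FUNNEL CLOUD")

# The search terms are plain substrings with no regex metacharacters
terms <- c("TORNAD", "SPOUT", "WHIRL", "FUNNEL", "ROTATING WALL CLOUD", "DUST DEVIL")

# One literal-substring pass per term, OR-ed together with Reduce()
fixed_hits <- Reduce(`|`, lapply(terms, function(p) grepl(p, ev, fixed = TRUE)))

# Single alternation regex built from the same terms
regex_hits <- grepl(paste(terms, collapse = "|"), ev)

identical(fixed_hits, regex_hits)
```

Which of the two is quicker on a given dataset would need to be measured, for example by adding it to the microbenchmark call.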

Answer 2

A regular expression can itself use | as an OR match. You can do

tornado  <- grepl("(TORNAD|SPOUT|WHIRL|FUNNEL|ROTATING WALL CLOUD|DUST DEVIL)", as.character(noaa_clean$EVTYPE))

Also note that we don't need sapply() here, since grepl is already a vectorized function in R.
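Since the question mentions having several such groupings, one possible way to extend this is a named vector of patterns with one vectorized grepl call per category. This is only a sketch: the small `noaa_clean` data frame is a stand-in, and the FLOOD category and the first-match-wins rule are assumptions for illustration, not part of the question.

```r
# Hypothetical stand-in for the real noaa_clean data frame
noaa_clean <- data.frame(
  EVTYPE = c("TORNADO", "WATERSPOUT", "HAIL", "FUNNEL CLOUD", "FLASH FLOOD"),
  stringsAsFactors = FALSE
)

# Category name -> alternation pattern. Only TORNADO comes from the question;
# the FLOOD entry is an illustrative assumption
patterns <- c(
  TORNADO = "TORNAD|SPOUT|WHIRL|FUNNEL|ROTATING WALL CLOUD|DUST DEVIL",
  FLOOD   = "FLOOD"
)

noaa_clean$EVCAT <- NA_character_
for (category in names(patterns)) {
  # One vectorized grepl() call per category, no sapply() needed
  hits <- grepl(patterns[[category]], as.character(noaa_clean$EVTYPE))
  # Only fill rows that no earlier category has already claimed
  noaa_clean$EVCAT[hits & is.na(noaa_clean$EVCAT)] <- category
}
noaa_clean
```

Rows that match no pattern keep NA in EVCAT, which makes it easy to see which event types still need a category.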

R - Is it possible to optimize or simplify multiple calls to grepl()?
