Question

I have this list of URLs, but for some reason the regex I'm using doesn't eliminate the last two URLs in the list.

"https://www.homedepot.com/p/Champion-Power-Equipment-7500-Watt-Gasoline-Powered-Electric-Start-Portable-Generator-and-25-ft-Extension-Cord-100219/206268632"
"https://www.homedepot.com/p/Reliance-Controls-40-ft-30-Amp-Generator-Power-Cord-PC3040/202216500"
"https://www.homedepot.com/p/Champion-Power-Equipment-25-ft-120-Volt-Generator-Power-Cord-48034/203501795"

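I want to eliminate any URL that contains "cord" and does not contain "and", so in the end I want the expression to return only the first URL. My full list also has other URLs that do not contain "cord" and that I want to keep, so I can't simply eliminate everything that lacks both "cord" and "and".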

.[!grepl("(?!.*and)(?=.*[Cc]ord)", ., perl = T)]

This is what I have been trying, but it still returns all three URLs.

Any help would be great. Thanks!

Answer 1

There may be a better single regular expression, but here is a two-part solution.

First, identify all the URLs that contain "cord":

a <- c("https://www.homedepot.com/p/Champion-Power-Equipment-7500-Watt-Gasoline-Powered-Electric-Start-Portable-Generator-and-25-ft-Extension-Cord-100219/206268632",
       "https://www.homedepot.com/p/Reliance-Controls-40-ft-30-Amp-Generator-Power-Cord-PC3040/202216500",                                                                           
       "https://www.homedepot.com/p/Champion-Power-Equipment-25-ft-120-Volt-Generator-Power-Cord-48034/203501795")

library(stringr)

str_detect(a, regex('cord', ignore_case = T))
[1] TRUE TRUE TRUE

Then, identify all the URLs that contain "and":

str_detect(a, regex('and', ignore_case = T))
[1]  TRUE FALSE FALSE

Then we subset your URL vector with the desired combination; in this case, no "cord" without "and":

a[str_detect(a, regex('cord', ignore_case = T)) &
    str_detect(a, regex('and', ignore_case = T))]
[1] "https://www.homedepot.com/p/Champion-Power-Equipment-7500-Watt-Gasoline-Powered-Electric-Start-Portable-Generator-and-25-ft-Extension-Cord-100219/206268632"
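
If the full list also has URLs without "cord" that should be kept, as the question mentions, the condition can be loosened to "no cord, or cord together with and". Below is a minimal sketch of that variant. It assumes base R grepl() in place of stringr so that it runs without extra packages, and the last URL in `urls` is made up for illustration.

```r
# URLs from the question, plus one made-up URL without "cord" (hypothetical).
urls <- c(
  "https://www.homedepot.com/p/Champion-Power-Equipment-7500-Watt-Gasoline-Powered-Electric-Start-Portable-Generator-and-25-ft-Extension-Cord-100219/206268632",
  "https://www.homedepot.com/p/Reliance-Controls-40-ft-30-Amp-Generator-Power-Cord-PC3040/202216500",
  "https://www.homedepot.com/p/Champion-Power-Equipment-25-ft-120-Volt-Generator-Power-Cord-48034/203501795",
  "https://www.homedepot.com/p/Hypothetical-Portable-Generator-0000/000000000"
)

has_cord <- grepl("cord", urls, ignore.case = TRUE)
has_and  <- grepl("and",  urls, ignore.case = TRUE)

# Drop only the URLs that have "cord" but no "and"; keep everything else.
kept <- urls[!has_cord | has_and]
print(kept)
```

Keep in mind that a plain "and" pattern also matches inside longer words such as "brand", so real data may need a stricter pattern like "-and-".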

Answer 2

What you are looking for is a "cord" with no "and" on either side of it, since the "and" might come before or after the "cord".

You can do this:

^((?!and).)*Cord((?!and).)*$

Or, in R:

a[!grepl("^((?!and).)*Cord((?!and).)*$",a,ignore.case = T,perl=T)]
[1] "https://www.homedepot.com/p/Champion-Power-Equipment-7500-Watt-Gasoline-Powered-Electric-Start-Portable-Generator-and-25-ft-Extension-Cord-100219/206268632"

To make it faster, you can choose not to capture the group and then use:

grep("^((?!and).)*Cord((?!and).)*$",a,ignore.case = T,perl=T,invert = T,value = T)
[1] "https://www.homedepot.com/p/Champion-Power-Equipment-7500-Watt-Gasoline-Powered-Electric-Start-Portable-Generator-and-25-ft-Extension-Cord-100219/206268632"

^(?:(?!and).)*Cord(?:(?!and).)*$
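
To see that a pattern of this kind rejects an "and" on either side of "cord", here is a small sketch on made-up strings (the vector `x` is hypothetical, not from the question). It uses the non-capturing form closed with `*$`, and it also tries the lookaheads from the question anchored with `^`, which is only an assumption about what was intended there. On this toy input both patterns flag the same element.

```r
# Made-up test strings (hypothetical): "and" after, "and" before, "cord" only, neither.
x <- c("x-cord-and-y", "x-and-cord", "x-cord", "x-hose")

# Tempered pattern: no "and" anywhere before or after "cord".
tempered <- "^(?:(?!and).)*cord(?:(?!and).)*$"

# Lookaheads from the question, anchored at the start so they test the whole string.
anchored <- "^(?!.*and)(?=.*cord)"

hit_tempered <- grepl(tempered, x, ignore.case = TRUE, perl = TRUE)
hit_anchored <- grepl(anchored, x, ignore.case = TRUE, perl = TRUE)

# Elements to drop are the ones with "cord" and no "and"; keep the rest.
kept <- x[!hit_tempered]
print(kept)
print(identical(hit_tempered, hit_anchored))
```

Without the `^` anchor, a lookahead pair can succeed at any position in the string, including one just past the "and", which is one possible reason the expression in the question does not behave as intended.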

Answer 3

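I like to write regexes the way I think. I have split the problem into two cases.

Case 1: you have some characters, then "and", then maybe some characters, then "cord", then maybe some characters.
Case 2: you have some characters, then "cord", then maybe some characters, then "and", then maybe some characters.

Put each of these cases in parentheses and place an "or" between them: ()|()

Add some case-insensitivity to the regex for "and" and "cord". I only made the first character case-insensitive, but you can do more if needed.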

str_detect(a, "(.+[Cc]ord.*[Aa]nd.*)|(.+[Aa]nd.*[Cc]ord.*)")

An example and explanation of how it works: https://regexr.com/3snfq

This all assumes that you have valid links, so that something like "http://" always comes before the "and" or "cord".

Answer 4

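I don't think a complicated regex is the way to go here: it is more expensive in resources, less readable, and does not generalize to more constraints.

Here is a base R solution, along with an update of Onyambu's benchmark (I also added my corrected version of Felipe's solution):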

a[grepl("and",a) & grepl("cord",a,TRUE)]
# [1] "https://www.homedepot.com/p/Champion-Power-Equipment-7500-Watt-Gasoline-Powered-Electric-Start-Portable-Generator-and-25-ft-Extension-Cord-100219/206268632"

a = rep(a,1000)

microbenchmark::microbenchmark(
  # ad=str_detect(a, "(.+[Cc]ord.*[Aa]nd.*)|(.+[Aa]nd.*[Cc]ord.*)"), # too slow
  o1 = a[!grepl("^((?!and).)*Cord((?!and).)",a,ignore.case = T,perl=T)],
  o2 = a[!grepl("^(?:(?!and).)*Cord(?:(?!and).)",a,ignore.case = T,perl=T)],
  fe = a[!str_detect(a, regex('cord', ignore_case = T)) |
           str_detect(a, regex('and', ignore_case = T))],
  mm = a[grepl("and",a,perl = T) | !grepl("[Cc]ord",a,TRUE, perl=T)],
  unit = "relative"
)
# Unit: relative
#  expr      min       lq     mean   median       uq      max neval
#    o1 5.789945 5.735891 5.591111 5.767107 5.620815 5.863636   100
#    o2 5.338966 5.302216 5.187022 5.318472 5.210676 5.422635   100
#    fe 2.609088 2.777571 2.753290 2.838782 2.815973 2.664935   100
#    mm 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100
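
Before comparing timings it can help to confirm that the candidates select the same elements. The following is a minimal base R sketch, not part of the original benchmark. It assumes the three URLs from the question and two made-up strings (`extra` is hypothetical), and it wraps the `o1` and `mm` expressions from the benchmark above as functions.

```r
a <- c(
  "https://www.homedepot.com/p/Champion-Power-Equipment-7500-Watt-Gasoline-Powered-Electric-Start-Portable-Generator-and-25-ft-Extension-Cord-100219/206268632",
  "https://www.homedepot.com/p/Reliance-Controls-40-ft-30-Amp-Generator-Power-Cord-PC3040/202216500",
  "https://www.homedepot.com/p/Champion-Power-Equipment-25-ft-120-Volt-Generator-Power-Cord-48034/203501795"
)

# Two of the benchmarked expressions, wrapped as functions for comparison.
o1 <- function(v) {
  v[!grepl("^((?!and).)*Cord((?!and).)", v, ignore.case = TRUE, perl = TRUE)]
}
mm <- function(v) {
  v[grepl("and", v, perl = TRUE) | !grepl("[Cc]ord", v, TRUE, perl = TRUE)]
}

# On the question's data both expressions keep only the first URL.
same_on_question <- identical(o1(a), mm(a))

# Made-up input where "and" comes after "cord" (hypothetical).
extra <- c("x-cord-and-y", "x-cord-z")
same_on_extra <- identical(o1(extra), mm(extra))

print(same_on_question)
print(same_on_extra)
```

On the made-up strings the two disagree, because the `o1` pattern as benchmarked only rules out an "and" before "Cord". Whether that matters depends on the real data.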

Regex for strings that contain one expression and exclude another