Question

I have this regular expression:

INVALID_NAMES = /\b(bib$|costumes$|httpanties?|necklace|cuff link|cufflink|scarf|pendant|apron|buckle|beanie|hat|ring|blanket|polo|earrings?|plush|pacifier|tie$|panties|boxers?|slippers?|pants?|leggings|ibattz|dress|bodysuits?|charm|battstation|tea|pocket ref|pajamas?|boyshorts?|mimopowertube|coat|bathrobe)\b/i

It works as written, but I would like to write it like this instead:

INVALID_NAMES = /\b(bib$|costumes$|httpanties?|necklace|cuff link|
                    cufflink|scarf|pendant|apron|buckle|beanie|hat|ring|blanket|
                    polo|earrings?|plush|pacifier|tie$|panties|boxers?|
                    slippers?|pants?|leggings|ibattz|dress|bodysuits?|charm|
                    battstation|tea|pocket ref|pajamas?|boyshorts?|
                    mimopowertube|coat|bathrobe)\b/i

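But with the second version, the following alternatives no longer match: cufflink, polo, slippers?, battstation and mimopowertube, because each of them now has whitespace in front of it, for example: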

(this space before the word)cufflink

I would really appreciate any help.

Answer 1

You can use something like this:

INVALID_NAMES = [
  "bib$",
  "costumes$",
  "httpanties?",
  "necklace"
]
INVALID_NAMES_REGEX = /\b(#{INVALID_NAMES.join '|'})\b/i
p INVALID_NAMES_REGEX
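
As a rough sketch of how this might look with more of the question's list, the array below is spread over several lines and followed by a few checks. The sample strings are made up for illustration. Note that the array elements are interpolated as regex fragments rather than escaped, so `?` and `$` keep their regex meaning.

```ruby
# A longer (still abbreviated) list taken from the question's pattern.
INVALID_NAMES = [
  'bib$', 'costumes$', 'httpanties?', 'necklace', 'cuff link',
  'cufflink', 'scarf', 'pendant', 'apron', 'buckle'
].freeze

# The strings are interpolated as regex fragments, not as escaped literals.
INVALID_NAMES_REGEX = /\b(#{INVALID_NAMES.join('|')})\b/i

p INVALID_NAMES_REGEX.match?('Gold Cufflink Set')  # => true
p INVALID_NAMES_REGEX.match?('baby bib')           # => true
p INVALID_NAMES_REGEX.match?('bib overalls')       # => false ("bib$" needs end of line)
```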

Answer 2

Build the regular expression with the space-insensitive flag

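You can use the space-insensitive flag (/x) to ignore whitespace and comments inside a regular expression. Note that once this flag is enabled, you need to use \s or another explicit character to match whitespace, because the /x flag causes literal whitespace in the pattern to be ignored.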

Consider the following example:

INVALID_NAMES =
    /\b(bib$          |
        costumes$     |
        httpanties?   |
        necklace      |
        cuff\slink    |
        cufflink      |
        scarf         |
        pendant       |
        apron         |
        buckle        |
        beanie        |
        hat           |
        ring          |
        blanket       |
        polo          |
        earrings?     |
        plush         |
        pacifier      |
        tie$          |
        panties       |
        boxers?       |
        slippers?     |
        pants?        |
        leggings      |
        ibattz        |
        dress         |
        bodysuits?    |
        charm         |
        battstation   |
        tea           |
        pocket\sref   |
        pajamas?      |
        boyshorts?    |
        mimopowertube |
        coat          |
        bathrobe
    )\b/ix

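Note that you could format this in many other ways, but putting one alternative per line makes the sub-expressions easier to sort and edit. If you prefer several alternatives per line, you can certainly do that.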
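
As a sketch of that alternative layout, the pattern below puts several alternatives on each line and adds `#` comments, which /x also ignores. The constant name SHORT_INVALID_NAMES and the grouping comments are hypothetical, and the list is abbreviated. It uses `[ ]` instead of `\s` on the assumption that only a literal space should be accepted between "cuff" and "link".

```ruby
# Hypothetical, abbreviated version of the pattern, laid out with /x.
SHORT_INVALID_NAMES = /
  \b(
    necklace | pendant | earrings? | charm |   # jewelry
    cuff[ ]link | cufflink | tie$ |            # accessories
    pajamas? | bathrobe | slippers?            # sleepwear
  )\b
/ix

p SHORT_INVALID_NAMES.match?('Silver Cufflink')  # => true
p SHORT_INVALID_NAMES.match?('cuff link set')    # => true
p SHORT_INVALID_NAMES.match?("cuff\tlink")       # => false ([ ] matches a space only)
p SHORT_INVALID_NAMES.match?('tie clip')         # => false ("tie$" needs end of line)
```

Keep `/` out of the comments in a pattern like this, since an unescaped slash would end the regex literal.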

Make sure it works

You can see that the expression above works as expected with the following examples:

'cufflink'.match INVALID_NAMES
#=> #<MatchData "cufflink" 1:"cufflink">

'cuff link'.match INVALID_NAMES
#=> #<MatchData "cuff link" 1:"cuff link">

Answer 3

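When you put a line break in the middle of a regular expression literal, the newline becomes part of the regular expression. Look at this example: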

"ab" =~ /ab/ # => 0

"ab" =~ /a
b/ # => nil

"a\nb" =~ /a
b/ # => 0
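
The same applies to the indentation that follows the line break, which is what happens in the question. A minimal sketch with two of the question's alternatives (the variable name `pattern` is only for illustration):

```ruby
# Two alternatives split across lines without /x or a trailing backslash.
# The newline and the indentation become part of the second alternative.
pattern = /\b(scarf|
              polo)\b/i

p pattern.source.include?("\n")  # => true
p pattern.match?('scarf')        # => true
p pattern.match?('polo')         # => false
```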

You can suppress the newline by adding a backslash at the end of the line:

"ab" =~ /a\
b/ # => 0

Applied to your regular expression (with the leading spaces also removed):

INVALID_NAMES = /\b(bib$|costumes$|httpanties?|necklace|cuff link|\
cufflink|scarf|pendant|apron|buckle|beanie|hat|ring|blanket|\
polo|earrings?|plush|pacifier|tie$|panties|boxers?|\
slippers?|pants?|leggings|ibattz|dress|bodysuits?|charm|\
battstation|tea|pocket ref|pajamas?|boyshorts?|\
mimopowertube|coat|bathrobe)\b/i

Answer 4

You can do this:

INVALID_NAMES = ['necklace', 'cuff link', 'cufflink', 'scarf', 'tie?', 'bib$']
r = Regexp.union(INVALID_NAMES.map { |n| /\b#{n}\b/i })

str = 'cat \n  cufflink bib cuff link. tie Scarf\n cow necklace? \n  ti. bib'
str.scan(r)
  #=> ["cufflink", "cuff link", "tie", "Scarf", "necklace", "ti", "bib"]
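
One reason to map each name to a Regexp first, shown here as a small sketch with made-up sample input: when Regexp.union is given plain strings it escapes regex metacharacters, so a name like 'tie?' would only match a literal question mark.

```ruby
names = ['cuff link', 'tie?']

# Given plain strings, Regexp.union escapes metacharacters, so "?" is literal.
literal = Regexp.union(names)
p literal.match?('tie')   # => false
p literal.match?('tie?')  # => true

# Given Regexp objects, each pattern is kept as written, flags included.
patterns = Regexp.union(names.map { |n| /\b#{n}\b/i })
p patterns.match?('TIE')  # => true
```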

Answer 5

Your pattern is inefficient and will make the Regexp engine thrash badly.

I suggest you look into what Perl's Regexp::Assemble can do to help your Ruby code:

"How do I ignore file types in a web crawler?"
"Is there an efficient way to perform hundreds of text substitutions in Ruby?"
