Question

I have a large string:

string = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec nec neque..."
puts string.size # => 54555999

I also have a large regex:

regex = /Lorem|ipsum|dolor|sit|amet|consectetur|adipiscing|elit|Donec|nec|neque|facilisis|nulla|rhoncus|accumsan|non|in|arcu|Interdum|et|malesuada|fames|ac|ante|primis|faucibus|Pellentesque|habitant|morbi|tristique|senectus|netus|turpis|egestas|at|ut|metus|convallis|fringilla|Nullam|volutpat|risus|sodales|elementum|Fusce|vitae|dignissim|tortor|Vivamus|interdum|dapibus|leo|sed|Quisque|luctus|dui|ligula|consequat|augue|congue|a|Vestibulum|id|cursus|odio|Maecenas|libero|diam|placerat|Proin|sapien|gravida|Cras|eleifend|nisl|rutrum|lectus|Curabitur|auctor|urna|tellus|tincidunt|erat|eget|vulputate|nibh|tempor|Ut|vehicula|nisi|velit|suscipit|nunc|tempus|vestibulum|viverra|Duis|sagittis|dictum|justo|hendrerit|massa|mollis|ultricies|lorem|imperdiet|mattis|pharetra|Aenean|lacus|purus|condimentum|Integer|sem|ullamcorper|feugiat|venenatis|quis|pellentesque|felis|finibus|porta|Nam|pulvinar|est|Morbi|ex|eros|commodo|Praesent|mauris|scelerisque|enim|aliquet|Etiam|Mauris|eu|bibendum|efficitur|magna|maximus|ornare|Phasellus|vel|blandit|sollicitudin|Suspendisse|Sed|quam|pretium|mi|semper|molestie|In|Nulla|Aliquam|euismod|orci|varius|hac|habitasse|platea|dictumst|iaculis|ultrices|Nunc|aliquam|fermentum|lacinia|lobortis|porttitor|laoreet|posuere|cubilia|Curae|facilisi|potenti|Cum|sociis|natoque|penatibus|magnis|dis|parturient|montes|nascetur|ridiculus|mus|Class|aptent|taciti|sociosqu|ad|litora|torquent|per|conubia|nostra|inceptos|himenaeos/

I now want to scan the string with the regex:

puts Benchmark.measure { string.scan(regex) } # => 17.830000   0.380000  18.210000 ( 18.235952)

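As you can see, the scan takes a long 17.83 seconds, which is too slow for my use case.

How can I speed this up?

The match and =~ methods are no use to me, because I need an array of the matches. I think I may need to step outside Ruby to speed this up, but I don't know how.

The real-world regex is a long list of order IDs, and the real-world string is many emails (subject and body) merged into one string. The point of scanning the emails for mentions of order IDs is to find out which order IDs customers are asking about.

Edit

The reason I want this preliminary regex test is that I need to run a very large loop afterwards: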

["order_id_1", "order_id_2"].each do |order_id|
  ["email body 1", "email body 2"].delete_if do |email_string|
    match = (email_string =~ Regexp.new(order_id)) != nil
    if match
      # do stuff
    end
    match
  end
end

With many order IDs and many emails this becomes a very large loop, unless I first work out which order IDs are mentioned at all.

Answer 1

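Suppose order_ids is an array of order IDs, which I assume consist of letters and/or digits, and that case is not an issue. If email is a string containing a single email,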

email.scan(/\w+/) & order_ids

will give you all the order IDs contained in that email. You can speed this up, however, by first converting order_ids to a set:

require 'set'
order_id_set = order_ids.to_set

email.scan(/\w+/).select { |w| order_id_set.include?(w) }

In any case, there is no benefit to stringing the emails together.
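To make the "one email at a time" idea concrete, here is a minimal sketch that keeps the emails separate and records which order IDs each one mentions. The sample emails and the ids_by_email name are made up for illustration and are not from the question.

```ruby
require 'set'

# Hypothetical sample data; real IDs and emails would come from your own system.
order_ids = %w[AB123 CD456 EF789]
emails = [
  "Where is my order AB123?",
  "No order mentioned here.",
  "Orders CD456 and AB123 arrived damaged."
]

order_id_set = order_ids.to_set

# Map each email's index to the order IDs it mentions, scanning one email at a time.
ids_by_email = emails.each_with_index.map do |email, index|
  [index, email.scan(/\w+/).select { |word| order_id_set.include?(word) }]
end.to_h
```

Working per email also tells you which email each ID came from, which a single merged string cannot do.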

I suggest splitting the email into words with:

email.scan(/\w+/)

but you may want something more refined. Suppose the email is as follows:

email = "I am really annoyed that my order AB123 was shipped late, " +
  "that order CD456was poorly packed and I was overcharged for order EF789."

Let's look at two possible ways of splitting it into words:

esplit = email.split
  #=> ["I", "am", "really", "annoyed", "that", "my", "order", "AB123",
  #    "was", "shipped", "late,", "that", "order", "CD456was", "poorly",
  #    "packed", "and", "I", "was", "overcharged", "for", "order", "EF789."]

escan = email.scan(/\w+/)
  #=> ["I", "am", "really", "annoyed", "that", "my", "order", "AB123",
  #    "was", "shipped", "late", "that", "order", "CD456was", "poorly",
  #    "packed", "and", "I", "was", "overcharged", "for", "order", "EF789"]

You can see that trouble is brewing with both approaches (for example, "CD456was"). Now let's create a set of order IDs:

require 'set'
order_id_set = %w{ AB123 CD456 EF789 GH012 }.to_set
  #=> #<Set: {"AB123", "CD456", "EF789", "GH012"}>

Then search the email for order IDs:

esplit.select { |w| order_id_set.include?(w) }
  #=> ["AB123"]
escan.select  { |w| order_id_set.include?(w) }
  #=> ["AB123", "EF789"]

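As you can see, split (the same as split(' ') and split(/\s+/)), which splits the line on whitespace, captured the first order ID but missed the other two: "CD456" because the writer failed to put a space between that ID and the word "was", and "EF789" because it was looking at "EF789.".

scan(/\w+/), which splits out "words" (\w matches a-z, A-Z, 0-9 and _), did better, identifying "EF789" as an order ID because the period is a "non-word" character. It too missed "CD456", because "w", "a" and "s" are word characters, so it included them. (This is similar to your "1234" example.)

This means you need to craft a better regex. For example, if all order IDs look like the ones in my example (two capital letters followed by three digits), you could write:

email.scan(/(?<![A-Z])[A-Z]{2}\d{3}(?!\d)/).
      select { |w| order_id_set.include?(w) }
  #=> ["AB123", "CD456", "EF789"]

This matches two capital letters that are not immediately preceded by a capital letter, followed by three digits that are not immediately followed by another digit. It is only an illustration of what is possible. You may wish to post another question that deals solely with using a regex to identify possible order numbers in an email. For that you would have to describe more precisely what order numbers look like.

In any case, your regex would not deal with the problem you mentioned. The only way to do that would be to check every order ID against every email with a separate regex, which would be very inefficient. Instead, you want to use a regex to identify possible order numbers and then check them against the set of order numbers. In choosing that regex you may have to trade off the number of order IDs you miss against the time wasted checking strings that are not order IDs.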
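The trade-off can be seen by running the same lookup with a loose and a strict candidate pattern. This is only a sketch: order_ids_in is a hypothetical helper name, and the strict pattern assumes IDs of two capital letters plus three digits, as in the example above.

```ruby
require 'set'

# Hypothetical helper: pull candidate tokens out of an email with `pattern`
# (which must not contain capture groups), then keep only real order IDs.
def order_ids_in(email, order_id_set, pattern)
  email.scan(pattern).select { |candidate| order_id_set.include?(candidate) }.uniq
end

order_id_set = %w[AB123 CD456 EF789 GH012].to_set
email = "I am really annoyed that my order AB123 was shipped late, " \
        "that order CD456was poorly packed and I was overcharged for order EF789."

loose  = /\w+/                            # checks every word, misses "CD456was"
strict = /(?<![A-Z])[A-Z]{2}\d{3}(?!\d)/  # assumes two capitals plus three digits

loose_hits  = order_ids_in(email, order_id_set, loose)
strict_hits = order_ids_in(email, order_id_set, strict)
```

Here the loose pattern sends every word to the set lookup and still misses one ID, while the strict pattern produces fewer candidates but only works if all IDs really follow the assumed format.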

Answer 2

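Don't use a regex. Instead, use a specialized algorithm suited to the specific strings and patterns you want to match. If you absolutely need a regex, you could try to find a different engine from the one Ruby itself uses (though I doubt that would be very fruitful). If you still need better performance, write the algorithm in a faster, lower-level language such as C, then call it from Ruby using a native extension or something similar.

Here are some resources to help with native extensions:

As for crafting a specialized algorithm, I probably can't help you.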
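As one possible shape for such a specialized algorithm (this is not something the answer itself proposes), the sketch below slides a fixed-width window over the text and looks each slice up in a Set, once per distinct ID length. The find_order_ids name is hypothetical, and whether this beats the regex on real data would need benchmarking.

```ruby
require 'set'

# Hypothetical helper: find every order ID that occurs anywhere in `text`,
# even when it is glued to other characters, without building a regex.
def find_order_ids(text, order_ids)
  id_set  = order_ids.to_set
  lengths = order_ids.map(&:length).uniq
  found   = Set.new

  lengths.each do |len|
    # Slide a window of `len` characters over the text and look each slice up.
    0.upto(text.length - len) do |start|
      slice = text[start, len]
      found << slice if id_set.include?(slice)
    end
  end

  found.to_a
end

ids = find_order_ids("order CD456was late, AB123 too", %w[AB123 CD456 EF789])
```

The work grows with the text length times the number of distinct ID lengths, not with the number of IDs. Because it matches substrings, an ID such as "1234" would also be reported inside "12345".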

Answer 3

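Here is an algorithm that runs in O(n^2) time. That performs far better than the regex, which may be much worse than O(n^2).

The intuition behind this algorithm is that working with strings and regexes is computationally expensive! Regexes are powerful tools when run on small strings or run infrequently, but they perform poorly when both the regex and the string are complex. In your particular case, though, you are not interested in whether a string is formatted in a particular way. Instead, you want to find out which emails match which order IDs. We can do that directly, without using a regex.

According to one of your comments, the data you receive is not originally a long string and a long regex, but two large lists. That is good, because lists make the algorithm easier to write. Since you want to know which order IDs match an email, we will design the algorithm to produce an array containing every order ID that matches an email.

Here is the algorithm:

Create an array of matched order IDs. Initially it is empty.
For each string in the array of emails,
1. For each order ID in the array of order IDs, check whether that order ID occurs in the email.
2. If we find a match, add the order ID to the array of matched order IDs.
The array of matched order IDs now contains every order ID that matched an email.

If there are n emails and m order IDs, this algorithm takes O(nm) time. If n and m are roughly equal, the algorithm runs in O(n^2) time.

Here is some Ruby code that implements this algorithm: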

emails = ['email1', 'email2', 'email3'] # etc.
order_ids = ['order1', 'order2', 'order3'] # etc.

# Create an array to hold the matched order IDs
matched_order_ids = []

# Perform a search for each email
emails.each do |email|
  order_ids.delete_if do |order_id|
    if email.include?(order_id)
      matched_order_ids.push(order_id)

      # a true block value tells delete_if to remove the order_id,
      # saving us time searching in future emails
      true
    else
      # a false block value tells delete_if to keep the order_id
      false
    end
  end
end

# At this point, matched_order_ids contains every
# order ID that was matched in at least one of the
# emails
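Note that delete_if empties out the caller's order_ids array as matches are found. If that is unwanted, a variant along these lines works on a copy instead. It is a sketch only: the matched_order_ids method name and the sample data are invented for illustration.

```ruby
# Hypothetical wrapper around the same idea: it works on a copy, so the
# caller's order_ids array is left untouched.
def matched_order_ids(emails, order_ids)
  remaining = order_ids.dup
  matched = []

  emails.each do |email|
    # partition splits the remaining IDs into those found in this email and the rest.
    hits, remaining = remaining.partition { |order_id| email.include?(order_id) }
    matched.concat(hits)
    break if remaining.empty? # nothing left to look for
  end

  matched
end

emails = ["Please check order2", "order1 and order2 are late", "hello"]
order_ids = %w[order1 order2 order3]
result = matched_order_ids(emails, order_ids)
```

As in the original loop, each order ID is reported once, in the order it was first seen, and the loop stops early once every ID has been matched.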

How can I speed up scanning a large string with a large regex?
