Question

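Does anyone know why the named group ref_id in regex1 captures "Some address: loststreet 4" in the match below?

I would expect it to be just loststreet 4, and I don't understand why it isn't. The code below is from an IRB session.

I have also considered the string's encoding: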

str1 = <<eos
Burp
FirstName: Al Bundy
Ref person:
Some address: loststreet 4
Some other address: loststreet 4
Zip code:
eos
# => "Burp\nFirstName: Al Bundy\nRef person:\nSome address: loststreet 4\nSome other address: loststreet 4\nZip code:\n" 

regex1 = /FirstName:\s?(?<name>[^\n]*).*Ref person:\s?(?<ref_id>[^\n]*).*Some other address: (?<other>[^\n]*)/mi
# => /FirstName:\s?(?<name>[^\n]*).*Ref person:\s?(?<ref_id>[^\n]*).*Some other address: (?<other>[^\n]*)/mi 

str1.match(regex1)
# => #<MatchData "FirstName: Al Bundy\nRef person:\nSome address: loststreet 4\nSome other address: loststreet 4" name:"Al Bundy" ref_id:"Some address: loststreet 4" other:"loststreet 4"> 

str1.encoding
# => #<Encoding:UTF-8> 

regex1 = /FirstName:\s?(?<name>[^\n]*).*Ref person:\s?(?<ref_id>[^\n]*).*Some other address: (?<other>[^\n]*)/miu
# => /FirstName:\s?(?<name>[^\n]*).*Ref person:\s?(?<ref_id>[^\n]*).*Some other address: (?<other>[^\n]*)/mi 

str1.match(regex1)
# => #<MatchData "FirstName: Al Bundy\nRef person:\nSome address: loststreet 4\nSome other address: loststreet 4" name:"Al Bundy" ref_id:"Some address: loststreet 4" other:"loststreet 4">

Answer 1

Use MatchData#[] to get the string for a specific group:

str1 = <<eos
Burp
FirstName: Al Bundy
Ref person:
Some address: loststreet 4
Some other address: loststreet 4
Zip code:
eos

regex1 = /FirstName:\s?(?<name>[^\n]*).*Ref person:\s?(?<ref_id>[^\n]*).*Some other address: (?<other>[^\n]*)/mi
matched = str1.match(regex1)

matched['name'] # => "Al Bundy"
matched['other'] # => "loststreet 4"

Answer 2

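Because you wrote an optional \s? in the regex (after "Ref person:"), and \s can match the newline \n when the field is empty. Once the newline has been consumed, [^\n]* goes on to capture the whole next line. Replace it with [^\S\n]? (you have to do the same for every \s? that must not match a newline).

(Note that after each field you use .* to get to the next one. Replace it with the lazy .*? to avoid excessive backtracking.)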
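
To illustrate, here is a minimal sketch of the question's pattern with both changes applied. The variable names fixed and m are only for illustration, and the literal space after "Some other address:" is also swapped for [^\S\n]?, which is my own assumption about what is wanted there:

```ruby
# Same input as in the question.
str1 = <<eos
Burp
FirstName: Al Bundy
Ref person:
Some address: loststreet 4
Some other address: loststreet 4
Zip code:
eos

# [^\S\n]? matches at most one whitespace character that is not a newline,
# so an empty field stays empty instead of swallowing the next line.
# .*? is the lazy form of .* mentioned in the note above.
fixed = /FirstName:[^\S\n]?(?<name>[^\n]*).*?Ref person:[^\S\n]?(?<ref_id>[^\n]*).*?Some other address:[^\S\n]?(?<other>[^\n]*)/mi

m = str1.match(fixed)
p m[:name]   # => "Al Bundy"
p m[:ref_id] # => ""
p m[:other]  # => "loststreet 4"
```

With this pattern ref_id comes back as an empty string, because nothing follows "Ref person:" on its own line.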

Answer 3

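One of the goals of writing code is to make it maintainable. Making it maintainable includes making it easy to read and understand for those who follow you in working on that code.

Regular expressions are often a maintenance nightmare and, in my experience, their complexity can usually be reduced, or they can be replaced entirely, to produce code that is just as useful. Parsing this sort of text is a good example of when not to use a complex pattern.

I'd do it this way: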

str1 = <<eos
Burp
FirstName: Al Bundy
Ref person:
Some address: loststreet 4
Some other address: loststreet 4
Zip code:
eos

def get_value(s)
  _, value = s.split(':')
  value.strip if value
end

rows = str1.split("\n")
firstname          = get_value(rows[1]) # => "Al Bundy"
ref_person         = get_value(rows[2]) # => nil
some_address       = get_value(rows[3]) # => "loststreet 4"
some_other_address = get_value(rows[4]) # => "loststreet 4"
zip_code           = get_value(rows[5]) # => nil

Split the text into lines, then pick out the data you need.

This can be reduced to something more succinct using map:

firstname, ref_person, some_address, some_other_address, zip_code = rows[1..-1].map{ |s| get_value(s) }
firstname          # => "Al Bundy"
ref_person         # => nil
some_address       # => "loststreet 4"
some_other_address # => "loststreet 4"
zip_code           # => nil

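If you must have a regex, and only one regex, then simplify it and isolate its task. While it's possible to write a regex that spans multiple lines, skipping and capturing text along the way, it is painful to get right, it becomes more and more fragile as it grows, and it is likely to break if the incoming text changes. By reducing its complexity you are more likely to avoid that fragility and make your code more robust: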

def get_value(s)
  s[/^([^:]+):(.*)/]
  name, value = $1, $2
  value.strip! if value

  [name.downcase.tr(' ', '_'), value]
end

data_hash = Hash[
  str1.split("\n").select{ |s| s[':'] }.map{ |s| get_value(s) }
]
data_hash # => {"firstname"=>"Al Bundy", "ref_person"=>"", "some_address"=>"loststreet 4", "some_other_address"=>"loststreet 4", "zip_code"=>""}

Answer 4

It looks like your regex is missing some parts. Try:

regex1 = /FirstName:\s?(?<name>[^\n]*).*Ref person:\s?(?<ref_id>[^\n]*).*Some address:\s?(?<address>[^\n]*).*Some other address:\s?(?<other>[^\n]*)/mi

Using extended mode makes this easier to read:

regex1 = %r{
  FirstName:\s?(?<name>[^\n]*).*
  Ref\ person:\s?(?<ref_id>[^\n]*).*
  Some\ address:\s?(?<address>[^\n]*).*
  Some\ other\ address:\s?(?<other>[^\n]*)
}xmi

Make sure to escape literal spaces, since extended mode ignores unescaped whitespace in the pattern.
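
A small sketch of why the escaping matters in extended mode (the names unescaped and escaped are only for illustration):

```ruby
# In extended (x) mode, unescaped whitespace in the pattern is ignored.
unescaped = /Ref person:/x   # behaves like /Refperson:/
escaped   = /Ref\ person:/x  # the backslash keeps the space literal

p unescaped.match?("Ref person:") # => false
p escaped.match?("Ref person:")   # => true
p unescaped.match?("Refperson:")  # => true
```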

Why does this regex with named groups capture the wrong text?