Question

Ruby 1.9.1, OSX 10.5.8

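I'm trying to write a simple app that parses a bunch of Java-based HTML template files and replaces each period (.) with an underscore if it is contained within a specific tag. I use Ruby all the time for this kind of utility app, and I thought it would be no problem to whip something up using Ruby's regex support. So I create a Regexp.new... object, open a file, read it line by line, and match each line against the pattern. If I get a match, I create a new string with replaceString = currentMatch.gsub(/\./, '_'), then build a regex for the whole matched string with newReplaceRegex = Regexp.escape(currentMatch), and finally substitute back into the current line with line.gsub(newReplaceRegex, replaceString). Code below, of course, but first...

The problem I'm having is that when I access the indexes of the returned MatchData object, I get the first result twice, and the second substring it should be finding is missing. Stranger still, when I test the same pattern and the same test text on rubular.com, it works as expected. See results here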

My pattern:

(<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+\.)+(?:[a-zA-Z0-9]+)(?:>))

Test text:

<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>moreNonMatchingText<WEBOBJECT NAME=admin.SecondLineMatch>AndEvenMoreNonMatchingText

Here is the relevant code:

tagRegex = Regexp.new('(<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+\.)+(?:[a-zA-Z0-9]+)(?:>))+')

testFile = File.open('RegexTestingCompFix.txt', "r+")
lineCount = 0
testFile.each {|htmlLine|
  lineCount += 1
  puts("Current line: #{htmlLine} at line num: #{lineCount}")
  tagMatch = tagRegex.match(htmlLine)
  if (tagMatch)
    matchesArray = tagMatch.to_a
    firstMatch = matchesArray[0]
    secondMatch = matchesArray[1]
    puts "First match: #{firstMatch} and second match #{secondMatch}"
    tagMatch.captures.each {|lineMatchCapture|
      puts "Current capture for tagMatches: #{lineMatchCapture} of total match count #{matchesArray.size}"
      #create a new regex using the match results; make sure to use auto escape method
      originalPatternString = Regexp.escape(lineMatchCapture)
      replacementRegex = Regexp.new(originalPatternString)
      #replace any periods with underscores in a copy of lineMatchCapture
      periodToUnderscoreCorrection = lineMatchCapture.gsub(/\./, '_')
      #replace original match with underscore replaced copy within line
      htmlLine.gsub!(replacementRegex, periodToUnderscoreCorrection)
      puts "The modified htmlLine is now: #{htmlLine}"
    }
  end
}

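I thought I should get the first tag in matchData[0] and the second tag in matchData[1]. What I actually do, since I don't know how many matches any given line will produce, is matchData.to_a.each. In this case matchData has two captures, but both of them are the first tag match,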

which is: <WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>

So, what am I doing wrong, and why does the Rubular test give me the expected results?

Answer 1

You want to use String#scan instead of Regexp#match:

tag_regex = /<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+\.)+(?:[a-zA-Z0-9]+)(?:>)/

lines = "<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>moreNonMatchingText\
     <WEBOBJECT NAME=admin.SecondLineMatch>AndEvenMoreNonMatchingText"

lines.scan(tag_regex)
# => ["<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>", "<WEBOBJECT NAME=admin.SecondLineMatch>"]
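The shape of the value returned by scan depends on whether the pattern contains capture groups. The following is a minimal sketch of that difference; the shortened test line and the variable names are assumptions made for illustration, not part of the original post.

```ruby
# Shortened version of the test text (an assumption for illustration).
line = "<WEBOBJECT NAME=admin.normalMode>text<WEBOBJECT NAME=admin.SecondLineMatch>more"

# No capture groups: scan returns a flat array of matched strings.
no_groups = /<webobject name=(?:\w+\.)+\w+>/i
flat = line.scan(no_groups)

# One capture group: scan returns one inner array per match,
# and each inner array holds that match's captures.
one_group = /(<webobject name=(?:\w+\.)+\w+>)/i
nested = line.scan(one_group)

p flat
p nested
p nested.flatten == flat
```

With a single capturing group wrapped around the whole pattern, calling flatten on the nested result gives the same list as the group-free pattern.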

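A few suggestions for your next Ruby question:

Newlines and spaces are your friends; you don't lose points for using more lines in your code ;-)
Using do-end for blocks instead of {} improves readability
Declare variables in snake case (hello_world) rather than camel case (helloWorld)

Hope this helps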

Answer 2

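I ended up using the String#scan approach. The only tricky part was figuring out that it returns an array of arrays rather than a MatchData object, so my initial confusion came down to my Ruby green-ness, but it works as expected now. I also trimmed the regex as Trevoke suggested. But snake case? Never... ;-) Anyway, here it is: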

tagRegex = /(<(?:webobject) (?:name)=(?:\w+\.)+(?:\w+)(?:>))/i  
testFile = File.open('RegexTestingCompFix.txt', "r+")  
lineCount=0  
testFile.each do |htmlLine|  
  lineCount += 1  
  puts ("Current line: #{htmlLine} at line num: #{lineCount}")  
    oldMatches = htmlLine.scan(tagRegex) #oldMatches thusly named due to not explicitly using Regexp or MatchData, as in "the old way..."  
    if(oldMatches.size > 0) 
      oldMatches.each_index do |index|   
        arrayMatch = oldMatches[index]  
        aMatch = arrayMatch[0]  
        #create a new regex using the match results; make sure to use auto escape method  
        replacementRegex = Regexp.new(Regexp.escape(aMatch))  
        #replace any periods with underscores in a copy of lineMatchCapture  
        periodToUnderscoreCorrection = aMatch.gsub(/\./, '_')  
        #replace original match with underscore replaced copy within line, matching against the new escaped literal regex  
        htmlLine.gsub!(replacementRegex, periodToUnderscoreCorrection)  
        puts "The modified htmlLine is now: #{htmlLine}"         
      end # I kind of still prefer the brackets...;-)  
    end  
  end
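The same substitution can also be written more compactly by passing a block to gsub, which avoids building an escaped regex for each match. This is only a sketch of an alternative, not the approach used above; the method name underscore_tag_names and the sample line are hypothetical.

```ruby
# Pattern without capture groups; \w and the i flag follow the trimmed regex above.
tag_regex = /<webobject name=(?:\w+\.)+\w+>/i

# Hypothetical helper: returns a copy of the line in which every period
# inside a matching tag is replaced with an underscore.
def underscore_tag_names(line, tag_regex)
  # gsub with a block calls the block once per match and
  # substitutes the block's return value for that match.
  line.gsub(tag_regex) { |tag| tag.tr('.', '_') }
end

# Sample input (an assumption for illustration).
line = "<WEBOBJECT NAME=admin.normalMode.x>text.with.dots<WEBOBJECT NAME=admin.SecondLineMatch>more"
fixed = underscore_tag_names(line, tag_regex)
puts fixed
```

Periods outside the tags, such as those in text.with.dots, are left untouched because the block only sees the text matched by the tag pattern.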

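Now, why does MatchData work the way it does? Its behavior really looks like a bug, and it is certainly not very useful if it can't provide a simple way to access all the matches. Just my $.02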
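On the MatchData question: a MatchData object describes a single match, where index 0 is the text matched by the whole pattern and indexes 1 and up are that match's capture groups. Because the original pattern wraps the entire tag in one group, both indexes hold the same tag. The sketch below illustrates this with a shortened test line (an assumption for illustration) and shows that a later tag can be reached by matching again from the end of the previous match.

```ruby
# Shortened version of the test text (an assumption for illustration).
line = "<WEBOBJECT NAME=admin.normalMode>text<WEBOBJECT NAME=admin.SecondLineMatch>more"
tag_regex = /(<webobject name=(?:\w+\.)+\w+>)+/i

m = tag_regex.match(line)

whole_match = m[0]  # text matched by the entire pattern
group_one   = m[1]  # text captured by the first (and only) group

p m.to_a.size               # whole match plus one capture group
p whole_match == group_one  # the group spans the whole tag, so they agree

# The second tag belongs to a separate match: start matching again
# at the offset where the first match ended.
second = tag_regex.match(line, m.end(0))
p second[1]
```

The tags in the test text are separated by other characters, so the trailing + on the group never gets a second iteration, and even when a group does repeat it keeps only its last iteration. Collecting every match is what String#scan is for.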

Answer 3

A small note: this regex will get you "normalMode"... but not "secondLineMatch":

<webobject name=\w+\.((?:\w+)).+> (with option 'i', for "case insensitive")

This regex will get you "secondLineMatch"... but not "normalMode":

<webobject name=\w+\.((?:\w+))> (with option 'i', for "case insensitive").

I'm not very good at regular expressions, but I'll keep working on it... :)

I don't know whether this helps you, but here is a way to get both:

<webobject name=admin.(\w+) (with option 'i').
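As a quick check of that last pattern, it can be combined with String#scan on the test text from the question. This is a sketch under one assumption: the period after admin is escaped here so that it only matches a literal dot, whereas the pattern above leaves it unescaped.

```ruby
line = "<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>" \
       "moreNonMatchingText<WEBOBJECT NAME=admin.SecondLineMatch>AndEvenMoreNonMatchingText"

# One capture group, so scan returns an array of one-element arrays;
# flatten turns that into a plain list of names.
names = line.scan(/<webobject name=admin\.(\w+)/i).flatten
p names
```

Each captured name stops at the first character that is not a word character, which is the next period in the first tag and the closing > in the second.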

Ruby MatchData class repeats captures instead of including additional captures as it "should"
