Ruby regex or Nokogiri for parsing HTML

Time: 2013-09-19 12:51:20

Tags: ruby regex html-parsing

I have an HTML fragment like this:

<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwULLTIwMjYxMTgwOTAPFgIeBGd1aWQFIDI1NmYyOTdkZWZhNjQyODhhYTVmOWI4MGE5MzRjNjlhFgJmD2QWAgIDDxYCHgxhdXRvY29tcGxldGUFA29mZhYCAgEPZBYGZg9kFgJmD2QWAgIFDw8WAh4EVGV4dAUO0JLRi9C50LTQuNGC0LVkZAICD2QWAgICD2QWAgIBDxAPFgYeDURhdGFUZXh0RmllbGQFBU5hendhHg5EYXRhVmFsdWVGaWVsZAUQSURXZXJzamVKZXp5a293ZR4LXyFEYXRhQm91bmRnZBAVAwZQb2xza2EHRW5nbGlzaA7QoNGD0YHRgdC60LDRjxUDATEBMgIxNxQrAwNnZ2cWAQICZAIED2QWAmYPZBYCAgEPZBYCZg9kFgICAQ9kFgICAw9kFgQCAg8PFgQeJk5vQm90X1Jlc3BvbnNlVGltZUtleV9jdGwwMCRjcCROb0JvdElEBnWRyyBNg9BIHiROb0JvdF9TZXNzaW9uS2V5S2V5X2N0bDAwJGNwJE5vQm90SUQFNE5vQm90X1Nlc3Npb25LZXlfY3RsMDAkY3AkTm9Cb3RJRF82MzUxNTE5MTQ3MjUxNzIzNDFkFgICAQ8WAh4PQ2hhbGxlbmdlU2NyaXB0BVl2YXIgZSA9IGRvY3VtZW50LmdldEVsZW1lbnRCeUlkKCdjdGwwMF9jcF9Ob0JvdElEX3BjbmInKTsgZS5vZmZzZXRXaWR0aCArIGUub2Zmc2V0SGVpZ2h0O2QCAw9kFgICAQ9kFgJmD2QWBAIDD2QWBGYPZBYCAgEPZBYIAgEPDxYGHgpQbGFpblZhbHVlBQQ0RDVoHg5FbmNyeXB0ZWRWYWx1ZQUgMjQxZjdmZGQ3ODIxNGJhYzgyOGNhNzU3ZDY4NWI3Y2IeB1Zpc2libGVoZGQCAw8PFgIfC2hkZAIFDw8WAh8LaGRkAgcPFgIfC2dkAgMPZBYCZg9kFgICAQ8PFgIfAgUK0JTQsNC70LXQtWRkAgUPZBYKZg9kFgRmD2QWAgIBDw8WAh8CBRPQktC40LQg0YPRgdC70YPQs9C4ZGQCAQ9kFgICAQ8QZGQWAGQCAQ9kFgRmD2QWAgIBDw8WAh8CBR7QnNC10YHRgtC+0L3QsNGF0L7QttC00LXQvdC40LVkZAIBD2QWAgIBDxBkZBYAZAICD2QWBGYPZBYCAgEPDxYCHwIFCNCh0YDQvtC6ZGQCAQ9kFgICAQ8QDxYCHgxBdXRvUG9zdEJhY2toFgIeCG9uQ2hhbmdlBR5jYkR6aWVuR29kemluYV9vbkNoYW5nZSh0aGlzKTtkFgBkAgMPZBYEZg9kFgICAQ8PFgIfAgUG0YfQsNGBZGQCAQ9kFgICAQ8QZGQWAGQCBQ9kFgJmD2QWBAIBDw8WAh8CBSTQl9Cw0YDQtdCz0LjRgdGC0YDQuNGA0L7QstCw0YLRjNGB0Y9kZAIDDw8WAh8CBTbQntGC0YHRg9GC0YHRgtCy0LjQtSDRgdCy0L7QsdC+0LTQvdGL0LUg0LTQsNGC0Ysg0LTQviBkZBgBBR1jdGwwMCRjcCRldmVudE9yZGVyVmFsaWRhdGlvbg8PZDKpAwABAAAA/////wEAAAAAAAAADAIAAABGTVNaX1dXV19LTElFTlQsIFZlcnNpb249Mi4xMi4wLjAsIEN1bHR1cmU9bmV1dHJhbCwgUHVibGljS2V5VG9rZW49bnVsbAUBAAAATU1TWl9XV1dfS0xJRU5ULktvbnRyb2xraS5FdmVudE9yZGVyVmFsaWRhdGlvbitFdmVudE9yZGVyVmFsaWRhdGlvbkNvbnRyb2xEYXRhBAAAAAtFdmVudE51bWJlchNFdmVudEV4cGlyYXRpb25UaW1lDVNlY3VyaXR5VG9rZW4MTGljem5pa1Rva2VuAwMB
AAxTeXN0ZW0uSW50MzJxU3lzdGVtLk51bGxhYmxlYDFbW1N5c3RlbS5EYXRlVGltZSwgbXNjb3JsaWIsIFZlcnNpb249NC4wLjAuMCwgQ3VsdHVyZT1uZXV0cmFsLCBQdWJsaWNLZXlUb2tlbj1iNzdhNWM1NjE5MzRlMDg5XV0IAgAAAAgIAAAAAAoGAwAAACBmNjZlZTgwODEwZWM0ZjcwYThhZjY2ZDcyNDlmNWFjZgEAAAALZGrI7rw4FqPtCexAP1+dCQ7Qps1t">

I tried a regex like this:

VIEWSTATE = (/<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="(.*?)" \/>/.match body_text)[1]

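But it behaves strangely: it does not capture the whole value, only part of it. Which regex should I use in this case? (Note: the __VIEWSTATE" id="__VIEWSTATE" value= part must stay in the regex.)

It would be great if you could show me how.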

1 answer:

Answer 0 (score: 2)

require 'nokogiri'

doc = Nokogiri::HTML::Document.parse <<-eot
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwULLTIwMjYxMTgwOTAPFgIeB">
eot

# using css
nd = doc.at_css('input#__VIEWSTATE')
nd['name'] 
# => "__VIEWSTATE"

# using xpath (the element is an input, not a div)
nd = doc.at_xpath('//input[@id="__VIEWSTATE"]')
nd['name']
# => "__VIEWSTATE"
nd['value']
# => "/wEPDwULLTIwMjYxMTgwOTAPFgIeB"

This way you can read the value of any attribute of the input HTML element.
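
If a regex is still required, as the question asks, two things in the fragment shown there are likely to trip up the original pattern: the value contains a line break, which `.` does not match without the `/m` flag, and the tag closes with `">` rather than `" />`. The following is a minimal sketch, not a definitive fix. The `body_text` content is a shortened, hypothetical stand-in for the real page, and stripping the line break assumes it is an artifact of how the HTML was delivered or pasted.

```ruby
# Hypothetical stand-in for the page body: the value is shortened and
# split across two lines, like the fragment in the question.
body_text = <<~HTML
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwULLTIw
  AAxTeXN0ZW0uSW50MzJx">
HTML

# [^"]* matches any character except a double quote, newlines included,
# so no /m flag is needed, and the pattern does not depend on whether
# the tag ends with `>` or ` />`.
VIEWSTATE_RE = /name="__VIEWSTATE" id="__VIEWSTATE" value="([^"]*)"/

match = VIEWSTATE_RE.match(body_text)

# Guard against nil when the field is missing, then drop any line breaks
# (assumption: they are not part of the real value).
viewstate = match && match[1].delete("\r\n")
```

The pattern stops at the first double quote after `value="`, so it captures the whole attribute however long it is. Nokogiri remains the more robust choice when the attribute order or quoting on the page can vary.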