使用正则表达式提取多个查询参数

时间:2011-01-25 21:14:21

标签: c# html regex url

我整天都在考虑这个问题,我需要帮助解决它。

我有下面的html并想要提取匹配“?imgurl =”的查询参数的所有值。有人可以帮我解决这个问题吗?

</script></div><div id=nr_container><div id=center_col><div id=tbbcc><div id=tbbc style="background:#ebeff9;margin-bottom:4px;padding:8px;display:none"></div></div><div id=res class=med role=main><div id=topstuff></div><!--a--><h2 class=hd>Søgeresultater</h2><div id=ires><ol><script>google.isr.fillCanvas=function(i){var c=document.getElementById('cvs_'+i.id);try{c&&(c.getContext('2d').drawImage(i,0,0,c.offsetWidth,c.offsetHeight));}catch(e){c.style.display='none';i.style.display='block';}}</script><div id=rgsh_s></div><li><div id=rg><div id=rg_s><div id=rg_hp><a id=rg_hpl></a></div><div class=rg_h id=rg_h><div class=rg_hc><a class=rg_hl id=rg_hl><img class=rg_hi id=rg_hi></a><div class=std id=rg_hx><p class=rg_ht id=rg_ht><a id=rg_hta></a></p><p class=rg_hn id=rg_hn></p><p class=rg_hr><span id=rg_hr></span></p><p class=rg_ha><span id=rg_ha><a class=rg_hal id=rg_hals></a><span id=rg_has>&nbsp;&#8209;&nbsp;</span><a class=rg_hal id=rg_haln></a><span id=rg_has2>&nbsp;&#8209;&nbsp;</span><a class=rg_hal id=rg_halm></a></span></p></div></div></div><span class=rg_ctlv><ul class=rg_ul data-pg=1 data-cnt=44><li class=rg_li data-row=1 style="width:193px;height:145px" ><a class=rg_l style="width:193px;height:145px;margin-top:0px;margin-left:0px" href="/imgres?imgurl=http://www.eecs.berkeley.edu/~loarie/test.colors.gif&amp;imgrefurl=http://s1mon.smartlog.dk/test-post37556&amp;usg=__xdES-qA3W9Np6DMNDs0HPTe2Bn8=&amp;h=606&amp;w=807&amp;sz=18&amp;hl=da&amp;start=1&amp;zoom=1&amp;tbnid=sFzpf2rpdeVHLM:&amp;tbnh=107&amp;tbnw=143&amp;ei=Q9k-TYLkEob0swOzpdH0BA&amp;prev=/images%3Fq%3Dtest%26hl%3Dda%26safe%3Doff%26sa%3DG%26as_st%3Dy%26biw%3D1680%26bih%3D897%26tbs%3Disch:1&amp;itbs=1"><script>google.stb.csi.stTbn()</script><canvas id="cvs_sFzpf2rpdeVHLM:l" style="display:block" width=193 height=145></canvas><img class=rg_i id=sFzpf2rpdeVHLM:l height=145 width=193 style="width:193px;height:145px" onload="google.isr.fillCanvas(this);google.stb.csi.onTbn(1, this)"></a></li><li class=rg_li style="width:154px;height:145px" ><a class=rg_l style="width:160px;height:145px;margin-top:0px;margin-left:-2px" href="/imgres?imgurl=http://www.krymmel.dk/dev/media/.jkforum/test-pilot.png&amp;imgrefurl=http://www.krymmel.dk/dev/pages/forum.php&amp;usg=__a-KJQiDnKKy8LxlCV-d3XZpKGuw=&amp;h=327&amp;w=360&amp;sz=110&amp;hl=da&amp;start=2&amp;zoom=1&amp;tbnid=KLm4Rocmahp8wM:&amp;tbnh=110&amp;tbnw=121&amp;ei=Q9k-TYLkEob0swOzpdH0BA&amp;prev=/images%3Fq%3Dtest%26hl%3Dda%26safe%3Doff%26sa%3DG%26as_st%3Dy%26biw%3D1680%26bih%3D897%26tbs%3Disch:1&amp;itbs=1"><script>google.stb.csi.stTbn()</script><canvas id="cvs_KLm4Rocmahp8wM:l" style="display:block" width=160 height=145></canvas><img class=rg_i id=KLm4Rocmahp8wM:l height=145 width=160 style="width:160px;height:145px" onload="google.isr.fillCanvas(this);google.stb.csi.onTbn(1, this)"></a></li><li class=rg_li style="width:148px;height:145px" ><a class=rg_l style="width:148px;height:145px;margin-top:0px;margin-left:0px" href="/imgres?imgurl=http://colorvisiontesting.com/plate%2520with%25205.jpg&amp;imgrefurl=http://colorvisiontesting.com/ishihara.htm&amp;usg=__UfBI8sd8ldLjjiK3-7aGJo0zKy4=&amp;h=309&amp;w=315&amp;sz=142&amp;hl=da&amp;start=3&amp;zoom=1&amp;tbnid=2_UMDol8AQhejM:&amp;tbnh=115&amp;tbnw=117&amp;ei=Q9k-TYLkEob0swOzpdH0BA&amp;prev=/images%3Fq%3Dtest%26hl%3Dda%26safe%3Doff%26sa%3DG%26as_st%3Dy%26biw%3D1680%26bih%3D897%26tbs%3Disch:1&amp;itbs=1"><script>google.stb.csi.stTbn()</script><canvas id="cvs_2_UMDol8AQhejM:l" style="display:block" width=148 height=145></canvas><img class=rg_i id=2_UMDol8AQhejM:l height=145 width=148 style="width:148px;height:145px" onload="google.isr.fillCanvas(this);google.stb.csi.onTbn(1, this)"></a></li><li class=rg_li style="width:193px;height:145px" ><a class=rg_l style="width:193px;height:145px;margin-top:0px;margin-left:0px" href="/imgres?imgurl=http://pun.org/josh/archives/04.10.01.GlobalTest-X.gif&amp;imgrefurl=http://hovedstaden.inetgiant.dk/fredensborg/AdDetails/test/3187460&amp;usg=___4P_UDkeMuovXCIjq-PY9WhG1Vw=&amp;h=391&amp;w=520&amp;sz=44&amp;hl=da&amp;start=4&amp;zoom=1&amp;tbnid=l15zkNo3p4iYcM:&amp;tbnh=99&amp;tbnw=131&amp;ei=Q9k-TYLkEob0swOzpdH0BA&amp;prev=/images%3Fq%3Dtest%26hl%3Dda%26safe%3Doff%26sa%3DG%26as_st%3Dy%26biw%3D1680%26bih%3D897%26tbs%3Disch:1&amp;itbs=1"><script>google.stb.csi.stTbn()</script><canvas id="cvs_l15zkNo3p4iYcM:l" style="display:block" width=193 height=145></canvas><img class=rg_i id=l15zkNo3p4iYcM:l height=145 width=193 style="width:193px;height:145px" onload="google.isr.fillCanvas(this);google.stb.csi.onTbn(1, this)"></a></li><li class=rg_li style="width:193px;height:145px" ><a class=rg_l style="width:193px;height:139px;margin-top:3px;margin-left:0px" href="/imgres?imgurl=http://www.daimi.au.dk/~rvinge/Test_daimi.jpg&amp;imgrefurl=http://www.daimi.au.dk/~rvinge/Hot.list.html&amp;usg=__ofrC4G4FpZgXi95enpnIG4Wpdlg=&amp;h=881&amp;w=1223&amp;sz=228&amp;hl=da&amp;start=5&amp;zoom=1&amp;tbnid=WDreIpjcKhg13M:&amp;tbnh=108&amp;tbnw=150&amp;ei=Q9k-TYLkEob0swOzpdH0BA&amp;prev=/images%3Fq%3Dtest%26hl%3Dda%26safe%3Doff%26sa%3DG%26as_st%3Dy%26biw%3D1680%26bih%3D897%26tbs%3Disch:1&amp;itbs=1"><script>google.stb.csi.stTbn()</script><canvas id="cvs_WDreIpjcKhg13M:l" style="display:block" width=193 height=139></canvas><img class=rg_i id=WDreIpjcKhg13M:l height=139 width=193 style="width:193px;height:139px" onload="google.isr.fillCanvas(this);google.stb.csi.onTbn(1, this)"></a></li><li class=rg_li style="width:143px;height:145px" ><a class=rg_l style="width:145px;height:145px;margin-top:0px;margin-left:0px" href="/imgres?imgurl=http://www.textually.org/tv/archives/images/set3/test-pattern-clock_4767.jpg&amp;imgrefurl=http://hovedstaden.inetgiant.dk/fredensborg/AdDetails/test/3187460&amp;usg=__BFaPejcst7ygnE72uTI6sJKxmIk=&amp;h=308&amp;w=307&amp;sz=18&amp;hl=da&amp;start=6&amp;zoom=1&amp;tbnid=m1QYUHLkZ-mXCM:&amp;tbnh=117&amp;tbnw=117&amp;ei=Q9k-TYLkEob0swOzpdH0BA&amp;prev=/images%3Fq%3Dtest%26hl%3Dda%26safe%3Doff%26sa%3DG%26as_st%3Dy%26biw%3D1680%26bih%3D897%26tbs%3Disch:1&amp;itbs=1"><script>google.stb.csi.stTbn()</script><canvas id="cvs_m1QYUHLkZ-mXCM:l" style="display:block" width=145 height=145></canvas><img class=rg_i id=m1QYUHLkZ-mXCM:l height=145 width=145 style="width:145px;height:145px" onload="google.isr.fillCanvas(this);google.stb.csi.onTbn(1, this)"></a></li><li class=rg_li style="width:118px;height:145px" ><a class=rg_l style="width:118px;height:145px;margin-top:0px;margin-left:0px" href="/imgres?imgurl=http://imgs.xkcd.com/comics/turing_test.png&amp;imgrefurl=http://xkcd.com/329/&amp;usg=__DdATXOcoguD2UbYUMs_iwi4r54I=&amp;h=394&amp;w=320&amp;sz=22&amp;hl=da&amp;start=7&amp;zoom=1&amp;tbnid=UeYWZFjYErEM6M:&amp;tbnh=124&amp;tbnw=101&amp;ei=Q9k-TYLkEob0swOzpdH0BA&amp;prev=/images%3Fq%3Dtest%26hl%3Dda%26safe%3Doff%26sa%3DG%26as_st%3Dy%26biw%3D1680%26bih%3D897%26tbs%3Disch:1&amp;itbs=1"><script>google.stb.csi.stTbn()</script><canvas id="cvs_UeYWZFjYErEM6M:l" style="display:block" width=118 height=145></canvas><img class=rg_i id=UeYWZFjYErEM6M:l height=145 width=118 style="width:118px;height:145px" onload="google.isr.fillCanvas(this);google.stb.csi.onTbn(1, this)"></a></li><li class=rg_li style="width:133px;height:145px" ><a class=rg_l style="width:149px;height:145px;margin-top:0px;margin-left:-4px" href="/imgres?imgurl=http://thomasdamgaard.dk/blog/images/test01.jpg&amp;imgrefurl=http://thomasdamgaard.dk/blog/test-skilt-pa-motorvejen&amp;usg=__quqWeHGs6OFAggLm5DBauetlRQU=&amp;h=487&amp;w=500&amp;sz=22&amp;hl=da&amp;start=8&amp;zoom=1&amp;tbnid=HwAHMYrtavz5IM:&amp;tbnh=127&amp;tbnw=130&amp;ei=Q9k-TYLkEob0swOzpdH0BA&amp;prev=/images%3Fq%3Dtest%26hl%3Dda%26safe%3Doff%26sa%3DG%26as_st%3Dy%26biw%3D1680%26bih%3D897%26tbs%3Disch:1&amp;itbs=1"><script>google.stb.csi.stTbn()</script><canvas id="cvs_HwAHMYrtavz5IM:l" style="display:block" width=149 height=145></canvas><img class=rg_i id=HwAHMYrtavz5IM:l height=145 width=149 style="width:149px;height:145px" onload="google.isr.fillCanvas(this);google.stb.csi.onTbn(1, this)"></a></li><li class=rg_li style="width:100px;height:145px" ><a class=rg_l style="width:102px;height:145px;margin-top:0px;margin-left:0px" href="/imgres?imgurl=http://www.ct4me.net/images/dmbtest.gif

2 个答案:

答案 0 :(得分:1)

不要使用正则表达式来解析HTML。

请参阅here,了解原因。

为您的平台/语言使用HTML解析器。


编辑:

正如您已指出使用C#,我建议使用HTML Agility Pack - 它被广泛使用,可以使用XPath查询,如XmlDocument。

根据您的特殊需要,我会获取所有链接,并且每次使用string.Split都可以获取所需的查询字符串参数。

答案 1 :(得分:1)

让我感到恼火的是,人们如此迅速地跳过不使用正则表达式来解析HTML。无论如何,你真的不是在解析HTML。即使您使用Html Agility Pack从html中提取URL,您仍然需要从每个查询字符串中提取imgurl个参数。

正则表达式非常适合从查询字符串中提取参数,这可以实现您的目的:

string input = "your big HTML string";
MatchCollection matches = Regex.Matches(
    input, 
    @"(?<=[?&]imgurl=)[^&#'"]*", 
    RegexOptions.IgnoreCase // remove this if you don't want to ignore case in "imgurl"
);

我全都是使用HTML Agility Pack来实际解析HTML,但是如果你只是想从一个更大的字符串中删除一些字符串(这符合一个定义良好的模式),那么就没有比这更好的工具了。正则表达式。使用正则表达式来解析HTML标记的原因是HTML的结构不可靠。 URL的查询字符串必须采用特定格式,因此使用正则表达式是安全的。