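I have been thinking about this problem all day and need help solving it.
I have the HTML below and want to extract every value of the query parameter that matches "?imgurl=". Can someone help me with this?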
</script></div><div id=nr_container><div id=center_col><div id=tbbcc><div id=tbbc style="background:#ebeff9;margin-bottom:4px;padding:8px;display:none"></div></div><div id=res class=med role=main><div id=topstuff></div><!--a--><h2 class=hd>Søgeresultater</h2><div id=ires><ol><script>google.isr.fillCanvas=function(i){var c=document.getElementById('cvs_'+i.id);try{c&&(c.getContext('2d').drawImage(i,0,0,c.offsetWidth,c.offsetHeight));}catch(e){c.style.display='none';i.style.display='block';}}</script><div id=rgsh_s></div><li><div id=rg><div id=rg_s><div id=rg_hp><a id=rg_hpl></a></div><div class=rg_h id=rg_h><div class=rg_hc><a class=rg_hl id=rg_hl><img class=rg_hi id=rg_hi></a><div class=std id=rg_hx><p class=rg_ht id=rg_ht><a id=rg_hta></a></p><p class=rg_hn id=rg_hn></p><p class=rg_hr><span id=rg_hr></span></p><p class=rg_ha><span id=rg_ha><a class=rg_hal id=rg_hals></a><span id=rg_has> ‑ </span><a class=rg_hal id=rg_haln></a><span id=rg_has2> ‑ </span><a class=rg_hal id=rg_halm></a></span></p></div></div></div><span class=rg_ctlv><ul class=rg_ul data-pg=1 data-cnt=44><li class=rg_li data-row=1 style="width:193px;height:145px" ><a class=rg_l style="width:193px;height:145px;margin-top:0px;margin-left:0px" href="/imgres?imgurl=http://www.eecs.berkeley.edu/~loarie/test.colors.gif&imgrefurl=http://s1mon.smartlog.dk/test-post37556&usg=__xdES-qA3W9Np6DMNDs0HPTe2Bn8=&h=606&w=807&sz=18&hl=da&start=1&zoom=1&tbnid=sFzpf2rpdeVHLM:&tbnh=107&tbnw=143&ei=Q9k-TYLkEob0swOzpdH0BA&prev=/images%3Fq%3Dtest%26hl%3Dda%26safe%3Doff%26sa%3DG%26as_st%3Dy%26biw%3D1680%26bih%3D897%26tbs%3Disch:1&itbs=1"><script>google.stb.csi.stTbn()</script><canvas id="cvs_sFzpf2rpdeVHLM:l" style="display:block" width=193 height=145></canvas><img class=rg_i id=sFzpf2rpdeVHLM:l height=145 width=193 style="width:193px;height:145px" onload="google.isr.fillCanvas(this);google.stb.csi.onTbn(1, this)"></a></li><li class=rg_li style="width:154px;height:145px" ><a class=rg_l 
style="width:160px;height:145px;margin-top:0px;margin-left:-2px" href="/imgres?imgurl=http://www.krymmel.dk/dev/media/.jkforum/test-pilot.png&imgrefurl=http://www.krymmel.dk/dev/pages/forum.php&usg=__a-KJQiDnKKy8LxlCV-d3XZpKGuw=&h=327&w=360&sz=110&hl=da&start=2&zoom=1&tbnid=KLm4Rocmahp8wM:&tbnh=110&tbnw=121&ei=Q9k-TYLkEob0swOzpdH0BA&prev=/images%3Fq%3Dtest%26hl%3Dda%26safe%3Doff%26sa%3DG%26as_st%3Dy%26biw%3D1680%26bih%3D897%26tbs%3Disch:1&itbs=1"><script>google.stb.csi.stTbn()</script><canvas id="cvs_KLm4Rocmahp8wM:l" style="display:block" width=160 height=145></canvas><img class=rg_i id=KLm4Rocmahp8wM:l height=145 width=160 style="width:160px;height:145px" onload="google.isr.fillCanvas(this);google.stb.csi.onTbn(1, this)"></a></li><li class=rg_li style="width:148px;height:145px" ><a class=rg_l style="width:148px;height:145px;margin-top:0px;margin-left:0px" href="/imgres?imgurl=http://colorvisiontesting.com/plate%2520with%25205.jpg&imgrefurl=http://colorvisiontesting.com/ishihara.htm&usg=__UfBI8sd8ldLjjiK3-7aGJo0zKy4=&h=309&w=315&sz=142&hl=da&start=3&zoom=1&tbnid=2_UMDol8AQhejM:&tbnh=115&tbnw=117&ei=Q9k-TYLkEob0swOzpdH0BA&prev=/images%3Fq%3Dtest%26hl%3Dda%26safe%3Doff%26sa%3DG%26as_st%3Dy%26biw%3D1680%26bih%3D897%26tbs%3Disch:1&itbs=1"><script>google.stb.csi.stTbn()</script><canvas id="cvs_2_UMDol8AQhejM:l" style="display:block" width=148 height=145></canvas><img class=rg_i id=2_UMDol8AQhejM:l height=145 width=148 style="width:148px;height:145px" onload="google.isr.fillCanvas(this);google.stb.csi.onTbn(1, this)"></a></li><li class=rg_li style="width:193px;height:145px" ><a class=rg_l style="width:193px;height:145px;margin-top:0px;margin-left:0px" 
href="/imgres?imgurl=http://pun.org/josh/archives/04.10.01.GlobalTest-X.gif&imgrefurl=http://hovedstaden.inetgiant.dk/fredensborg/AdDetails/test/3187460&usg=___4P_UDkeMuovXCIjq-PY9WhG1Vw=&h=391&w=520&sz=44&hl=da&start=4&zoom=1&tbnid=l15zkNo3p4iYcM:&tbnh=99&tbnw=131&ei=Q9k-TYLkEob0swOzpdH0BA&prev=/images%3Fq%3Dtest%26hl%3Dda%26safe%3Doff%26sa%3DG%26as_st%3Dy%26biw%3D1680%26bih%3D897%26tbs%3Disch:1&itbs=1"><script>google.stb.csi.stTbn()</script><canvas id="cvs_l15zkNo3p4iYcM:l" style="display:block" width=193 height=145></canvas><img class=rg_i id=l15zkNo3p4iYcM:l height=145 width=193 style="width:193px;height:145px" onload="google.isr.fillCanvas(this);google.stb.csi.onTbn(1, this)"></a></li><li class=rg_li style="width:193px;height:145px" ><a class=rg_l style="width:193px;height:139px;margin-top:3px;margin-left:0px" href="/imgres?imgurl=http://www.daimi.au.dk/~rvinge/Test_daimi.jpg&imgrefurl=http://www.daimi.au.dk/~rvinge/Hot.list.html&usg=__ofrC4G4FpZgXi95enpnIG4Wpdlg=&h=881&w=1223&sz=228&hl=da&start=5&zoom=1&tbnid=WDreIpjcKhg13M:&tbnh=108&tbnw=150&ei=Q9k-TYLkEob0swOzpdH0BA&prev=/images%3Fq%3Dtest%26hl%3Dda%26safe%3Doff%26sa%3DG%26as_st%3Dy%26biw%3D1680%26bih%3D897%26tbs%3Disch:1&itbs=1"><script>google.stb.csi.stTbn()</script><canvas id="cvs_WDreIpjcKhg13M:l" style="display:block" width=193 height=139></canvas><img class=rg_i id=WDreIpjcKhg13M:l height=139 width=193 style="width:193px;height:139px" onload="google.isr.fillCanvas(this);google.stb.csi.onTbn(1, this)"></a></li><li class=rg_li style="width:143px;height:145px" ><a class=rg_l style="width:145px;height:145px;margin-top:0px;margin-left:0px" 
href="/imgres?imgurl=http://www.textually.org/tv/archives/images/set3/test-pattern-clock_4767.jpg&imgrefurl=http://hovedstaden.inetgiant.dk/fredensborg/AdDetails/test/3187460&usg=__BFaPejcst7ygnE72uTI6sJKxmIk=&h=308&w=307&sz=18&hl=da&start=6&zoom=1&tbnid=m1QYUHLkZ-mXCM:&tbnh=117&tbnw=117&ei=Q9k-TYLkEob0swOzpdH0BA&prev=/images%3Fq%3Dtest%26hl%3Dda%26safe%3Doff%26sa%3DG%26as_st%3Dy%26biw%3D1680%26bih%3D897%26tbs%3Disch:1&itbs=1"><script>google.stb.csi.stTbn()</script><canvas id="cvs_m1QYUHLkZ-mXCM:l" style="display:block" width=145 height=145></canvas><img class=rg_i id=m1QYUHLkZ-mXCM:l height=145 width=145 style="width:145px;height:145px" onload="google.isr.fillCanvas(this);google.stb.csi.onTbn(1, this)"></a></li><li class=rg_li style="width:118px;height:145px" ><a class=rg_l style="width:118px;height:145px;margin-top:0px;margin-left:0px" href="/imgres?imgurl=http://imgs.xkcd.com/comics/turing_test.png&imgrefurl=http://xkcd.com/329/&usg=__DdATXOcoguD2UbYUMs_iwi4r54I=&h=394&w=320&sz=22&hl=da&start=7&zoom=1&tbnid=UeYWZFjYErEM6M:&tbnh=124&tbnw=101&ei=Q9k-TYLkEob0swOzpdH0BA&prev=/images%3Fq%3Dtest%26hl%3Dda%26safe%3Doff%26sa%3DG%26as_st%3Dy%26biw%3D1680%26bih%3D897%26tbs%3Disch:1&itbs=1"><script>google.stb.csi.stTbn()</script><canvas id="cvs_UeYWZFjYErEM6M:l" style="display:block" width=118 height=145></canvas><img class=rg_i id=UeYWZFjYErEM6M:l height=145 width=118 style="width:118px;height:145px" onload="google.isr.fillCanvas(this);google.stb.csi.onTbn(1, this)"></a></li><li class=rg_li style="width:133px;height:145px" ><a class=rg_l style="width:149px;height:145px;margin-top:0px;margin-left:-4px" 
href="/imgres?imgurl=http://thomasdamgaard.dk/blog/images/test01.jpg&imgrefurl=http://thomasdamgaard.dk/blog/test-skilt-pa-motorvejen&usg=__quqWeHGs6OFAggLm5DBauetlRQU=&h=487&w=500&sz=22&hl=da&start=8&zoom=1&tbnid=HwAHMYrtavz5IM:&tbnh=127&tbnw=130&ei=Q9k-TYLkEob0swOzpdH0BA&prev=/images%3Fq%3Dtest%26hl%3Dda%26safe%3Doff%26sa%3DG%26as_st%3Dy%26biw%3D1680%26bih%3D897%26tbs%3Disch:1&itbs=1"><script>google.stb.csi.stTbn()</script><canvas id="cvs_HwAHMYrtavz5IM:l" style="display:block" width=149 height=145></canvas><img class=rg_i id=HwAHMYrtavz5IM:l height=145 width=149 style="width:149px;height:145px" onload="google.isr.fillCanvas(this);google.stb.csi.onTbn(1, this)"></a></li><li class=rg_li style="width:100px;height:145px" ><a class=rg_l style="width:102px;height:145px;margin-top:0px;margin-left:0px" href="/imgres?imgurl=http://www.ct4me.net/images/dmbtest.gif
Answer 0 (score: 1)
Do not use regular expressions to parse HTML.
See here for why.
Use an HTML parser for your platform/language.
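Edit:
Since you have said you are using C#, I suggest the HTML Agility Pack: it is widely used and can be queried with XPath, much like XmlDocument.
For your specific need, I would get all the links and use string.Split on each one to pull out the query string parameter you want.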
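The string.Split step might look something like the sketch below. It assumes the href values have already been pulled out of the markup by whatever parser you use; the class name `ImgUrlSplit` and the method `GetParam` are made up for illustration and are not part of any library. The returned value is left URL-encoded.

```csharp
using System;

public static class ImgUrlSplit
{
    // Hypothetical helper: returns the raw (still URL-encoded) value of the
    // named query parameter, or null if the href does not contain it.
    public static string GetParam(string href, string name)
    {
        int q = href.IndexOf('?');
        if (q < 0) return null;               // no query string at all

        string query = href.Substring(q + 1);
        int hash = query.IndexOf('#');
        if (hash >= 0) query = query.Substring(0, hash); // drop any fragment

        foreach (string pair in query.Split('&'))
        {
            // Split on the first '=' only, so values that themselves
            // contain '=' (such as the usg parameter) stay intact.
            string[] kv = pair.Split(new[] { '=' }, 2);
            if (kv.Length == 2 && kv[0] == name) return kv[1];
        }
        return null;
    }
}
```

Calling `ImgUrlSplit.GetParam(href, "imgurl")` on each link's href would then give one image URL per result, or null for links that are not image results.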
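Answer 1 (score: 1)
It annoys me how quickly people jump on the "don't use regex to parse HTML" bandwagon. You are not really parsing HTML here anyway. Even if you used the Html Agility Pack to extract the URLs from the HTML, you would still need to extract the imgurl parameter from each query string.
Regular expressions are well suited to extracting parameters from a query string, and this should do what you want: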
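using System.Text.RegularExpressions;

string input = "your big HTML string";
// The lookbehind anchors on "?imgurl=" or "&imgurl="; the match is the value itself,
// which runs until the next '&', '#', or quote character.
MatchCollection matches = Regex.Matches(
    input,
    @"(?<=[?&]imgurl=)[^&#'""]*", // inside a verbatim string, a literal " is written as ""
    RegexOptions.IgnoreCase // remove this if you don't want to ignore case in "imgurl"
);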
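I am all for using the HTML Agility Pack to actually parse HTML, but if you just want to pull some strings that conform to a well-defined pattern out of a larger string, there is no better tool for the job than regex. The reason not to use regex for parsing HTML markup is that HTML's structure is unreliable. The query string of a URL has to follow a specific format, so using regex on it is safe.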
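To turn the matches into usable URLs, something along these lines could work. This is only a sketch: the class name `ImgUrlRegex` and the method `Extract` are invented for the example, and the decoding step assumes the values are percent-encoded once, as they appear to be in the sample HTML (for instance `plate%2520with%25205.jpg`).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImgUrlRegex
{
    // Same lookbehind pattern as in this answer; "" is an escaped quote
    // inside a verbatim string literal.
    private static readonly Regex Pattern = new Regex(
        @"(?<=[?&]imgurl=)[^&#'""]*",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns every imgurl value found in the input,
    // with one layer of percent-encoding removed.
    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Pattern.Matches(html))
        {
            urls.Add(Uri.UnescapeDataString(m.Value));
        }
        return urls;
    }
}
```

Because the pattern requires `?` or `&` immediately before `imgurl=`, neighbouring parameters such as `imgrefurl` are not picked up.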