A beginner's question about Java regular expressions

Asked: 2010-04-11 21:01:53

Tags: java html regex

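I recently started learning Java regular expressions, and I have found them very useful for certain tasks. For example, I now need to extract the "product name", the "product description", and the "sellers of this product" from the following HTML. (Sorry that the code is long, but it is quite simple.)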

<td class="sr-check">
<input type="checkbox" name="cptitle" value="678560038" /></td>
<td class="sr-image" style="width: 80px;"><a href="/Nikon-D300S-12-3-678560038/prices-html"     class="strictRule" rel="nofollow"><img src="http://img01.static-nextag.com/image/Nikon-D300S-12-3-MP-Digital-SLR-Camera-Body-Black/0/000/006/789/461/678946110.jpg" alt="Nikon D300S 12.3 MP Digital SLR Camera Body - Black" class="imageLink strictRule" height="75" width="75" id="opILink_0" title="Nikon Digital Cameras - Nikon D300S 12.3 MP Digital SLR Camera Body - Black" /></a><div class="breaker">&nbsp;</div></td>
<td class="sr-info">
<div class="sr-info">
<a id="opPNLink_0" class="underline" style="font-size:16px" href="/Nikon-D300S-12-3-678560038  /prices-html" >Nikon D300S 12.3 MP <b>Digital</b> SLR <b>Camera</b> Body - Black</a> <div class="sr-subinfo">
<div class="sr-info-description">SLR - 13.1MP, 12.3MP - 1x Optical Zoom - CompactFlash, SD/MMC Memory Card - 3in.</div>
<div class="rating">
<img src="http://img01.static-nextag.com/imagefiles/stars/stars4_10px.gif" alt="4/5 stars" title="4/5 stars" /> (92 user ratings)</div>
<div style="clear: both;">
<!-- nxtginc=nextag.api.ServerInclude$JSPIncludeWriter(/buyer/ATLSSI.jsp?ptid=678560038&dts=y) -->
<a id="_atl_0" style="" href="http://www.nextag.com/serv/main/buyer/MyPDir.jsp?list=_transCookieList&amp;cmd=add&amp;ptitle=678560038" rel="nofollow">+ Add to Shopping List</a> &nbsp;|&nbsp; 
<!-- endnxtginc -->
<a rel="nofollow" id="mltLink_0" class="mlt-link" href="/Digital-Cameras--zz500001z2z678560038zB2dgz5---html">See More Like This</a>
</div>
<div id="fsLink_0" class="featuredSeller">
<a rel="nofollow" class="featuredSeller" id="opFSLink_0_0" href="/norob/PtitleSeller.jsp?chnl=main&amp;tag=785646073amp;ctx=x%2BN%2Fs9zy56l4u8RXCzALE1jeLesDMzeK09rPQEdK3Yjx395ZzX9cMh9N5JAxjk7xPqF9hjk2ztM5IRXU5nspLubIXYaVzI%2B%2Fg7h1Qz58TzgvrWuNawV8qEIqqSmClArWMq6mpzNRuSlgg2xCXYObNnaIH00iKSUmBawDRvecwbCpAxhXgXoLEiEinTwr3EipComdzxL9UHFYTLoWUToUB5SRSsolQmEJ3mgnnvu83%2FC8W34TGpN9mJo%2BnyAeTkt4&amp;ptitle=678560038"  target="_blank" >Thundercameras</a>:$1,289 &nbsp;
<a rel="nofollow" class="featuredSeller" id="opFSLink_0_1" href="/norob/PtitleSeller.jsp?chnl=main&amp;tag=797076595&amp;ctx=x%2BN%2Fs9zy56l4u8RXCzALE1jeLesDMzeK09rPQEdK3Yjx395ZzX9cMh9N5JAxjk7xPqF9hjk2ztM5IRXU5nspLubIXYaVzI%2B%2Fg7h1Qz58TzgvrWuNawV8qEIqqSmClArWMq6mpzNRuSlgg2xCXYObNrcWLhL%2BhryuAGhXNhYSPE%2BpAxhXgXoLEiEinTwr3EipComdzxL9UHFYTLoWUToUB5SRSsolQmEJ3mgnnvu83%2FC8W34TGpN9mJo%2BnyAeTkt4&amp;ptitle=678560038"  target="_blank" >PhotoVideoSuperStore</a>:$1,269 &nbsp;
<a rel="nofollow" class="featuredSeller" id="opFSLink_0_2" href="/norob/PtitleSeller.jsp?chnl=main&amp;tag=803555293&amp;ctx=x%2BN%2Fs9zy56l4u8RXCzALE1jeLesDMzeK09rPQEdK3Yjx395ZzX9cMh9N5JAxjk7xPqF9hjk2ztM5IRXU5nspLubIXYaVzI%2B%2Fg7h1Qz58TzgvrWuNawV8qEIqqSmClArWMq6mpzNRuSlgg2xCXYObNt06qcvLJ5UQz7S3zKd4urWpAxhXgXoLEiEinTwr3EipComdzxL9UHFYTLoWUToUB5SRSsolQmEJ3mgnnvu83%2FC8W34TGpN9mJo%2BnyAeTkt4&amp;ptitle=678560038"  target="_blank" >Digitalelect</a>:$1,279 &nbsp;</div>

This is what I have in mind:

(1) Extract the product name from the <td class="sr-image"> tag, using the regular expression

exp ="<td><span\\s+class=\"sr-image\"[^>]*>"
          + ".*?</span><a href=\""
          + "([^\"]+)"      
          + "\"[^>]*>"      
          + "([^<]+)" + "</a>.*?</td>";

(2) Extract the product description from the <div class="sr-info-description"> tag.

exp = "<div class="sr-info-description"> [^>]*>"

(3) Extract the sellers' names from the <div id="fsLink_0" class="featuredSeller"> tag.

exp = "<div id="fslink_0" class="featuredSeller[^>]*>"
          + ".*?</span><a rel=\""
          + "([^\"]+)"      
          + "\"[^>]*>"      
          + "([^<]+)" + "</a>.*?</td>";

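I have only just started learning Java regular expressions, so I would really appreciate it if you could correct me if I am on the wrong track or if my regular expressions are wrong. Thanks a lot, guys.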

1 answer:

Answer 0 (score: 1)

As mentioned above, you should use a parser to interpret HTML input.

But I would still like to answer the regex part of the question: extracting the product description from a line of text such as

<div class="sr-info-description">SLR - 13.1MP, 12.3MP - 1x Optical Zoom - CompactFlash, SD/MMC Memory Card - 3in.</div>

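Assuming the element sits on a single line and its content contains no tags of its own (if it does, you definitely need an HTML parser), the regular expression should look like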

<div class="sr-info-description">([^<]*)</div>

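Compile the expression into a Pattern, create a Matcher for your input, and call find() on it. After a successful find(), group(1) contains the text inside the div tag, while group(0) contains the whole matched region, including the div tags.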
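The steps above could be sketched roughly as follows. This is only a minimal illustration, not a full solution for the whole page: the class name DescriptionExtractor and the method extractDescriptions are made up for this example, and it assumes the description contains no nested tags, as stated above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this example.
public class DescriptionExtractor {

    // The expression from the answer; group 1 captures the text between the div tags.
    private static final Pattern DESCRIPTION =
            Pattern.compile("<div class=\"sr-info-description\">([^<]*)</div>");

    // Returns the text of every matching div found in the input.
    static List<String> extractDescriptions(String html) {
        List<String> descriptions = new ArrayList<>();
        Matcher matcher = DESCRIPTION.matcher(html);
        // find() advances to the next match; group(1) is the captured text,
        // group(0) would be the whole match including the div tags.
        while (matcher.find()) {
            descriptions.add(matcher.group(1).trim());
        }
        return descriptions;
    }

    public static void main(String[] args) {
        String line = "<div class=\"sr-info-description\">SLR - 13.1MP, 12.3MP - "
                + "1x Optical Zoom - CompactFlash, SD/MMC Memory Card - 3in.</div>";
        for (String description : extractDescriptions(line)) {
            System.out.println(description);
        }
    }
}
```

Looping with while (matcher.find()) rather than a single if lets the same code collect the description of every product when the input contains several result rows.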