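I want to extract a specific HTML tag from some HTML, namely the tag that contains a particular date.
The HTML is supplied inline in the unit test.
Here is the failing unit test: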
public void testParseBasePage() {
    defenseGovContractsParser a = new defenseGovContractsParser("060613");
    String expected = "http://www.defense.gov/contracts/contract.aspx?contractid=5059";
    String result = a.parseBasePage("<td><a id=\"ctl00_ContentPlaceHolder_Body_ContractSummary_dgPRItems_ctl02_lnkPressItem\" title=\"Click for Contracts for June 06, 2013\" class=\"Link12\" href=\"http://www.defense.gov/contracts/contract.aspx?contractid=5059\">Contracts for June 06, 2013</a><span id=\"ctl00_ContentPlaceHolder_Body_ContractSummary_dgPRItems_ctl02_lblSubTitle\" class=\"MoreNews3a\"></span></td>");
    assertEquals(expected, result);
}
Here is the relevant code:
public String parseBasePage(String HTML) {
    String contractUrl;
    String yr = date.substring(4, 6);
    String day = date.substring(2, 4);
    String month = getMonthForInt(Integer.parseInt(date.substring(0, 2)));
    Pattern getLink = Pattern.compile("<a.*?" + month + ".*?" + day + ".*?20" + yr + ".*?>");
    Matcher match = getLink.matcher(HTML);
    String link = match.group();
    contractUrl = link.substring(link.indexOf("href") + 6);
    contractUrl = contractUrl.replaceFirst("\">", "");
    return contractUrl;
}

private String getMonthForInt(int m) {
    String month = "invalid";
    m = m - 1;
    DateFormatSymbols dfs = new DateFormatSymbols();
    String[] months = dfs.getMonths();
    if (m >= 0 && m <= 11) {
        month = months[m];
    }
    return month;
}
The resulting regular expression is:
<a.*?June.*?06.*?2013.*?>
When I try this in any online regex tester, it matches as expected.
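
One possible explanation, offered as a sketch rather than a confirmed diagnosis: in `java.util.regex`, `Matcher.group()` throws `IllegalStateException` unless a match operation such as `find()` or `matches()` has succeeded first, and `parseBasePage` calls `group()` directly after `matcher(HTML)`. The example below assumes that this is the failure being seen. The class `ContractLinkSketch` and the helpers `buildPattern` and `extractLink` are hypothetical names introduced for illustration; they are not part of the original parser. The sketch also pins `DateFormatSymbols` to `Locale.ENGLISH`, on the assumption that the month name must read "June" even when the JVM default locale is not English.

```java
import java.text.DateFormatSymbols;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContractLinkSketch {

    // Hypothetical helper: builds the same pattern as parseBasePage
    // from a date string in MMddyy form, e.g. "060613".
    static Pattern buildPattern(String date) {
        String yr = date.substring(4, 6);
        String day = date.substring(2, 4);
        int monthIndex = Integer.parseInt(date.substring(0, 2)) - 1;
        // Assumption: pin the locale so the month name is English ("June").
        String month = new DateFormatSymbols(Locale.ENGLISH).getMonths()[monthIndex];
        return Pattern.compile("<a.*?" + month + ".*?" + day + ".*?20" + yr + ".*?>");
    }

    // Hypothetical helper: returns the href value of the first matching
    // <a ...> tag, or null when no tag matches.
    static String extractLink(String html, String date) {
        Matcher match = buildPattern(date).matcher(html);
        // find() must succeed before group() may be called.
        if (!match.find()) {
            return null;
        }
        String link = match.group();
        // Same extraction as the original: skip past href=" and drop the trailing ">.
        String url = link.substring(link.indexOf("href") + 6);
        return url.replaceFirst("\">", "");
    }

    public static void main(String[] args) {
        String html = "<td><a id=\"x\" title=\"Click for Contracts for June 06, 2013\" "
                + "class=\"Link12\" href=\"http://www.defense.gov/contracts/contract.aspx?contractid=5059\">"
                + "Contracts for June 06, 2013</a></td>";
        System.out.println(extractLink(html, "060613"));
    }
}
```

Online regex testers typically perform the search step themselves, which would explain why the same expression appears to work there. Returning `null` on a failed match is only one choice; throwing a descriptive exception may suit the parser better.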