Here is the HTML file I want to extract some words from (Pending, Next Listing Date (Likely):, 10/01/2014). I am using Jaunt and Jsoup.
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Language" content="en-us"/>
<meta http-equiv="Content-Type" content="text/html;url=http://allahabadhighcourt.in/casestatus/utf-8"/>
<title>Case Status Result</title>
<link REL="StyleSheet" href="http://allahabadhighcourt.in/alldhc.css" TYPE="text/css"/>
<script src="http://allahabadhighcourt.in/alldhc.js" LANGUAGE="JavaScript" TYPE="text/javascript">
<!--
-->
</script>
</head>
<body onLoad="bodyOnLoad()">
<div CLASS="heading">
<img BORDER="0" src="http://allahabadhighcourt.in/image/titleEN.gif" WIDTH="532" HEIGHT="30" ALT="HIGH COURT OF JUDICATURE AT ALLAHABAD"/>
</div>
<h4 CLASS="subheading" ALIGN="center" STYLE="margin-top: 6pt; margin-bottom: 0pt">Case Status - Allahabad</h4>
<p ALIGN="center" STYLE="margin-top: 0; margin-bottom: 6pt">
<img BORDER="0" src="http://allahabadhighcourt.in/image/blueline.gif" WIDTH="210" HEIGHT="1"/></p>
<table ALIGN="center" CLASS="withb" WIDTH="60%" COLS="2">
<tr><td VALIGN='top' COLSPAN='2' ALIGN='right' STYLE='font-size: 18pt'>Pending</td></tr><tr><td VALIGN='top' ALIGN='center' COLSPAN='2' STYLE='font-size: 16pt'>Criminal Misc. Bail Application : 12898 of 2013 [Etah]</td></tr><tr><td VALIGN='top' WIDTH='35%' STYLE='font-size: 14pt'>Petitioner:</td><td STYLE='font-size: 14pt'>AVANISH</td></tr><tr><td VALIGN='top' WIDTH='35%' STYLE='font-size: 14pt'>Respondent:</td><td STYLE='font-size: 14pt'>STATE OF U.P.</td></tr><tr><td VALIGN='top' WIDTH='35%' STYLE='font-size: 14pt'>Counsel (Pet.):</td><td STYLE='font-size: 14pt'>SANJEEV MISHRA</td></tr><tr><td VALIGN='top' WIDTH='35%' STYLE='font-size: 14pt'>Counsel (Res.):</td><td STYLE='font-size: 14pt'>GOVT. ADVOCATE</td></tr><tr><td VALIGN='top' WIDTH='35%' STYLE='font-size: 14pt'>Category:</td><td VALIGN='top'>Criminal Jurisdiction Application-U/s 439, Cr.p.c., For Bail (major)</td></tr><tr><td VALIGN='top' WIDTH='35%' STYLE='font-size: 14pt'>Date of Filing:</td><td VALIGN='top' STYLE='font-size: 14pt'>08/05/2013</td></tr><tr><td WIDTH='35%' STYLE='font-size: 14pt'>Last Listed on:</td><td STYLE='font-size: 14pt'>03/01/2014 in Court No. 48</td></tr><tr><td WIDTH='35%' STYLE='font-size: 14pt'>Next Listing Date (Likely):</td><td STYLE='font-size: 14pt'>10/01/2014</td></tr><tr><td COLSPAN='2'></td></tr></table><p STYLE="text-align: justify; margin-top: 16pt; margin-left: 90pt; margin-right: 90pt; font-size: 10pt">This is not an authentic/certified copy of the information regarding status of a case. Authentic/certified information may be obtained under Chapter VIII Rule 30 of Allahabad High Court Rules. Mistake, if any, may be brought to the notice of OSD (Computer).</p>
<table ALIGN="center" WIDTH="80%" COLS="1" RULES="NONE" BORDER="0" STYLE="margin-top: 16pt">
<tbody>
<tr ALIGN="center" VALIGN="TOP">
<td VALIGN="TOP" ALIGN="center">
<img ALT="Back" src="http://allahabadhighcourt.in/image/back.gif" WIDTH="30" HEIGHT="25" BORDER="0" onClick="location.href='indexA.html'" STYLE="cursor:pointer"/>
</td>
</tr>
</tbody>
</table>
</body>
</html>
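Answer 0 (score: 0)
Your HTML has no hook, such as an id, from which to start parsing. I suggest the following:
Add an "id" attribute to the table tag: <table id="data-table" ALIGN="center" WIDTH="80%" COLS="1" RULES="NONE" BORDER="0" STYLE="margin-top: 16pt">
Then use Jsoup to parse the content like this: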
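import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
String html = "The entire html page read as a Java String";
Document doc = Jsoup.parse(html);
Element tableElement = doc.select("#data-table").first(); // select() returns Elements, so take the first match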
Then iterate over tableElement using the Elements API.
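A minimal sketch of that iteration, assuming the table has been given the hypothetical id "data-table" suggested above and that every data row holds exactly two cells (label, value). The class name CaseStatusTable, the method readRows, and the shortened inline HTML are illustrative only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CaseStatusTable {

    // Collects "label -> value" pairs from every two-cell row of the table with the given id.
    static Map<String, String> readRows(String html, String tableId) {
        Map<String, String> fields = new LinkedHashMap<>();
        Document doc = Jsoup.parse(html);
        Element table = doc.getElementById(tableId);
        if (table == null) {
            return fields; // no such table: return an empty map
        }
        for (Element row : table.select("tr")) {
            Elements cells = row.select("td");
            if (cells.size() == 2) {
                fields.put(cells.get(0).text(), cells.get(1).text());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the page; the id is an assumption, the real page has none.
        String html = "<table id=\"data-table\">"
                + "<tr><td colspan=\"2\">Pending</td></tr>"
                + "<tr><td>Petitioner:</td><td>AVANISH</td></tr>"
                + "<tr><td>Next Listing Date (Likely):</td><td>10/01/2014</td></tr>"
                + "</table>";
        Map<String, String> fields = readRows(html, "data-table");
        System.out.println(fields.get("Next Listing Date (Likely):"));
    }
}
```

Single-cell rows such as the "Pending" status are skipped by this sketch; they would need to be read separately, for example with table.select("td").first().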
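Answer 1 (score: 0)
As some comments have already pointed out, it is hard to target specific elements because the tags carry no distinguishing attributes. However, if your table always keeps the same structure, perhaps with some empty values, you can use Jsoup's CSS selectors to pick specific elements by index.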
Document doc = Jsoup.parse(html); // html: the page source as a String
Element pending = doc.select("table td:eq(0)").first();
Element nextDate = doc.select("table td:eq(0)").get(9);
Element date = doc.select("table td:eq(1)").last();
System.out.println(pending.text() + "\n" + nextDate.text() + "\n" + date.text());
This prints:
Pending
Next Listing Date (Likely):
10/01/2014
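Note the use of a pseudo-selector to specify the element's index: td:eq(0) matches every td that is the first child of its row.
If each element had distinct attributes, you could select them with an attribute selector such as [attr=value], in this case something like [VALIGN=top]. That will not work here, because many cells share the same attribute values.
I strongly recommend reading more about how to use selector syntax to parse HTML documents; the Jsoup selector-syntax documentation covers the details.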
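As an alternative to fixed indices, a cell can be located by its label text and its value read from the neighbouring cell. This is a sketch rather than part of the original answer: it assumes each label cell is immediately followed by its value cell, and the class name LabelLookup and method valueFor are hypothetical. It relies on Jsoup's :containsOwn(text) pseudo-selector and Element.nextElementSibling().

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LabelLookup {

    // Returns the text of the cell following the first td whose own text contains
    // the label (case-insensitive), or null if no such cell or sibling exists.
    // The label should not contain parentheses, since it is spliced into the selector.
    static String valueFor(Document doc, String label) {
        Element labelCell = doc.select("td:containsOwn(" + label + ")").first();
        if (labelCell == null) {
            return null;
        }
        Element valueCell = labelCell.nextElementSibling();
        return valueCell == null ? null : valueCell.text();
    }

    public static void main(String[] args) {
        // Shortened stand-in for the court page.
        String html = "<table>"
                + "<tr><td>Last Listed on:</td><td>03/01/2014 in Court No. 48</td></tr>"
                + "<tr><td>Next Listing Date (Likely):</td><td>10/01/2014</td></tr>"
                + "</table>";
        Document doc = Jsoup.parse(html);
        System.out.println(valueFor(doc, "Next Listing Date"));
    }
}
```

Looking up by label keeps working if rows are added or reordered, at the cost of depending on the exact label wording.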
Answer 2 (score: 0)
You can also use regular expressions in Java to do the same thing.
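import com.jaunt.UserAgent;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

UserAgent userAgent = new UserAgent(); // create a new UserAgent (headless browser)
userAgent.visit(your_site_link); // visit the URL; throws ResponseException on failure
String siteText = userAgent.doc.innerHTML();
// Match the text directly before each closing </td>; [^<>]* keeps each match inside one cell
String REGEX = "(?<=>)[^<>]*(?=<\\s*/td\\s*>)";
Pattern pattern = Pattern.compile(REGEX);
Matcher matcher = pattern.matcher(siteText);
while (matcher.find()) {
    System.out.println("TD data: " + matcher.group());
}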
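The regular-expression idea can also be tried offline with only the standard library. The following is a sketch under stated assumptions: the class name TdTextExtractor and the inline HTML are illustrative, and the pattern only handles cells whose content has no nested tags. Regular expressions are fragile on general HTML, so a parser such as Jsoup is usually the safer choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TdTextExtractor {

    // Matches one <td ...>text</td> cell whose content contains no nested tags.
    private static final Pattern CELL =
            Pattern.compile("<td\\b[^>]*>([^<]*)</td>", Pattern.CASE_INSENSITIVE);

    // Returns the trimmed text of every non-empty cell, in document order.
    static List<String> extractCellTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher matcher = CELL.matcher(html);
        while (matcher.find()) {
            String text = matcher.group(1).trim();
            if (!text.isEmpty()) {
                texts.add(text);
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the table rows of the court page.
        String html = "<tr><td COLSPAN='2'>Pending</td></tr>"
                + "<tr><td>Next Listing Date (Likely):</td><td>10/01/2014</td></tr>"
                + "<tr><td COLSPAN='2'></td></tr>";
        for (String text : extractCellTexts(html)) {
            System.out.println("TD data: " + text);
        }
    }
}
```

Capturing the cell body with a group, instead of lookarounds, makes it easy to skip empty cells and to trim whitespace before use.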