How do I extract specific text from an HTML table?

Time: 2014-01-08 09:24:32

Tags: java html jsoup jaunt-api

This is the HTML file from which I want to extract the words (Pending, Next Listing Date (Likely):, 10/01/2014). I am using Jaunt and Jsoup.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
   <head>
      <meta http-equiv="Content-Language" content="en-us"/>
      <meta http-equiv="Content-Type" content="text/html;url=http://allahabadhighcourt.in/casestatus/utf-8"/>
      <title>Case Status Result</title>
      <link REL="StyleSheet" href="http://allahabadhighcourt.in/alldhc.css" TYPE="text/css"/>
      <script src="http://allahabadhighcourt.in/alldhc.js" LANGUAGE="JavaScript" TYPE="text/javascript">
      <!--
      -->
      </script>
   </head>
   <body onLoad="bodyOnLoad()">
      <div CLASS="heading">
         <img BORDER="0" src="http://allahabadhighcourt.in/image/titleEN.gif" WIDTH="532" HEIGHT="30" ALT="HIGH COURT OF JUDICATURE AT ALLAHABAD"/>
      </div>
      <h4 CLASS="subheading" ALIGN="center" STYLE="margin-top: 6pt; margin-bottom: 0pt">Case Status - Allahabad</h4>
      <p ALIGN="center" STYLE="margin-top: 0; margin-bottom: 6pt">
         <img BORDER="0" src="http://allahabadhighcourt.in/image/blueline.gif" WIDTH="210" HEIGHT="1"/></p>
<table ALIGN="center" CLASS="withb" WIDTH="60%" COLS="2">
<tr><td VALIGN='top' COLSPAN='2' ALIGN='right' STYLE='font-size: 18pt'>Pending</td></tr><tr><td VALIGN='top' ALIGN='center' COLSPAN='2' STYLE='font-size: 16pt'>Criminal Misc. Bail Application : 12898 of 2013 [Etah]</td></tr><tr><td VALIGN='top' WIDTH='35%' STYLE='font-size: 14pt'>Petitioner:</td><td STYLE='font-size: 14pt'>AVANISH</td></tr><tr><td VALIGN='top' WIDTH='35%' STYLE='font-size: 14pt'>Respondent:</td><td STYLE='font-size: 14pt'>STATE OF U.P.</td></tr><tr><td VALIGN='top' WIDTH='35%' STYLE='font-size: 14pt'>Counsel (Pet.):</td><td STYLE='font-size: 14pt'>SANJEEV MISHRA</td></tr><tr><td VALIGN='top' WIDTH='35%' STYLE='font-size: 14pt'>Counsel (Res.):</td><td STYLE='font-size: 14pt'>GOVT. ADVOCATE</td></tr><tr><td VALIGN='top' WIDTH='35%' STYLE='font-size: 14pt'>Category:</td><td VALIGN='top'>Criminal Jurisdiction Application-U/s 439, Cr.p.c., For Bail (major)</td></tr><tr><td VALIGN='top' WIDTH='35%' STYLE='font-size: 14pt'>Date of Filing:</td><td VALIGN='top' STYLE='font-size: 14pt'>08/05/2013</td></tr><tr><td WIDTH='35%' STYLE='font-size: 14pt'>Last Listed on:</td><td STYLE='font-size: 14pt'>03/01/2014 in Court No. 48</td></tr><tr><td WIDTH='35%' STYLE='font-size: 14pt'>Next Listing Date (Likely):</td><td STYLE='font-size: 14pt'>10/01/2014</td></tr><tr><td COLSPAN='2'></td></tr></table><p STYLE="text-align: justify; margin-top: 16pt; margin-left: 90pt; margin-right: 90pt; font-size: 10pt">This is not an authentic/certified copy of the information regarding status of a case. Authentic/certified information may be obtained under Chapter VIII Rule 30 of Allahabad High Court Rules. Mistake, if any, may be brought to the notice of OSD (Computer).</p>
      <table ALIGN="center" WIDTH="80%" COLS="1" RULES="NONE" BORDER="0" STYLE="margin-top: 16pt">
         <tbody>
            <tr ALIGN="center" VALIGN="TOP">
               <td VALIGN="TOP" ALIGN="center">
                  <img ALT="Back" src="http://allahabadhighcourt.in/image/back.gif" WIDTH="30" HEIGHT="25" BORDER="0" onClick="location.href='indexA.html'" STYLE="cursor:pointer"/>
               </td>
            </tr>
         </tbody>
      </table>
   </body>
</html>

3 Answers:

Answer 0 (score: 0)

Your HTML has no hook from which to start parsing. I suggest the following.

Add an "id" attribute to the table tag that holds the data:
<table id="data-table" ALIGN="center" CLASS="withb" WIDTH="60%" COLS="2">

Then use Jsoup to parse the content like this:

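String html = "The entire html page read as a Java String";
Document doc = Jsoup.parse(html);
// select() returns an Elements collection; first() yields the single matching table (or null)
Element tableElement = doc.select("#data-table").first();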

Then traverse tableElement using the Elements API.
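A minimal sketch of that traversal, assuming the id sits on the table that holds the case details. The class name TableTraversalDemo, the helper readRows, and the shortened HTML string are made up for illustration; only the Jsoup calls (Jsoup.parse, getElementById, select, text) come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.LinkedHashMap;
import java.util.Map;

public class TableTraversalDemo {

    /** Collects every two-cell row of the table with the given id as label -> value. */
    static Map<String, String> readRows(String html, String tableId) {
        Document doc = Jsoup.parse(html);
        Map<String, String> fields = new LinkedHashMap<>();
        Element table = doc.getElementById(tableId);
        if (table == null) {
            return fields; // no such table: return an empty map
        }
        for (Element row : table.select("tr")) {
            Elements cells = row.select("td");
            if (cells.size() == 2) { // skip rows with a single COLSPAN='2' cell
                fields.put(cells.get(0).text(), cells.get(1).text());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Shortened, hypothetical copy of the table from the question.
        String html = "<table id='data-table'>"
                + "<tr><td COLSPAN='2'>Pending</td></tr>"
                + "<tr><td>Petitioner:</td><td>AVANISH</td></tr>"
                + "<tr><td>Next Listing Date (Likely):</td><td>10/01/2014</td></tr>"
                + "</table>";
        Map<String, String> fields = readRows(html, "data-table");
        System.out.println(fields.get("Next Listing Date (Likely):"));
    }
}
```

Rows with a single spanning cell, such as the "Pending" status line, are skipped here and would need to be read separately.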

Answer 1 (score: 0)

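As some comments have already pointed out, specific elements are hard to pick out because the tags carry no distinguishing attributes. However, if your table always keeps the same structure, perhaps with some empty values, you can use Jsoup's CSS selectors to pick specific elements by index.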

Document doc = Jsoup.parse(html); // html holds the page source as a String

Element pending = doc.select("table td:eq(0)").first();
Element nextDate = doc.select("table td:eq(0)").get(9);
Element date = doc.select("table td:eq(1)").last();

System.out.println(pending.text() + "\n" + nextDate.text() + "\n" + date.text());

This prints:

Pending
Next Listing Date (Likely):
10/01/2014

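Note the use of a pseudo-selector to specify an element's index: td:eq(0) matches every td whose index among its siblings is 0, that is, the first cell of each row.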

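If each element had distinct attributes, you could select them with an attribute selector such as [attr=value], in this case something like [VALIGN=top]. It is easy to see that this will not work in your case, because many cells share the same attribute values.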

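I strongly recommend reading more about how to use the selector syntax to parse HTML documents; the Jsoup selector-syntax documentation covers it in detail.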
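Index-based selectors break as soon as a row is added or removed. One alternative, sketched here under the assumption that every label cell is immediately followed by its value cell, is to find the cell by its text and read its next sibling. The class name LabelLookupDemo, the helper valueFor, and the shortened HTML are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LabelLookupDemo {

    /** Returns the text of the cell next to the first td whose text equals the label, or null. */
    static String valueFor(String html, String label) {
        Document doc = Jsoup.parse(html);
        for (Element cell : doc.select("td")) {
            if (cell.text().equals(label)) {
                Element value = cell.nextElementSibling();
                return value == null ? null : value.text();
            }
        }
        return null; // label not present
    }

    public static void main(String[] args) {
        // Shortened, hypothetical copy of two rows from the question's table.
        String html = "<table>"
                + "<tr><td>Last Listed on:</td><td>03/01/2014 in Court No. 48</td></tr>"
                + "<tr><td>Next Listing Date (Likely):</td><td>10/01/2014</td></tr>"
                + "</table>";
        System.out.println(valueFor(html, "Next Listing Date (Likely):"));
    }
}
```

Comparing the text in Java rather than inside the selector avoids having to escape the parentheses and colon in the label.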

Answer 2 (score: 0)

You can do the same thing with a regular expression in Java.

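UserAgent userAgent = new UserAgent();                   // create a new UserAgent (headless browser)
userAgent.visit(your_site_link);                         // visit the URL (your_site_link is a placeholder)
String siteText = userAgent.doc.innerHTML().toString();

// Lazily capture the content of each <td>...</td>; a greedy ".*" would span
// several cells when the whole table sits on one line, as it does here.
String REGEX = "<td[^>]*>(.*?)</td\\s*>";
Pattern pattern = Pattern.compile(REGEX, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher matcher = pattern.matcher(siteText);
while (matcher.find()) {
    System.out.println("TD data: " + matcher.group(1));
}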
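
When only one field is needed, the pattern can be anchored on the label text instead of matching every cell. The sketch below uses only java.util.regex; the class name LabelRegexDemo and the helper valueAfterLabel are made up for illustration, and the HTML string is a shortened copy of two rows from the question. It assumes the label cell and the value cell stay adjacent, and like any regex over HTML it is brittle if the markup changes.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LabelRegexDemo {

    /** Returns the content of the td that directly follows the td ending with the given label. */
    static Optional<String> valueAfterLabel(String html, String label) {
        // Pattern.quote keeps characters such as "(" and ")" in the label literal.
        Pattern p = Pattern.compile(
                Pattern.quote(label) + "\\s*</td>\\s*<td[^>]*>(.*?)</td>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<tr><td WIDTH='35%'>Last Listed on:</td><td>03/01/2014 in Court No. 48</td></tr>"
                + "<tr><td WIDTH='35%'>Next Listing Date (Likely):</td><td>10/01/2014</td></tr>";
        System.out.println(valueAfterLabel(html, "Next Listing Date (Likely):").orElse("not found"));
    }
}
```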