How do I parse an HTML table with jsoup?

Asked: 2015-07-23 19:31:47

Tags: android html parsing html-table jsoup

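I am trying to parse HTML with jsoup. This is my first time using jsoup, and I am finding it a bit difficult. The HTML table I am trying to parse is shown below. With its many TR and TD elements the HTML is quite complex, and I don't know how to go about selecting the name of each column in table 1, "Group Block" (table 0 is "Topline", which I don't need).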

I only need to select "bbd, bbgen, bbtest, conn, cpu, disk, files, hobbitd, http, info, memory, msgs, ports, procs, trends" so that I can show them in a TextView defined in an XML layout file. Is this possible with jsoup?

I should mention that I connect to the URL as follows:

String username = "user";
String password = "pass";
String login = username + ":" + password;
String base64login = new String(android.util.Base64.encode(login.getBytes(), android.util.Base64.NO_WRAP));
Document document = Jsoup.connect("http://example.com").header("Authorization", "Basic " + base64login).get();

HTML code:

<TABLE SUMMARY="Topline" WIDTH="100%">
<TR><TD HEIGHT=16>&nbsp;</TD></TR>  <!-- For the menu bar -->
<TR>
<TD VALIGN=MIDDLE ALIGN=LEFT WIDTH="30%">
<FONT FACE="Arial, Helvetica" SIZE="+1" COLOR="silver"><B>Xymon</B></FONT
</TD>
<TD VALIGN=MIDDLE ALIGN=CENTER WIDTH="40%">
<CENTER><FONT FACE="Arial, Helvetica" SIZE="+1" COLOR="silver"><B>Current Status</B></FONT></CENTER>
</TD>
<TD VALIGN=MIDDLE ALIGN=RIGHT WIDTH="30%">
<FONT FACE="Arial, Helvetica" SIZE="+1" COLOR="silver"><B>Thu Jul 23 16:05:06 2015</B></FONT>
</TD>
</TR>
<TR>
<TD COLSPAN=3> <HR WIDTH="100%"> </TD>
</TR>
</TABLE>
<BR>
<A NAME=hosts-blk>&nbsp;</A>


<CENTER><TABLE SUMMARY="Group Block" BORDER=0 CELLPADDING=2>
<TR><TD VALIGN=MIDDLE ROWSPAN=2><CENTER><FONT COLOR="#FFFFF0" SIZE="+1">&nbsp;</FONT></CENTER></TD>
<TD ALIGN=CENTER VALIGN=BOTTOM WIDTH=45> 
<A HREF="/hobbit-cgi/hobbitcolumn.sh?bbd"><FONT COLOR="#87a9e5" SIZE="-1"><B>bbd</B></FONT></A> </TD>
<TD ALIGN=CENTER VALIGN=BOTTOM WIDTH=45>
<A HREF="/hobbit-cgi/hobbitcolumn.sh?bbgen"><FONT COLOR="#87a9e5" SIZE="-1"><B>bbgen</B></FONT></A> </TD>
<TD ALIGN=CENTER VALIGN=BOTTOM WIDTH=45>
<A HREF="/hobbit-cgi/hobbitcolumn.sh?bbtest"><FONT COLOR="#87a9e5" SIZE="-1"><B>bbtest</B></FONT></A> </TD>
<TD ALIGN=CENTER VALIGN=BOTTOM WIDTH=45>
<A HREF="/hobbit-cgi/hobbitcolumn.sh?conn"><FONT COLOR="#87a9e5" SIZE="-1"><B>conn</B></FONT></A> </TD>
<TD ALIGN=CENTER VALIGN=BOTTOM WIDTH=45>
<A HREF="/hobbit-cgi/hobbitcolumn.sh?cpu"><FONT COLOR="#87a9e5" SIZE="-1"><B>cpu</B></FONT></A> </TD>
<TD ALIGN=CENTER VALIGN=BOTTOM WIDTH=45>
<A HREF="/hobbit-cgi/hobbitcolumn.sh?disk"><FONT COLOR="#87a9e5" SIZE="-1"><B>disk</B></FONT></A> </TD>
<TD ALIGN=CENTER VALIGN=BOTTOM WIDTH=45>
<A HREF="/hobbit-cgi/hobbitcolumn.sh?files"><FONT COLOR="#87a9e5" SIZE="-1"><B>files</B></FONT></A> </TD>
<TD ALIGN=CENTER VALIGN=BOTTOM WIDTH=45>
<A HREF="/hobbit-cgi/hobbitcolumn.sh?hobbitd"><FONT COLOR="#87a9e5" SIZE="-1"><B>hobbitd</B></FONT></A> </TD>
<TD ALIGN=CENTER VALIGN=BOTTOM WIDTH=45>
<A HREF="/hobbit-cgi/hobbitcolumn.sh?http"><FONT COLOR="#87a9e5" SIZE="-1"><B>http</B></FONT></A> </TD>
<TD ALIGN=CENTER VALIGN=BOTTOM WIDTH=45>
<A HREF="/hobbit-cgi/hobbitcolumn.sh?info"><FONT COLOR="#87a9e5" SIZE="-1"><B>info</B></FONT></A> </TD>
<TD ALIGN=CENTER VALIGN=BOTTOM WIDTH=45>
<A HREF="/hobbit-cgi/hobbitcolumn.sh?memory"><FONT COLOR="#87a9e5" SIZE="-1"><B>memory</B></FONT></A> </TD>
<TD ALIGN=CENTER VALIGN=BOTTOM WIDTH=45>
<A HREF="/hobbit-cgi/hobbitcolumn.sh?msgs"><FONT COLOR="#87a9e5" SIZE="-1"><B>msgs</B></FONT></A> </TD>
<TD ALIGN=CENTER VALIGN=BOTTOM WIDTH=45>
<A HREF="/hobbit-cgi/hobbitcolumn.sh?ports"><FONT COLOR="#87a9e5" SIZE="-1"><B>ports</B></FONT></A> </TD>
<TD ALIGN=CENTER VALIGN=BOTTOM WIDTH=45>
<A HREF="/hobbit-cgi/hobbitcolumn.sh?procs"><FONT COLOR="#87a9e5" SIZE="-1"><B>procs</B></FONT></A> </TD>
<TD ALIGN=CENTER VALIGN=BOTTOM WIDTH=45>
<A HREF="/hobbit-cgi/hobbitcolumn.sh?trends"><FONT COLOR="#87a9e5" SIZE="-1"><B>trends</B></FONT></A> </TD>
</TR> 
<TR><TD COLSPAN=15><HR WIDTH="100%"></TD></TR>

Edit:

I tried this, but it doesn't work:

ArrayList<String> groupBlock = new ArrayList<String>();
Object[] objPlace;
Element table = document.select("TABLE").get(1); //select the second table:     "Group Block"
Elements rows = table.select("TR");             
for (int i = 0; i < rows.size(); i++) {
    Element row = rows.get(i);
    Elements col = row.select("TD");
    if (col.get(1).text().equals("bbd")) { // check only one field for now
        groupBlock.add(col.get(1).text());  
    }
}
objPlace = groupBlock.toArray();

Then I do this:

TextView txtGroupBlock = (TextView) findViewById(R.id.txtGroupBlock);
txtGroupBlock.setText("");
for (int i = 0; i < objPlace.length; i++) {
    txtGroupBlock.append(objPlace[i].toString() + " ");
}

Error:

07-23 21:26:36.454: E/AndroidRuntime(330): FATAL EXCEPTION: AsyncTask #1
07-23 21:26:36.454: E/AndroidRuntime(330): java.lang.RuntimeException: An error occured while executing doInBackground()
07-23 21:26:36.454: E/AndroidRuntime(330):  at android.os.AsyncTask$3.done(AsyncTask.java:200)
07-23 21:26:36.454: E/AndroidRuntime(330):  at java.util.concurrent.FutureTask$Sync.innerSetException(FutureTask.java:274)
07-23 21:26:36.454: E/AndroidRuntime(330):  at java.util.concurrent.FutureTask.setException(FutureTask.java:125)
07-23 21:26:36.454: E/AndroidRuntime(330):  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:308)
07-23 21:26:36.454: E/AndroidRuntime(330):  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
07-23 21:26:36.454: E/AndroidRuntime(330):  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1088)
07-23 21:26:36.454: E/AndroidRuntime(330):  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:581)
07-23 21:26:36.454: E/AndroidRuntime(330):  at java.lang.Thread.run(Thread.java:1019)
07-23 21:26:36.454: E/AndroidRuntime(330): Caused by: java.lang.IndexOutOfBoundsException: Invalid index 1, size is 1
07-23 21:26:36.454: E/AndroidRuntime(330):  at java.util.ArrayList.throwIndexOutOfBoundsException(ArrayList.java:257)
07-23 21:26:36.454: E/AndroidRuntime(330):  at java.util.ArrayList.get(ArrayList.java:311)
07-23 21:26:36.454: E/AndroidRuntime(330):  at org.jsoup.select.Elements.get(Elements.java:544)
07-23 21:26:36.454: E/AndroidRuntime(330):  at activities.monitorapp.MainActivity$Update.doInBackground(MainActivity.java:211)
07-23 21:26:36.454: E/AndroidRuntime(330):  at activities.monitorapp.MainActivity$Update.doInBackground(MainActivity.java:1)
07-23 21:26:36.454: E/AndroidRuntime(330):  at android.os.AsyncTask$2.call(AsyncTask.java:185)
07-23 21:26:36.454: E/AndroidRuntime(330):  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:306)

Edit 2:

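Now I have a related problem. I need to do something similar to before, but this time with the following HTML code (it continues directly from the HTML above; it is the same HTML file):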

...
<TD ALIGN=CENTER VALIGN=BOTTOM WIDTH=45>
<A HREF="/hobbit-cgi/hobbitcolumn.sh?procs"><FONT COLOR="#87a9e5" SIZE="-1"><B>procs</B></FONT></A> </TD>
<TD ALIGN=CENTER VALIGN=BOTTOM WIDTH=45>
<A HREF="/hobbit-cgi/hobbitcolumn.sh?trends"><FONT COLOR="#87a9e5" SIZE="-1"><B>trends</B></FONT></A> </TD>
</TR> 
<TR><TD COLSPAN=15><HR WIDTH="100%"></TD></TR>

<TR class=line>
<TD NOWRAP><A NAME="hostname1">&nbsp;</A>
<FONT SIZE="+1" COLOR="#FFFFCC" FACE="Tahoma, Arial, Helvetica"><span title="127.0.0.1">hostname1</span></FONT><TD ALIGN=CENTER><A HREF="/hobbit-cgi/bb-hostsvc.sh?HOST=hostname1.&amp;SERVICE=bbd"><IMG SRC="/hobbit/gifs/static/green.gif" ALT="bbd:green:268d04h25m" TITLE="bbd:green:268d04h25m" HEIGHT="16" WIDTH="16" BORDER=0></A></TD>
<TD ALIGN=CENTER><A HREF="/hobbit-cgi/bb-hostsvc.sh?HOST=hostname1&amp;SERVICE=bbgen"><IMG SRC="/hobbit/gifs/static/green.gif" ALT="bbgen:green:268d04h24m" TITLE="bbgen:green:268d04h24m" HEIGHT="16" WIDTH="16" BORDER=0></A></TD>
<TD ALIGN=CENTER><A HREF="/hobbit-cgi/bb-hostsvc.sh?HOST=hostname1&amp;SERVICE=bbtest"><IMG SRC="/hobbit/gifs/static/green.gif" ALT="bbtest:green:268d04h25m" TITLE="bbtest:green:268d04h25m" HEIGHT="16" WIDTH="16" BORDER=0></A></TD>
<TD ALIGN=CENTER><A HREF="/hobbit-cgi/bb-hostsvc.sh?HOST=hostname1&amp;SERVICE=conn"><IMG SRC="/hobbit/gifs/static/green.gif" ALT="conn:green:268d04h25m" TITLE="conn:green:268d04h25m" HEIGHT="16" WIDTH="16" BORDER=0></A></TD>
<TD ALIGN=CENTER><A HREF="/hobbit-cgi/bb-hostsvc.sh?HOST=hostname1&amp;SERVICE=cpu"><IMG SRC="/hobbit/gifs/static/green.gif" ALT="cpu:green:169d00h15m" TITLE="cpu:green:169d00h15m" HEIGHT="16" WIDTH="16" BORDER=0></A></TD>
<TD ALIGN=CENTER><A HREF="/hobbit-cgi/bb-hostsvc.sh?HOST=hostname1&amp;SERVICE=disk"><IMG SRC="/hobbit/gifs/static/green.gif" ALT="disk:green:268d04h25m" TITLE="disk:green:268d04h25m" HEIGHT="16" WIDTH="16" BORDER=0></A></TD>
<TD ALIGN=CENTER><A HREF="/hobbit-cgi/bb-hostsvc.sh?HOST=hostname1&amp;SERVICE=files"><IMG SRC="/hobbit/gifs/static/clear.gif" ALT="files:clear:268d04h25m" TITLE="files:clear:268d04h25m" HEIGHT="16" WIDTH="16" BORDER=0></A></TD>
<TD ALIGN=CENTER><A HREF="/hobbit-cgi/bb-hostsvc.sh?HOST=hostname1&amp;SERVICE=hobbitd"><IMG SRC="/hobbit/gifs/static/green.gif" ALT="hobbitd:green:169d01h05m" TITLE="hobbitd:green:169d01h05m" HEIGHT="16" WIDTH="16" BORDER=0></A></TD>
<TD ALIGN=CENTER><A HREF="/hobbit-cgi/bb-hostsvc.sh?HOST=hostname1&amp;SERVICE=http"><IMG SRC="/hobbit/gifs/static/green.gif" ALT="http:green:268d04h19m" TITLE="http:green:268d04h19m" HEIGHT="16" WIDTH="16" BORDER=0></A></TD>
<TD ALIGN=CENTER><A HREF="/hobbit-cgi/bb-hostsvc.sh?HOST=hostname1&amp;SERVICE=info"><IMG SRC="/hobbit/gifs/static/green.gif" ALT="info:green:127.0.0.1" TITLE="info:green:127.0.0.1" HEIGHT="16" WIDTH="16" BORDER=0></A></TD>
<TD ALIGN=CENTER><A HREF="/hobbit-cgi/bb-hostsvc.sh?HOST=hostname1&amp;SERVICE=memory"><IMG SRC="/hobbit/gifs/static/green.gif" ALT="memory:green:268d04h25m" TITLE="memory:green:268d04h25m" HEIGHT="16" WIDTH="16" BORDER=0></A></TD>
<TD ALIGN=CENTER><A HREF="/hobbit-cgi/bb-hostsvc.sh?HOST=hostname1&amp;SERVICE=msgs"><IMG SRC="/hobbit/gifs/static/green.gif" ALT="msgs:green:268d04h20m" TITLE="msgs:green:268d04h20m" HEIGHT="16" WIDTH="16" BORDER=0></A></TD>
<TD ALIGN=CENTER><A HREF="/hobbit-cgi/bb-hostsvc.sh?HOST=hostname1&amp;SERVICE=ports"><IMG SRC="/hobbit/gifs/static/clear.gif" ALT="ports:clear:268d04h25m" TITLE="ports:clear:268d04h25m" HEIGHT="16" WIDTH="16" BORDER=0></A></TD>
<TD ALIGN=CENTER><A HREF="/hobbit-cgi/bb-hostsvc.sh?HOST=hostname1&amp;SERVICE=procs"><IMG SRC="/hobbit/gifs/static/clear.gif" ALT="procs:clear:268d04h25m" TITLE="procs:clear:268d04h25m" HEIGHT="16" WIDTH="16" BORDER=0></A></TD>
<TD ALIGN=CENTER><A HREF="/hobbit-cgi/bb-hostsvc.sh?HOST=hostname1&amp;SERVICE=trends"><IMG SRC="/hobbit/gifs/static/green.gif" ALT="trends:green:" TITLE="trends:green:" HEIGHT="16" WIDTH="16" BORDER=0></A></TD>
</TR>

<TR class=line>
<TD NOWRAP><A NAME="hostname2">&nbsp;</A>
<FONT SIZE="+1" COLOR="#FFFFCC" FACE="Tahoma, Arial, Helvetica"><span title="127.0.0.2">hostname2</span></FONT><TD ALIGN=CENTER><A HREF="/hobbit-cgi/bb-hostsvc.sh?HOST=hostname2&amp;SERVICE=bbd"><IMG SRC="/hobbit/gifs/static/red.gif" ALT="bbd:red:16d06h46m" TITLE="bbd:red:16d06h46m" HEIGHT="16" WIDTH="16" BORDER=0></A></TD>
<TD ALIGN=CENTER>-</TD>
<TD ALIGN=CENTER>-</TD>
<TD ALIGN=CENTER><A HREF="/hobbit-cgi/bb-hostsvc.sh?HOST=hostname2&amp;SERVICE=conn"><IMG SRC="/hobbit/gifs/static/green.gif" ALT="conn:green:16d06h46m" TITLE="conn:green:16d06h46m" HEIGHT="16" WIDTH="16" BORDER=0></A></TD>
<TD ALIGN=CENTER>-</TD>
<TD ALIGN=CENTER>-</TD>
<TD ALIGN=CENTER>-</TD>
<TD ALIGN=CENTER>-</TD>
<TD ALIGN=CENTER><A HREF="/hobbit-cgi/bb-hostsvc.sh?HOST=hostname2&amp;SERVICE=http"><IMG SRC="/hobbit/gifs/static/green.gif" ALT="http:green:16d06h46m" TITLE="http:green:16d06h46m" HEIGHT="16" WIDTH="16" BORDER=0></A></TD>
<TD ALIGN=CENTER><A HREF="/hobbit-cgi/bb-hostsvc.sh?HOST=hostname2&amp;SERVICE=info"><IMG SRC="/hobbit/gifs/static/green.gif" ALT="info:green:127.0.0.2" TITLE="info:green:127.0.0.2" HEIGHT="16" WIDTH="16" BORDER=0></A></TD>
<TD ALIGN=CENTER>-</TD>
<TD ALIGN=CENTER>-</TD>
<TD ALIGN=CENTER>-</TD>
<TD ALIGN=CENTER>-</TD>
<TD ALIGN=CENTER><A HREF="/hobbit-cgi/bb-hostsvc.sh?HOST=hostname2&amp;SERVICE=trends"><IMG SRC="/hobbit/gifs/static/green.gif" ALT="trends:green:" TITLE="trends:green:" HEIGHT="16" WIDTH="16" BORDER=0></A></TD>
</TR>

</TABLE></CENTER><BR>
<BR><BR>

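In this case I need to parse the two host names (hostname1 and hostname2) and put each one in its own TextView, but the problem is that the host names may change in the future. I also need to parse the "IMG SRC" in each TD, for example: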

<TD ALIGN=CENTER><A HREF="/hobbit-cgi/bb-hostsvc.sh?HOST=hostname1&amp;SERVICE=http"><IMG SRC="/hobbit/gifs/static/green.gif" ALT="http:green:268d04h19m" TITLE="http:green:268d04h19m" HEIGHT="16" WIDTH="16" BORDER=0></A></TD>

I need to extract /hobbit/gifs/static/green.gif and prepend the rest of the URL, giving http://example.com/hobbit/gifs/static/green.gif, so that I can fetch the image.

I know that once I have the image URL, I have to do something like:

InputStream input = new java.net.URL(imgSrc).openStream();
bitmap = BitmapFactory.decodeStream(input);
ImageView logoimg = (ImageView) findViewById(R.id.logo);
logoimg.setImageBitmap(bitmap);

But I'm lost on the steps before that... Any ideas? I don't know how to start...

1 answer:

Answer 0 (score: 0)

The problem is here:

if (col.get(1).text().equals("bbd")) {
  groupBlock.add(col.get(i).text());  
}

You are trying to access col.get(i), but i may be out of bounds, which is also what the error is telling you.
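The stack trace in the question reports "Invalid index 1, size is 1" from Elements.get, so some lookup with index 1 hit a collection holding a single element. Assuming both tables were found, the most likely candidate is the separator row `<TR><TD COLSPAN=15>`, which contains only one TD. Below is a minimal sketch of the guard, with plain Java lists standing in for jsoup's Elements; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SafeCellAccess {
    // Collects the cell at `index` from every row that actually has such a cell.
    static List<String> cellsAt(List<List<String>> rows, int index) {
        List<String> out = new ArrayList<>();
        for (List<String> cells : rows) {
            if (index < cells.size()) { // skip short rows instead of throwing
                out.add(cells.get(index));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("", "bbd", "bbgen"), // header row: blank cell, then column names
            Arrays.asList("")                  // separator row: a single COLSPAN cell
        );
        System.out.println(cellsAt(rows, 1)); // prints [bbd]
    }
}
```

In the jsoup loop from the question, the equivalent check would be to test `col.size() > 1` before calling `col.get(1)`.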

If you change the index to the one you actually want, you should be fine. Maybe something like this:

ArrayList<String> groupBlock = new ArrayList<String>();
Object[] objPlace;
Element table = document.select("TABLE").get(1); //select the second table:     "Group Block"
Elements rows = table.select("TR");             
for (int i = 0; i < rows.size(); i++) {
    Element row = rows.get(i);
    Elements cols = row.select("TD");
    for (Element col : cols){
        switch(col.text()){
        case "bbd": 
        case "bbgen":
        case "bbtest":
        //...more cases if you need them
            groupBlock.add(col.select("a").first().attr("href"));
            System.out.println(col.text()); 
            break;
        default:
            break;
        }
    }      
}
objPlace = groupBlock.toArray();

I'm not sure what you need from the DOM, but I think you get the idea.
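For the column names and for the second edit (host names and image URLs), CSS selectors can pick out the pieces directly, without index arithmetic. This is a sketch only, assuming jsoup is on the classpath and the page has the structure shown in the question. The class name, the method names, and the trimmed sample HTML are made up for illustration. Passing a base URI to the parser lets `absUrl("src")` turn /hobbit/gifs/static/green.gif into a full URL.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class GroupBlockSketch {
    // Trimmed-down stand-in for the page in the question (illustrative only).
    static final String SAMPLE_HTML =
        "<TABLE SUMMARY=\"Topline\"><TR><TD>Xymon</TD></TR></TABLE>"
        + "<TABLE SUMMARY=\"Group Block\">"
        + "<TR><TD ROWSPAN=2>&nbsp;</TD>"
        + "<TD><A HREF=\"/hobbit-cgi/hobbitcolumn.sh?bbd\"><FONT><B>bbd</B></FONT></A></TD>"
        + "<TD><A HREF=\"/hobbit-cgi/hobbitcolumn.sh?conn\"><FONT><B>conn</B></FONT></A></TD></TR>"
        + "<TR><TD COLSPAN=2><HR></TD></TR>"
        + "<TR class=line><TD><A NAME=\"hostname1\">&nbsp;</A>"
        + "<FONT><span title=\"127.0.0.1\">hostname1</span></FONT>"
        + "<TD><A HREF=\"/hobbit-cgi/bb-hostsvc.sh?HOST=hostname1&amp;SERVICE=bbd\">"
        + "<IMG SRC=\"/hobbit/gifs/static/green.gif\" ALT=\"bbd:green:268d04h25m\"></A></TD>"
        + "<TD>-</TD></TR>"
        + "</TABLE>";

    // Restricts every query to the second table, identified by its SUMMARY attribute.
    private static final String TABLE = "table[summary=\"Group Block\"] ";

    // Column names are the link texts in the header row (links to hobbitcolumn.sh).
    static List<String> columnNames(Document doc) {
        List<String> names = new ArrayList<>();
        for (Element link : doc.select(TABLE + "a[href*=hobbitcolumn]")) {
            names.add(link.text());
        }
        return names;
    }

    // Host names are read from the <span> in each row marked class=line,
    // so nothing depends on the names themselves staying the same.
    static List<String> hostNames(Document doc) {
        List<String> hosts = new ArrayList<>();
        for (Element span : doc.select(TABLE + "tr.line span[title]")) {
            hosts.add(span.text());
        }
        return hosts;
    }

    // Status icon URLs, resolved against the document's base URI.
    static List<String> statusImageUrls(Document doc) {
        List<String> urls = new ArrayList<>();
        for (Element img : doc.select(TABLE + "tr.line img[src]")) {
            urls.add(img.absUrl("src"));
        }
        return urls;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML, "http://example.com/");
        System.out.println(columnNames(doc));
        System.out.println(hostNames(doc));
        System.out.println(statusImageUrls(doc));
    }
}
```

With the live page, the Document would come from the `Jsoup.connect(...).get()` call already shown in the question, in which case the base URI is taken from the request URL. Cells that contain only "-" have no IMG and are skipped by the last selector, so the ALT text (for example "bbd:green:268d04h25m" in the question's HTML) is what ties an icon back to its column.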