Using HtmlUnit

Time: 2018-06-24 06:20:23

Tags: java htmlunit

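I am trying to extract data from a web page using HtmlUnit. So far I have done this by converting the HtmlPage to text and then pulling the data out of that text with regular expressions. I have also managed to extract data from an HTML table by using the class attribute in the HTML.

As a learning exercise, I now want to redo all of the extraction purely with HtmlUnit, covering the same requirements I met with regular expressions. What I cannot work out is how to extract the data inside the tags as key-value pairs.

Here is the sample HTML data: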

<div class="top_red_bar">
    <div id="site-breadcrumbs">
        <a href="/admin/index.jsp" title="Home">Home</a>
        &#124;
        <a href="/admin/queues.jsp" title="Queues">Queues</a>
        &#124;
        <a href="/admin/topics.jsp" title="Topics">Topics</a>
        &#124;
        <a href="/admin/subscribers.jsp" title="Subscribers">Subscribers</a>
        &#124;
        <a href="/admin/connections.jsp" title="Connections">Connections</a>
        &#124;
        <a href="/admin/network.jsp" title="Network">Network</a>
        &#124;
         <a href="/admin/scheduled.jsp" title="Scheduled">Scheduled</a>
        &#124;
        <a href="/admin/send.jsp"
           title="Send">Send</a>
    </div>
    <div id="site-quicklinks"><P>
        <a href="http://activemq.apache.org/support.html"
           title="Get help and support using Apache ActiveMQ">Support</a></p>
    </div>
</div>

<table border="0">
<tbody>
    <tr>
        <td valign="top" width="100%" style="overflow:hidden;">
            <div class="body-content">


<h2>Welcome!</h2>

<p>
Welcome to the Apache ActiveMQ Console of <b>localhost</b> (ID:TOOLCONTROLPJX526-524666-65544585445-2:3)
</p>

<p>
You can find more information about Apache ActiveMQ on the <a href="http://activemq.apache.org/">Apache ActiveMQ Site</a>
</p>

<h2>Broker</h2>


<table>
    <tr>
        <td>Name</td>
        <td><b>localhost</b></td>
    </tr>
    <tr>
        <td>Version</td>
        <td><b>5.13.3</b></td>
    </tr>
    <tr>
        <td>ID</td>
        <td><b>ID:TOOLCONTROLPJX526-524666-65544585445-2:3</b></td>
    </tr>
    <tr>
        <td>Uptime</td>
        <td><b>17 days 13 hours</b></td>
    </tr>
    <tr>
        <td>Store percent used</td>
        <td><b>19</b></td>
    </tr>
    <tr>
        <td>Memory percent used</td>
        <td><b>0</b></td>
    </tr>
    <tr>
        <td>Temp percent used</td>
        <td><b>0</b></td>
    </tr>
</table>

I want to extract the data between the table tags. Expected output:

Name:localhost
Version:5.13.3
ID:ID:TOOLCONTROLPJX526-524666-65544585445-2:3
Uptime:17 days 13 hours
Store percent used:19
Memory percent used:0
Temp percent used:0

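How can I achieve this? I would like to know which HtmlUnit methods to use for it.

1 answer:

Answer 0 (score: 0)

These are the steps I followed (this is not the only solution):

  1. Parse the string with the parseHtml method, using a dummy URL
  2. Get the second table via XPath
  3. Iterate with two nested loops (a for loop over the rows and an iterator over the cells), printing the separator only between cells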

ExtractTableData:

import java.net.URL;

import com.gargoylesoftware.htmlunit.StringWebResponse;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HTMLParser;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlTable;
import com.gargoylesoftware.htmlunit.html.HtmlTableRow;
import com.gargoylesoftware.htmlunit.html.HtmlTableRow.CellIterator;


public class ExtractTableData {

    public static void main(String[] args) throws Exception {

        String html = "<div class=\"top_red_bar\">\n" + "                        <div id=\"site-breadcrumbs\">\n"
                + "                            <a href=\"/admin/index.jsp\" title=\"Home\">Home</a>\n"
                + "                            &#124;\n"
                + "                            <a href=\"/admin/queues.jsp\" title=\"Queues\">Queues</a>\n"
                + "                            &#124;\n"
                + "                            <a href=\"/admin/topics.jsp\" title=\"Topics\">Topics</a>\n"
                + "                            &#124;\n"
                + "                            <a href=\"/admin/subscribers.jsp\" title=\"Subscribers\">Subscribers</a>\n"
                + "                            &#124;\n"
                + "                            <a href=\"/admin/connections.jsp\" title=\"Connections\">Connections</a>\n"
                + "                            &#124;\n"
                + "                            <a href=\"/admin/network.jsp\" title=\"Network\">Network</a>\n"
                + "                            &#124;\n"
                + "                             <a href=\"/admin/scheduled.jsp\" title=\"Scheduled\">Scheduled</a>\n"
                + "                            &#124;\n" + "                            <a href=\"/admin/send.jsp\"\n"
                + "                               title=\"Send\">Send</a>\n" + "                        </div>\n"
                + "                        <div id=\"site-quicklinks\"><P>\n"
                + "                            <a href=\"http://activemq.apache.org/support.html\"\n"
                + "                               title=\"Get help and support using Apache ActiveMQ\">Support</a></p>\n"
                + "                        </div>\n" + "                    </div>\n" + "\n"
                + "                    <table border=\"0\">\n" + "                        <tbody>\n"
                + "                            <tr>\n"
                + "                                <td valign=\"top\" width=\"100%\" style=\"overflow:hidden;\">\n"
                + "                                    <div class=\"body-content\">\n" + "\n" + "\n"
                + "<h2>Welcome!</h2>\n" + "\n" + "<p>\n"
                + "Welcome to the Apache ActiveMQ Console of <b>localhost</b> (ID:TOOLCONTROLPJX526-524666-65544585445-2:3)\n"
                + "</p>\n" + "\n" + "<p>\n"
                + "You can find more information about Apache ActiveMQ on the <a href=\"http://activemq.apache.org/\">Apache ActiveMQ Site</a>\n"
                + "</p>\n" + "\n" + "<h2>Broker</h2>\n" + "\n" + "\n" + "<table>\n" + "    <tr>\n"
                + "        <td>Name</td>\n" + "        <td><b>localhost</b></td>\n" + "    </tr>\n" + "    <tr>\n"
                + "        <td>Version</td>\n" + "        <td><b>5.13.3</b></td>\n" + "    </tr>\n" + "    <tr>\n"
                + "        <td>ID</td>\n" + "        <td><b>ID:TOOLCONTROLPJX526-524666-65544585445-2:3</b></td>\n"
                + "    </tr>\n" + "    <tr>\n" + "        <td>Uptime</td>\n"
                + "        <td><b>17 days 13 hours</b></td>\n" + "    </tr>\n" + "    <tr>\n"
                + "        <td>Store percent used</td>\n" + "        <td><b>19</b></td>\n" + "    </tr>\n"
                + "    <tr>\n" + "        <td>Memory percent used</td>\n" + "        <td><b>0</b></td>\n"
                + "    </tr>\n" + "    <tr>\n" + "        <td>Temp percent used</td>\n" + "        <td><b>0</b></td>\n"
                + "    </tr>\n" + "</table>";
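        // try-with-resources closes the WebClient (and its windows) when done.
        try (WebClient webClient = new WebClient()) {
            // Parse the in-memory HTML. The URL is only a placeholder base URL;
            // nothing is fetched from it.
            HtmlPage page = HTMLParser.parseHtml(
                    new StringWebResponse(html, new URL("http://dummy.url.for.parsing.com/")),
                    webClient.getCurrentWindow());

            // "//table" matches every table in document order; index 1 is the
            // second one, i.e. the nested "Broker" table.
            final HtmlTable table = (HtmlTable) page.getByXPath("//table").get(1);

            for (final HtmlTableRow row : table.getRows()) {

                CellIterator cellIterator = row.getCellIterator();

                // Print the first cell as-is, then prefix every following cell
                // with ":" so the separator only appears between cells.
                if (cellIterator.hasNext()) {
                    System.out.print(cellIterator.next().asText());
                    while (cellIterator.hasNext()) {
                        System.out.print(":" + cellIterator.next().asText());
                    }
                }
                System.out.println();
            }
        }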

    }

}

Output:

Name:localhost
Version:5.13.3
ID:ID:TOOLCONTROLPJX526-524666-65544585445-2:3
Uptime:17 days 13 hours
Store percent used:19
Memory percent used:0
Temp percent used:0
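
Since the question asks for key-value pairs, it may be more convenient to collect the rows into a map than to print them. The ID value itself contains colons, so splitting the printed lines on ":" afterwards would be ambiguous. The following is only a minimal sketch using plain Java collections: the class TableRowsToMap and the helper toMap are hypothetical names, and the hard-coded rows stand in for the cell texts that the loop above reads with asText().

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of HtmlUnit.
class TableRowsToMap {

    // Each inner list holds the text of one table row's cells, in order.
    // The first cell becomes the key; any remaining cells are joined with ":".
    public static Map<String, String> toMap(List<List<String>> rows) {
        // LinkedHashMap keeps the rows in the order they appear in the table.
        Map<String, String> result = new LinkedHashMap<>();
        for (List<String> cells : rows) {
            if (cells.size() < 2) {
                continue; // skip rows that have no value cell
            }
            String key = cells.get(0).trim();
            String value = String.join(":", cells.subList(1, cells.size())).trim();
            result.put(key, value);
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in data copied from the sample table in the question.
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Name", "localhost"),
                Arrays.asList("Version", "5.13.3"),
                Arrays.asList("ID", "ID:TOOLCONTROLPJX526-524666-65544585445-2:3"),
                Arrays.asList("Uptime", "17 days 13 hours"));

        Map<String, String> broker = toMap(rows);
        broker.forEach((key, value) -> System.out.println(key + ":" + value));
    }
}
```

With HtmlUnit, the inner lists could be filled inside the row loop by reading the text of each cell of the row; after that, a lookup such as broker.get("Version") gives a single value without any string parsing.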