Question

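I am writing a parser for a crawled HTML file. The parser is supposed to extract the thread title, the user who posted and the total views. I managed to get the HTML tags, but the problem is that it does not retrieve all of the thread titles, only some of them.

The HTML code (sorry for the poor alignment, I copied it from the site's page source):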

<tbody id="threadbits_forum_2">

<tr>
<td class="alt1" id="td_threadstatusicon_3396832">

    <img src="http://www.hardwarezone.com.sg/img/forums/hwz/statusicon/thread_hot.gif" id="thread_statusicon_3396832" alt="" border="" />
</td>

    <td class="alt2">&nbsp;</td>


<td class="alt1" id="td_threadtitle_3396832" title="Updated on 3 October 2011  

Please check Price Guides for latest prices 

 A PC Buyer&#8217;s Guide that is everything to everyone is simply not possible. This     is a simple guide to putting together a PC with a local flavour. Be sure to read PC Buyer&#8217;s Guide from other media.  

If you have any...">


    <div>

            <span style="float:right">






                 <img class="inlineimg" src="http://www.hardwarezone.com.sg/img/forums/hwz/misc/sticky.gif" alt="Sticky Thread" /> 
            </span>



        <font color=red><b>Sticky: </b></font>


        <a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&amp;t=3396832" id="thread_title_3396832">Buyer's Guide II: Extreme, High-End, Mid-Range, Budget, and Entry Level Systems - Part 2</a>
        <span class="smallfont" style="white-space:nowrap">(<img class="inlineimg" src="http://www.hardwarezone.com.sg/img/forums/hwz/misc/multipage.gif" alt="Multi-page thread" border="0" />  <a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&amp;t=3396832">1</a> <a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&amp;t=3396832&amp;page=2">2</a> <a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&amp;t=3396832&amp;page=3">3</a> <a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&amp;t=3396832&amp;page=4">4</a> <a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&amp;t=3396832&amp;page=5">5</a> ... <a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&amp;t=3396832&amp;page=17">Last Page</a>)</span>
    </div>



    <div class="smallfont">


            <span style="cursor:pointer" onclick="window.open('member.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&amp;u=39963', '_self')">adrianlee</span>

    </div>

My code so far:

 try(BufferedReader br = new BufferedReader(new FileReader(pageThread)))
    {
        String html = "";

        while(br.readLine() != null)
        {
            html += br.readLine() + "\n";
        }

        Document doc = Jsoup.parse(html);
        //To get the thread list

        Elements threadsList = doc.select("tbody[id^=threadbits_forum]").select("tr");

        for(Element e: threadsList)
        {
            //To get the title
            System.out.println("Title: " + e.select("a[id^=thread_title]").text());
        }

        System.exit(0);

    }catch(Exception e)
    {
        e.printStackTrace();
    }

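Result:

Title:
Title: Want to be part of the HardwareZone editorial team?
Title:
Title: pa9797 is a new Rig !!
Title: [EPIC] Another first from Andyson, Platinum Modular PSU
Title:
Title: Which shops in SLS are good for buying a new CPU?... so

Do you have a fix for this problem?

Thanks.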

Answer 1

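When parsing a web page with Jsoup, you should first obtain the web document in the proper way. Not that your way is wrong, but you are making it harder for yourself than it needs to be.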
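One concrete cost of assembling the HTML string by hand: the loop in the question calls readLine() once in the while condition and again in the body, so every other line of the file is thrown away. That alone can make some titles disappear. The following is a minimal sketch using only the standard library; the class name ReadLineDemo, the helper names and the sample text are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReadLineDemo {

    // Mirrors the loop from the question: readLine() runs twice per pass.
    static String readSkipping(String text) {
        StringBuilder html = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            while (br.readLine() != null) {              // this line is read and discarded
                html.append(br.readLine()).append("\n"); // only the next line is kept
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return html.toString();
    }

    // Reads each line exactly once by storing it before testing it.
    static String readAll(String text) {
        StringBuilder html = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = br.readLine()) != null) {
                html.append(line).append("\n");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return html.toString();
    }

    public static void main(String[] args) {
        String sample = "row 1\nrow 2\nrow 3\nrow 4\n";
        System.out.print(readSkipping(sample)); // prints only "row 2" and "row 4"
        System.out.print(readAll(sample));      // prints all four rows
    }
}
```

If the page is already saved on disk, Jsoup.parse(File, String charsetName) can read the file directly, which avoids the manual loop altogether.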

To create a Document object for a web page, start with:

String url = "http://www.google.com"; // Jsoup.connect needs an absolute http/https URL
Document doc = Jsoup.connect(url).get();

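From this document you can select the forum's thread titles. Another example, taken straight from the Jsoup cookbook, selects links that have an href attribute.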

Elements links = doc.select("a[href]"); //a with href

If you are not getting the elements you want, your selector is incorrect.

Here you select all <tr> elements inside every <tbody> element whose id starts with threadbits_forum.

Elements threadsList = doc.select("tbody[id^=threadbits_forum]").select("tr");

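Since I don't know which forum you are parsing, I can only look at other threadbits forums that may have a similar HTML layout.

If you look at this site, http://forums.hardwarezone.com.sg/corbell-ecustomer-service-center-166/, you can see that all threads sit inside a <td> with the class alt1.

If you select only that, you will also get usernames and other things that use the same class, but since you only need the thread titles, you have to select the <a> tag inside it.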

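You end up with the following select query:

Elements titles = doc.select("td.alt1 a[id^=thread_title]");

To simulate your original problem on this forum, you can do the following, which prints the thread titles:

    String url = "http://forums.hardwarezone.com.sg/corbell-ecustomer-service-center-166/";
    Document doc = Jsoup.connect(url).get();
    // Anchors whose id starts with thread_title, inside <td class="alt1"> cells
    Elements titles = doc.select("td.alt1 a[id^=thread_title]");
    for (Element e : titles) {
        System.out.println(e.text());
    }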

Hope this helps you build the right selector!

Jsoup: getting thread titles from a forum
