Jsoup parses a table 3 times?

Time: 2013-07-15 16:00:44

Tags: java html csv jsoup

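I have this odd problem and I'm at my wits' end. Maybe a fresh pair of eyes can spot it!

I'm using jSoup to parse an HTML file. The problem is that the set of tables gets written to the output file 3-4 times, even when I write to a brand-new file. The first copy comes out as one long single line in the .csv file, while every later copy is formatted exactly the way I want. Obviously I only want the data once, and correctly formatted the first time!

My code: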

    // Jsoup.parse builds the Document itself, so no separate constructor call is needed.
    Document doc = Jsoup.parse(file, null);

    Elements tables = doc.select("table");

    for (Element table: tables) {
        Elements rows = table.select("tr");
        for (Element row: rows) {
            Elements cells = row.getElementsByTag("td");
            StringBuffer values = new StringBuffer();
            for (Element cell: cells) {
                String cellText = cell.text();
                cellText = cellText.replaceAll(",", "");
                cellText = cellText.replaceAll("£", ",£");
                cellText = cellText.replaceAll(",£", "£");
                System.out.println(cellText);
                values.append(cellText + ",");
            }
            System.out.println(values.toString());
            addToFile(values + ",");
        }
    }

// Append one row to the .csv output file.
private static void addToFile(String myString) {
    // The 'true' flag opens the file in append mode, so rows from earlier
    // runs stay in MyParsedDOMTree.csv unless the file is deleted first.
    try (BufferedWriter out = new BufferedWriter(new FileWriter(
            "MyParsedDOMTree.csv", true))) {
        out.write(myString + "\n");
    } catch (IOException e) {
        e.printStackTrace();
    }
}

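It may just be that this is a complicated HTML file, with all sorts of tables nested inside one another, but I don't see how that would cause a table of numeric data that appears only once to be output three times...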
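One way to check the nested-table guess is sketched below. It relies on the fact that `select("table")` and `select("tr")` match descendants at any depth, so a row is reached once for every table that encloses it. The markup, the class name `NestedTableCheck`, and the method `enclosingTables` are made up for illustration; they are not taken from the real file.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NestedTableCheck {

    // Counts the <table> ancestors of a row. A loop of the form
    // "for each table: for each table.select("tr")" reaches the row
    // once per enclosing table, so this is also its number of visits.
    static int enclosingTables(Element row) {
        int count = 0;
        for (Element parent : row.parents()) {
            if (parent.tagName().equals("table")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical markup: one data row wrapped in two layout tables.
        String html = "<table><tr><td><table><tr><td>"
                + "<table><tr><td>Food</td><td>£688.95</td></tr></table>"
                + "</td></tr></table></td></tr></table>";
        Document doc = Jsoup.parse(html);

        // Every table is selected, including the nested ones.
        System.out.println("tables found: " + doc.select("table").size());

        // The innermost row comes last in document order.
        Element dataRow = doc.select("tr").last();
        System.out.println("visits to data row: " + enclosingTables(dataRow));

        // The outermost row "contains" every nested cell, which would explain
        // a first output line with all values run together.
        System.out.println("outer row cells: "
                + doc.select("tr").first().getElementsByTag("td").size());
    }
}
```

If the real file reports more than one enclosing table for the numeric rows, nesting would account for both the repeated output and the single long first line.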

Edit

HTML snippet:

<tr bgcolor = "#EEEEEE" height = 20 >
<td width = 15% >
<font face="tahoma" size="1">
Dept '<b>Food Incl Vat</b>'
</td>
<td width = 10% align =
right><font face="tahoma" size="1">
£688.95
</td>
<td width = 10% align =
right><font face="tahoma" size="1">
£642.60
</td>
<td width = 10% align =
right><font face="tahoma" size="1">
£767.95
</td>
<td width = 10% align =
right><font face="tahoma" size="1">
£3,007.00
</td>
<td width = 10% align =
right><font face="tahoma" size="1">
£1,525.60
</td>
<td width = 10% align =
right><font face="tahoma" size="1">
£1,970.40
</td>
<td width = 10% align =
right><font face="tahoma" size="1">
£353.00
</td>
<td width = 1%></td><td width
= 14% align = right bgcolor = "#DFDFDF"><font face="tahoma" size="1" color = '#444444'>
<b>£8,955.50</b></td>
</tr>

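1 answer:

Answer 0 (score: 1)

Edit: Sorry, there was an error in the code. It's fixed now.

I don't really have enough code to make a solid guess, but I'm not sure why you're getting the set of tables and then walking through the data however many times .size() gives you (I'm guessing 3-4). What you want to do is find the root of the tables. Under that root you will find the tables themselves (their class names should all be the same), and you can then search each table for whatever you want to find. Maybe some code will help :)

HTML: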

    <ul class="ListOfTables">
           <li class="TABLE">
                 <span class="item">
           <li class="TABLE">
                 <span class="item">
           <li class="TABLE">
                 <span class="item">
           <li class="TABLE">
                 <span class="item">

Java code:

public void searchForItems(Document doc)
{
    Elements tables = doc.select("li[class=TABLE]");
    for (Element table : tables)
    {

        String item;
        Elements itemsInTable = table.select("span[class=item]");
        item = itemsInTable.text();


        //Write the item to file. Depending on what is in your table, you might
        //have to write a more complex scan. Looking for things like attributes
    }
}
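
If the page really does nest its tables, the same idea can be applied to plain `<table>` markup. The sketch below assumes the data sits in the innermost tables and that the outer ones are only layout wrappers, which I can't confirm from the snippet. It uses jsoup's `:not` and `:has` selectors to skip any table that contains another table; the class name `InnermostTables`, the method `rowsAsCsv`, and the sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class InnermostTables {

    // Builds one comma-joined line per row, visiting each data row once.
    static List<String> rowsAsCsv(Document doc) {
        List<String> lines = new ArrayList<>();
        // Match only tables that contain no other table (assumed to hold the data).
        for (Element table : doc.select("table:not(:has(table))")) {
            for (Element row : table.select("tr")) {
                List<String> cells = new ArrayList<>();
                for (Element cell : row.select("td")) {
                    // Drop thousands separators so "£3,007.00" stays one CSV field.
                    cells.add(cell.text().replace(",", ""));
                }
                lines.add(String.join(",", cells));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical markup: a data table wrapped in two layout tables.
        String html = "<table><tr><td><table><tr><td>"
                + "<table><tr><td>Food</td><td>£3,007.00</td></tr></table>"
                + "</td></tr></table></td></tr></table>";
        Document doc = Jsoup.parse(html);

        // Each line can then be written to the .csv file exactly once.
        for (String line : rowsAsCsv(doc)) {
            System.out.println(line);
        }
    }
}
```

If some data tables themselves contain nested tables, this selector would skip them, so it is worth checking the result against the real file first.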