HTML table to Java object

Time: 2015-06-25 13:59:08

Tags: java html xml parsing xml-parsing

I need to convert an HTML table into a Java object, and so far I have not found a good way to do it. A sample of the table:

<table id='table'>
  <tr>
    <th>name</th>
    <th>address</th>
  </tr>
  <tr>
    <td>
      <a href=''>name1</a>
    </td>
    <td>
      <a href=''>Address</a>
    </td>
  </tr>
 </table>

I would like to map the table to the following object:

public class myClass {
    public String name;
    public String address;
}

I would be very grateful if someone could help me with this.

3 answers:

Answer 0 (score: 2)

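I think in your case you want Jsoup, a good Java library for parsing web pages. Once you have used Jsoup's selectors to pull the data you want out of the web page, creating Java objects from it should be trivial. Something like this:

import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

File input = new File("table.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://somewebsite.com/");

// Row 0 holds the <th> headers, so the first data row is at index 1.
Elements cells = doc.select("#table tr").get(1).select("td");

// Assumes MyClass has a (String name, String address) constructor.
MyClass table1 = new MyClass(cells.get(0).text(), cells.get(1).text());

The row index and cell order above are based on the sample table in the question; check the Jsoup selector documentation for anything more involved. I hope that helps.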

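Answer 1 (score: 2)

My answer may not be useful to the author of this question (I am three years late, so the timing is probably off), but I think it may help the many other developers who come across this question.

Today I released (on behalf of my company) a complete HTML-to-POJO framework that lets you map HTML to any POJO class using simple annotations. The library is pluggable and offers a number of other features. You can check it out here: https://github.com/whimtrip/jwht-htmltopojo

How to use: the basics

Imagine we need to parse the following HTML page: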

<html>
    <head>
        <title>A Simple HTML Document</title>
    </head>
    <body>
        <div class="restaurant">
            <h1>A la bonne Franquette</h1>
            <p>French cuisine restaurant for gourmet of fellow french people</p>
            <div class="location">
                <p>in <span>London</span></p>
            </div>
            <p>Restaurant n*18,190. Ranked 113 out of 1,550 restaurants</p>  
            <div class="meals">
                <div class="meal">
                    <p>Veal Cutlet</p>
                    <p rating-color="green">4.5/5 stars</p>
                    <p>Chef Mr. Frenchie</p>
                </div>

                <div class="meal">
                    <p>Ratatouille</p>
                    <p rating-color="orange">3.6/5 stars</p>
                    <p>Chef Mr. Frenchie and Mme. French-Cuisine</p>
                </div>

            </div> 
        </div>    
    </body>
</html>

Let's create the POJO we want to map it to:

public class Restaurant {

    @Selector( value = "div.restaurant > h1")
    private String name;

    @Selector( value = "div.restaurant > p:nth-child(2)")
    private String description;

    @Selector( value = "div.restaurant > div:nth-child(3) > p > span")    
    private String location;    

    @Selector(
        value = "div.restaurant > p:nth-child(4)",
        format = "^Restaurant n\\*([0-9,]+). Ranked ([0-9,]+) out of ([0-9,]+) restaurants$",
        indexForRegexPattern = 1,
        useDeserializer = true,
        deserializer = ReplacerDeserializer.class,
        preConvert = true,
        postConvert = false
    )
    // Strip the thousands separator so that "18,190" parses as a number.
    @ReplaceWith(value = ",", with = "")
    private Long id;

    @Selector(
        value = "div.restaurant > p:nth-child(4)",
        format = "^Restaurant n\\*([0-9,]+). Ranked ([0-9,]+) out of ([0-9,]+) restaurants$",
        // This time we want the second regex group rather than the first.
        indexForRegexPattern = 2,
        useDeserializer = true,
        deserializer = ReplacerDeserializer.class,
        preConvert = true,
        postConvert = false
    )
    // Strip any thousands separator so that the captured text parses as a number.
    @ReplaceWith(value = ",", with = "")
    private Integer rank;

    @Selector(value = ".meal")    
    private List<Meal> meals;

    // getters and setters

}

And now the Meal class:

public class Meal {

    @Selector(value = "p:nth-child(1)")
    private String name;

    @Selector(
        value = "p:nth-child(2)",
        format = "^([0-9.]+)/5 stars$",
        indexForRegexPattern = 1
    )
    private Float stars;

    @Selector(
        value = "p:nth-child(2)",
        // rating-color custom attribute can be used as well
        attr = "rating-color"
    )
    private String ratingColor;

    @Selector(
        value = "p:nth-child(3)"
    )
    private String chefs;

    // getters and setters.
}

We provide more explanation of the code above on the GitHub page.
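To make the combination of format, indexForRegexPattern and @ReplaceWith more concrete, here is a minimal sketch of the equivalent extraction using only java.util.regex. It is not the library's implementation, just an illustration of the intended result; the class name RegexGroupDemo and the method extract are hypothetical, and the literal dot is escaped here for strictness.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jwht-htmltopojo.
class RegexGroupDemo {

    // Same pattern as the `format` attribute, with '*' and '.' escaped.
    private static final Pattern FORMAT = Pattern.compile(
        "^Restaurant n\\*([0-9,]+)\\. Ranked ([0-9,]+) out of ([0-9,]+) restaurants$");

    /** Returns the requested capture group with the "," separators removed. */
    static long extract(String text, int groupIndex) {
        Matcher m = FORMAT.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("Text does not match format: " + text);
        }
        // Mirrors @ReplaceWith(value = ",", with = "") followed by numeric conversion.
        return Long.parseLong(m.group(groupIndex).replace(",", ""));
    }

    public static void main(String[] args) {
        String text = "Restaurant n*18,190. Ranked 113 out of 1,550 restaurants";
        System.out.println(extract(text, 1)); // group 1: the restaurant id
        System.out.println(extract(text, 2)); // group 2: the rank
    }
}
```

Group 1 corresponds to the id field and group 2 to the rank field of the Restaurant class above.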

For now, let's see how to scrape it.

private static final String MY_HTML_FILE = "my-html-file.html";

public static void main(String[] args) throws IOException {

    HtmlToPojoEngine htmlToPojoEngine = HtmlToPojoEngine.create();

    HtmlAdapter<Restaurant> adapter = htmlToPojoEngine.adapter(Restaurant.class);

    // If there were several restaurants on the same page,
    // you would need to create a parent POJO containing
    // a list of Restaurants, as shown with the meals here.
    Restaurant restaurant = adapter.fromHtml(getHtmlBody());

    // That's it, do some magic now!

}

// Reads the whole HTML file into a String, decoded as UTF-8.
private static String getHtmlBody() throws IOException {
    byte[] encoded = Files.readAllBytes(Paths.get(MY_HTML_FILE));
    return new String(encoded, StandardCharsets.UTF_8);
}

Another short example can be found here.

Hope this helps someone!

Answer 2 (score: 0)

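One solution is to use XSLT. To turn the HTML data into Java objects, you can proceed as follows:

  • Transform the HTML into an XML document using XSLT rules.
  • Deserialize the XML into Java objects.

This solution does not work if your HTML page contains unclosed tags, so you would need to run it through a validation library before the first step. When your HTML is invalid, Jsoup is easier to use; @coderrick describes how in another answer.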
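The two steps can be sketched with the JDK's built-in XSLT and DOM APIs. This is only one possible reading of the approach, assuming the HTML is well-formed XML like the table in the question; the class names XsltTableMapper and Person, the stylesheet, and the intermediate rows/row element names are all made up for illustration. A binding library could replace the manual DOM walk in step 2.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: XSLT turns the table into simple XML, which is then read into objects.
class XsltTableMapper {

    /** Plain holder matching the two table columns. */
    static class Person {
        final String name;
        final String address;
        Person(String name, String address) { this.name = name; this.address = address; }
    }

    // Step 1 stylesheet: each <tr> containing <td> cells becomes one <row> element.
    private static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='/'>"
      + "<rows><xsl:for-each select='//tr[td]'>"
      + "<row>"
      + "<name><xsl:value-of select='normalize-space(td[1])'/></name>"
      + "<address><xsl:value-of select='normalize-space(td[2])'/></address>"
      + "</row>"
      + "</xsl:for-each></rows>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static List<Person> parse(String wellFormedHtml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        Document xml = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        transformer.transform(new StreamSource(new StringReader(wellFormedHtml)), new DOMResult(xml));

        // Step 2: read the intermediate XML into Java objects.
        NodeList rows = xml.getElementsByTagName("row");
        List<Person> people = new ArrayList<>();
        for (int i = 0; i < rows.getLength(); i++) {
            Element row = (Element) rows.item(i);
            people.add(new Person(
                row.getElementsByTagName("name").item(0).getTextContent(),
                row.getElementsByTagName("address").item(0).getTextContent()));
        }
        return people;
    }

    public static void main(String[] args) throws Exception {
        String html = "<table id='table'>"
            + "<tr><th>name</th><th>address</th></tr>"
            + "<tr><td><a href=''>name1</a></td><td><a href=''>Address</a></td></tr>"
            + "</table>";
        for (Person p : parse(html)) {
            System.out.println(p.name + " / " + p.address);
        }
    }
}
```

The header row is skipped because it contains th cells rather than td cells. Input with unclosed tags makes the XML parser throw, which is the limitation described above.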