Question

I am trying to get the body content of an HTML page.

Suppose this is the HTML file:

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <link href="../Styles/style.css" rel="STYLESHEET" type="text/css" />

  <title></title>
</head>

<body>
<p> text 1 </p>
<p> text 2 </p>
</body>
</html>

What I want is:

<p> text 1 </p> 
<p> text 2 </p>

So I thought a SAXParser would do it (if you know a simpler way, please tell me).

Here is my code, but I always get null as the body content:

private final String HTML_NAME_SPACE = "http://www.w3.org/1999/xhtml";
private final String HTML_TAG = "html";
private final String BODY_TAG = "body";
public static void parseHTML(InputStream in, ContentHandler handler) throws IOException, SAXException, ParserConfigurationException
{
    if(in != null)
    {
        try
        {
            SAXParserFactory parseFactory = SAXParserFactory.newInstance();
            XMLReader reader = parseFactory.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            InputSource source = new InputSource(in);
            source.setEncoding("UTF-8");
            reader.parse(source);
        }
        finally
        {
            in.close();
        }
    }
}

public ContentHandler constrauctHTMLContentHandler()
{
    RootElement root = new RootElement(HTML_NAME_SPACE, HTML_TAG);
    root.setStartElementListener(new StartElementListener()
    {
        @Override
        public void start(Attributes attributes)
        {
            String body = attributes.getValue(BODY_TAG);
            Log.d("html parser", "body: " + body);
        }
    });
    return root.getContentHandler();
}

Then:

parseHTML(inputStream, constrauctHTMLContentHandler()); // inputStream is html file as stream

What is wrong with this code?

Answer 1

How about using Jsoup? Your code could look like this:

Document doc = Jsoup.parse(html);
Elements elements = doc.select("body").first().children();
//Elements elements = doc.select("p");//or only `<p>` elements
for (Element el : elements)
    System.out.println("element: "+el);
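
As for why the SAX attempt logs null: `attributes.getValue(BODY_TAG)` looks up an attribute named `body` on the `<html>` element, but `<body>` is a child element, not an attribute, so there is nothing to return. If you would rather stay with the JDK's SAX parser than add Jsoup, the following is a minimal sketch under a few assumptions: the input is well-formed XHTML like the file in the question, the class name `BodyCollector` and the method `extractBody` are made up for illustration, and the DTD is deliberately not loaded (so named entities such as `&nbsp;` would not resolve). It re-serializes the SAX events seen between `<body>` and `</body>`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the markup found inside <body>...</body>.
class BodyCollector extends DefaultHandler {
    private final StringBuilder body = new StringBuilder();
    private boolean insideBody = false;

    // Answer the request for the external XHTML DTD with an empty document,
    // so parsing does not depend on fetching it from w3.org.
    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        return new InputSource(new StringReader(""));
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (insideBody) {
            body.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                body.append(' ').append(atts.getQName(i))
                    .append("=\"").append(escape(atts.getValue(i))).append('"');
            }
            body.append('>');
        } else if ("body".equals(qName)) {
            insideBody = true; // start recording after the <body> tag itself
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("body".equals(qName)) {
            insideBody = false;
        } else if (insideBody) {
            // Note: empty elements such as <br/> come out as <br></br>.
            body.append("</").append(qName).append('>');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (insideBody) {
            body.append(escape(new String(ch, start, length)));
        }
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Parses the stream and returns the trimmed inner markup of <body>.
    static String extractBody(InputStream in) {
        BodyCollector handler = new BodyCollector();
        try (InputStream stream = in) {
            SAXParserFactory.newInstance().newSAXParser().parse(stream, handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the XHTML input", e);
        }
        return handler.body.toString().trim();
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title></title></head>"
                + "<body><p> text 1 </p><p> text 2 </p></body></html>";
        InputStream in = new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8));
        System.out.println(extractBody(in));
    }
}
```

This only works for well-formed XML. Real-world HTML is usually not, which is the main reason to prefer Jsoup.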

Answer 2

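I'm not sure how you are fetching the HTML. If it is a local file, you can load it directly into Jsoup. If you have to fetch it from a URL, I usually use Apache's HttpClient. There is a quick start guide for HttpClient that does a good job of getting you started.

This lets you get the data back with something like the following: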

HttpClient client = new DefaultHttpClient();
HttpPost post = new HttpPost(URL);
//
// here you can do things like add parameters used when connecting to the remote site    
//
HttpResponse response = client.execute(post);
BufferedReader rd = new BufferedReader(new InputStreamReader(response.getEntity().getContent()));
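
The snippet above stops at the `BufferedReader`, while `Jsoup.parse` takes a `String`. One way to bridge the two with the JDK alone is to drain the reader into a `StringBuilder`. This is a sketch, not part of HttpClient: the class name `ResponseReader` and the method `readAll` are hypothetical. Since the file in the question declares `utf-8`, it is probably also worth passing `StandardCharsets.UTF_8` to the `InputStreamReader` rather than relying on the platform default.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper: reads everything a Reader provides into one String.
class ResponseReader {
    static String readAll(Reader reader) {
        StringBuilder html = new StringBuilder();
        // try-with-resources closes the reader (and the stream under it) when done
        try (BufferedReader rd = new BufferedReader(reader)) {
            char[] buffer = new char[4096];
            int read;
            while ((read = rd.read(buffer)) != -1) {
                html.append(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return html.toString();
    }

    public static void main(String[] args) {
        // A StringReader stands in for the InputStreamReader over the HTTP response.
        String html = readAll(new StringReader("<p> text 1 </p>\n<p> text 2 </p>"));
        System.out.println(html);
    }
}
```

With the response in hand, the call would look like `readAll(new InputStreamReader(response.getEntity().getContent(), StandardCharsets.UTF_8))`, and the resulting string can be handed to Jsoup.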

Then (as Pshemo suggested) I use Jsoup to parse and extract the data:

Document document = Jsoup.parse(HTML);
// OR
Document doc = Jsoup.parseBodyFragment(HTML);
Elements elements = doc.select("p");  // p for <p>text</p>

Getting the body content of an HTML file in Java

2 answers: