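How to avoid the surrounding html/head/body tags when parsing with Jsoup

Time: 2014-10-03 05:36:51

Tags: java html parsing jsoup

I am trying to parse some given HTML content with Jsoup. After Jsoup.parse(), the HTML output has html, head, and body tags added around the input. I just want to leave those out.

Sample input: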

<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>

Java code:

import java.io.File;
import java.io.IOException;

import org.apache.commons.io.FileUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HTMLParse {

    public static void main(String[] args) throws IOException {
        // Read the whole HTML file into a string.
        // UTF-8 is assumed here; the original passed null (platform default).
        File input = new File("/ab.html");
        String html = FileUtils.readFileToString(input, "UTF-8");

        Document doc = Jsoup.parseBodyFragment(html);
        doc.outputSettings().prettyPrint(false);
        // Prints the full document, including the html/head/body wrapper
        System.out.println(doc.html());
    }
}

Actual output:

<html><head></head><body><p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>
    </body></html>

Expected output:

<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>

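Please help.

4 answers:

Answer 0 (score: 16)

Cause:

parseBodyFragment(), like all the other parse() methods, uses the default HTML parser. That parser always adds the HTML shell (<html>…</html>, <head>…</head>, and so on).

Solution:

Don't use the HTML parser; use the XML parser instead ;-)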

Document doc = Jsoup.parse(html, "", Parser.xmlParser());

Replace that single line and your problem is solved.

Example

final String html = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";

Document docHtml = Jsoup.parse(html);
Document docXml = Jsoup.parse(html, "", Parser.xmlParser());

System.out.println("******* HTML *******\n" + docHtml);
System.out.println();
System.out.println("*******  XML *******\n" + docXml);

Output:

******* HTML *******
<html>
 <head></head>
 <body>
  <p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>
 </body>
</html>

*******  XML *******
<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>

Answer 1 (score: 5)

To get the expected output, all you actually need is:

final String html = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";
Document doc = Jsoup.parseBodyFragment(html);
doc.outputSettings().prettyPrint(false);

System.out.println(doc.body().html());

Answer 2 (score: 3)

You can try using the XML parser, but this doesn't always work because HTML is not always XML; it often has unterminated tags like <img> and <br>. It's better to stick with the HTML parser. You can rely on there being <html>, <head>, and <body> tags, and they are easy to discard. Just get your fragment of HTML by selecting the body tag and asking for its HTML.
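The last step described in this answer, selecting the body and asking for its HTML, might look like the minimal sketch below. The class name BodyFragmentDemo and the helper extractBodyHtml are made up for illustration and are not from the answer; the sketch assumes the jsoup library is on the classpath and uses only Jsoup.parseBodyFragment, outputSettings().prettyPrint(false), and body().html().

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class; the names are illustrative, not from the answer.
class BodyFragmentDemo {

    // Parses with the regular HTML parser, then returns only what is inside <body>.
    static String extractBodyHtml(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        doc.outputSettings().prettyPrint(false); // keep the markup as written, no re-indenting
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";
        System.out.println(extractBodyHtml(html));
    }
}
```

Because the HTML parser still does the parsing, unterminated tags such as <br> are handled leniently, which is the reason this answer prefers it over the XML parser.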

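Answer 3 (score: 0)

You can also use Jsoup.parse with the HTML parser. All you need to do is strip off the html/body wrapper.

This can be done by selecting the body element and unwrapping it: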

String input = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";
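// unwrap() removes <body> but keeps its children, and returns the first child node
Node content = Jsoup.parse(input).body().unwrap();
// html() is defined on Element, not Node; outerHtml() renders the node with its own tag
System.out.println(content.outerHtml());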

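body() selects the body element, and unwrap() removes the body itself while keeping its content. Note that unwrap() returns only the first child, so this works when the fragment has a single top-level element, as it does here.

So the output is: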

<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>