I'm trying to parse Javadocs with Jsoup, but I can't extract the text contained in the blockquote tags.
Here is a sample of the HTML I'm trying to parse:
<P>
The <code>String</code> class represents character strings. All
string literals in Java programs, such as <code>"abc"</code>, are
implemented as instances of this class.
<p>
Strings are constant; their values cannot be changed after they
are created. String buffers support mutable strings.
Because String objects are immutable they can be shared. For example:
<p><blockquote><pre>
String str = "abc";
</pre></blockquote><p>
is equivalent to:
<p><blockquote><pre>
char data[] = {'a', 'b', 'c'};
String str = new String(data);
</pre></blockquote><p>
Here are some more examples of how strings can be used:
<p><blockquote><pre>
System.out.println("abc");
String cde = "cde";
System.out.println("abc" + cde);
String c = "abc".substring(2,3);
String d = cde.substring(1, 2);
</pre></blockquote>
<p>
I'm using this code to parse the text contained in the p tags:
Document doc = Jsoup.parse(new File("/home/facetoe/ebooks/Java/docs/api/java/lang/String.html"), "UTF-8");
Elements para = doc.getElementsByTag("P");
for (Element element : para) {
    System.out.println(element);
}
However, no matter what I try, the text contained in the blockquote tags disappears.
Here is a sample of the output I get:
<p> The <code>String</code> class represents character strings. All string literals in Java programs, such as <code>"abc"</code>, are implemented as instances of this class. </p>
<p> Strings are constant; their values cannot be changed after they are created. String buffers support mutable strings. Because String objects are immutable they can be shared. For example: </p>
<p></p>
<p> is equivalent to: </p>
<p></p>
<p> Here are some more examples of how strings can be used: </p>
<p></p>
<p> The class <code>String</code> includes methods for examining individual characters of the sequence, for comparing strings, for searching strings, for extracting substrings, and for creating a copy of a string with all characters translated to uppercase or to lowercase. Case mapping is based on the Unicode Standard version specified by the <a href="../../java/lang/Character.html" title="class in java.lang"><code>Character</code></a> class. </p>
<p> The Java language provides special support for the string concatenation operator ( + ), and for conversion of other objects to strings. String concatenation is implemented through the <code>StringBuilder</code>(or <code>StringBuffer</code>) class and its <code>append</code> method. String conversions are implemented through the method <code>toString</code>, defined by <code>Object</code> and inherited by all classes in Java. For additional information on string concatenation and conversion, see Gosling, Joy, and Steele, <i>The Java Language Specification</i>. </p>
<p> Unless otherwise noted, passing a <tt>null</tt> argument to a constructor or method in this class will cause a <a href="../../java/lang/NullPointerException.html" title="class in java.lang"><code>NullPointerException</code></a> to be thrown. </p>
<p>A <code>String</code> represents a string in the UTF-16 format in which <em>supplementary characters</em> are represented by <em>surrogate pairs</em> (see the section <a href="Character.html#unicode">Unicode Character Representations</a> in the <code>Character</code> class for more information). Index values refer to <code>char</code> code units, so a supplementary character uses two positions in a <code>String</code>. </p>
<p>The <code>String</code> class provides methods for dealing with Unicode code points (i.e., characters), in addition to those for dealing with Unicode code units (i.e., <code>char</code> values). </p>
<p> </p>
It's as if Jsoup simply drops everything inside the blockquote tags. Does anyone know how to keep these tags and extract the text from them?
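Answer 0 (score: 1)
The reason is that Jsoup builds the DOM so that the blockquote elements sit outside the paragraphs. You can see this by printing the doc object. I believe a blockquote start tag automatically terminates the preceding p element (the closing p tag is optional). If you load the HTML in a modern browser and inspect the elements, you will observe the same thing.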
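A quick way to check this is to ask Jsoup for the parent of a blockquote. The following is a minimal sketch, assuming Jsoup is on the classpath; the class name ParentCheck, the method name blockquoteParentTag, and the shortened HTML string are made up for illustration and are not part of the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParentCheck {

    // Hypothetical helper: returns the tag name of the first <blockquote>'s parent.
    static String blockquoteParentTag(String html) {
        Document doc = Jsoup.parse(html);
        Element quote = doc.select("blockquote").first();
        return quote.parent().tagName();
    }

    public static void main(String[] args) {
        // Shortened stand-in for the Javadoc markup in the question.
        String html = "<p><blockquote><pre>String str = \"abc\";</pre></blockquote>";

        // The parser closes the open <p> before it inserts the <blockquote>,
        // so the blockquote becomes a child of <body>, not of <p>.
        System.out.println(blockquoteParentTag(html));

        // Printing the body shows the empty <p></p> followed by the blockquote.
        System.out.println(Jsoup.parse(html).body().html());
    }
}
```

The empty <p></p> elements in the output shown in the question are these prematurely closed paragraphs.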
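See also the HTML 4.01 specification: "The P element represents a paragraph. It cannot contain block-level elements (including P itself)." I'm sure HTML5 has something similar.
So by iterating over the paragraphs only, you miss the blockquotes, which are not contained in them.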
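One way to pick up both kinds of element is to select paragraphs and blockquotes together. This is a sketch under the same assumption that Jsoup is available; BlockquoteExample and extractBlocks are illustrative names, and the HTML is a shortened stand-in for the real Javadoc page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class BlockquoteExample {

    // Hypothetical helper: collects the text of every non-empty <p> and
    // <blockquote>, in the order the elements appear in the document.
    static List<String> extractBlocks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> blocks = new ArrayList<>();
        for (Element element : doc.select("p, blockquote")) {
            String text = element.text();
            if (!text.isEmpty()) { // skip the empty <p></p> left by the parser
                blocks.add(text);
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        String html = "<p>For example:\n"
                + "<p><blockquote><pre>\n"
                + "String str = \"abc\";\n"
                + "</pre></blockquote><p>\n"
                + "is equivalent to:";
        for (String block : extractBlocks(html)) {
            System.out.println(block);
        }
    }
}
```

With the code from the question, replacing doc.getElementsByTag("P") with doc.select("p, blockquote") should be enough to get the quoted snippets back alongside the paragraphs.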
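Answer 1 (score: 0)
Looking at the Jsoup documentation for the parse methods, it looks like they use a whitelist mechanism to decide what is safe and what is not. Maybe you need to set a whitelist before parsing? That seems to apply only to the clean method, though, so it may be something else.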
Answer 2 (score: 0)
You never close your <p> tags, which may be the problem.