jsoup does not return complete nodes including all of their child nodes

Date: 2015-01-05 06:42:57

Tags: java parsing nested jsoup elements

I have a sample HTML document like this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/loose.dtd">
 <html lang="en">
<head>
<title>example.com</title>
</head>
<body>

<div>
    <ul class="mb10">
        <li><input class="ript" name="pmtmthd" value="NOLINK"
            type="radio" id="NOLINK" reqType="ChgPaymentMtd" nodsb="true">
            <label for="NOLINK"><img
                src="https://example.com/example1.gif"
                height="23" width="147" alt="Credit Card">
                <div class="v10777" style="margin-left: 20px">Processed</div> 
          </label> </input>
        </li>
        <li><input class="ript" name="pmtmthd" value="SPLLINK"
            type="radio" id="SPLLINK" reqType="ChgPaymentMtd" nodsb="true"
            checked="checked"> <label for="SPLLINK"><img
                src="https://example.com/example2.gif"
                height="19" width="73" alt="spllink">
                </label> </input>
        </li>
      </ul>
   </div>
</body>
</html>

I am trying to extract all of the radio button elements:

List<Element> radioElements = doc.getElementsByAttributeValue("type", "radio");

The output contains none of the child elements, as shown here:

<input class="ript" name="pmtmthd" value="NOLINK" type="radio" id="NOLINK" reqType="ChgPaymentMtd" nodsb="true" />

<input class="ript" name="pmtmthd" value="SPLLINK" type="radio" id="SPLLINK" reqType="ChgPaymentMtd" nodsb="true" checked="checked" />

How can I get all of the radio elements with all of their children intact?

1 answer:

Answer 0 (score: 1):

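Jsoup tries to normalize the HTML in order to correct invalid markup. Putting content inside an input tag is invalid HTML: input is a void element, which can carry attributes but no children. The default HTML parser therefore closes each input immediately, so the label is no longer its child (it is parsed as a following sibling instead). If you want to prevent this normalization, you can use a different parser: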

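// Parser is org.jsoup.parser.Parser; the XML parser keeps the nesting exactly as written.
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
// Select every element whose type attribute is "radio", children included.
Elements radios = doc.getElementsByAttributeValue("type", "radio");
System.out.println(radios);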

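Output (note that the XML parser applies no HTML rules, so the unclosed first img ends up wrapping the div):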

<input class="ript" name="pmtmthd" value="NOLINK" type="radio" id="NOLINK" reqtype="ChgPaymentMtd" nodsb="true"><label for="NOLINK"><img src="https://example.com/example1.gif" height="23" width="147" alt="Credit Card">
   <div class="v10777" style="margin-left: 20px">
    Processed
   </div></img></label></input>
<input class="ript" name="pmtmthd" value="SPLLINK" type="radio" id="SPLLINK" reqtype="ChgPaymentMtd" nodsb="true" checked="checked"><label for="SPLLINK"><img src="https://example.com/example2.gif" height="19" width="73" alt="spllink" /></label></input>
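
If switching to the XML parser is not desirable, the following is a minimal sketch of an alternative that is not part of the original answer: it keeps the default HTML parser and pairs each radio input with the label whose for attribute matches the input's id. The class name RadioLabels and the method labelsByRadioId are hypothetical names chosen for illustration, and the sketch assumes every radio input has an id with a matching label, as in the sample HTML.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper; the class and method names are illustrative only.
public class RadioLabels {

    // Maps each radio input's id to the <label> element whose "for" attribute equals that id.
    public static Map<String, Element> labelsByRadioId(String html) {
        Document doc = Jsoup.parse(html); // default HTML parser, normalization stays on
        Map<String, Element> result = new LinkedHashMap<>();
        for (Element radio : doc.getElementsByAttributeValue("type", "radio")) {
            String id = radio.id();
            if (id.isEmpty()) {
                continue; // nothing to match a label against
            }
            for (Element label : doc.getElementsByTag("label")) {
                if (id.equals(label.attr("for"))) {
                    result.put(id, label);
                    break; // first matching label wins
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Trimmed-down version of the sample markup from the question.
        String html = "<ul><li><input type=\"radio\" id=\"NOLINK\">"
                + "<label for=\"NOLINK\"><img src=\"https://example.com/example1.gif\" alt=\"Credit Card\">"
                + "<div class=\"v10777\">Processed</div></label></input></li></ul>";
        labelsByRadioId(html).forEach((id, label) ->
                System.out.println(id + " -> " + label.outerHtml()));
    }
}
```

Each returned label still holds its own img and div children, so the input and its associated content can be handled together without relying on the invalid nesting.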