I have a sample HTML document as follows:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/loose.dtd">
<html lang="en">
<head>
<title>example.com</title>
</head>
<body>
<div>
<ul class="mb10">
<li><input class="ript" name="pmtmthd" value="NOLINK"
type="radio" id="NOLINK" reqType="ChgPaymentMtd" nodsb="true">
<label for="NOLINK"><img
src="https://example.com/example1.gif"
height="23" width="147" alt="Credit Card">
<div class="v10777" style="margin-left: 20px">Processed</div>
</label> </input>
</li>
<li><input class="ript" name="pmtmthd" value="SPLLINK"
type="radio" id="SPLLINK" reqType="ChgPaymentMtd" nodsb="true"
checked="checked"> <label for="SPLLINK"><img
src="https://example.com/example2.gif"
height="19" width="73" alt="spllink">
</label> </input>
</li>
</ul>
</div>
</body>
</html>
I am trying to extract all the radio elements:
List<Element> radioElements = doc.getElementsByAttributeValue("type", "radio");
The output does not contain any of the child elements, as shown below:
<input class="ript" name="pmtmthd" value="NOLINK" type="radio" id="NOLINK" reqType="ChgPaymentMtd" nodsb="true" />
<input class="ript" name="pmtmthd" value="SPLLINK" type="radio" id="SPLLINK" reqType="ChgPaymentMtd" nodsb="true" checked="checked" />
How can I get all the radio elements with all of their children intact?
Answer 0 (score: 1)
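Jsoup normalizes the HTML it parses, correcting invalid markup along the way. Placing content inside an input tag is invalid HTML (input is a void element: it allows attributes but no children), so the parser does not keep that content inside the input. If you want to prevent this normalization, you can use a different parser, such as Jsoup's XML parser: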
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
Elements radios = doc.getElementsByAttributeValue("type", "radio");
System.out.println(radios);
Output:
<input class="ript" name="pmtmthd" value="NOLINK" type="radio" id="NOLINK" reqtype="ChgPaymentMtd" nodsb="true"><label for="NOLINK"><img src="https://example.com/example1.gif" height="23" width="147" alt="Credit Card">
<div class="v10777" style="margin-left: 20px">
Processed
</div></img></label></input>
<input class="ript" name="pmtmthd" value="SPLLINK" type="radio" id="SPLLINK" reqtype="ChgPaymentMtd" nodsb="true" checked="checked"><label for="SPLLINK"><img src="https://example.com/example2.gif" height="19" width="73" alt="spllink" /></label></input>
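If switching to the XML parser is not desirable, one possible alternative is to keep the default HTML parser and pair each radio input with its label through the label's for attribute. The sketch below is only an illustration under assumptions: the class name RadioLabels and the helper labelFor are hypothetical (not part of Jsoup), the markup is a shortened version of the HTML in the question, and the ids are assumed to be simple values that are safe to place in a CSS selector.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class RadioLabels {

    // Hypothetical helper (not part of Jsoup): returns the first <label> whose
    // "for" attribute matches the radio's id, or null if there is none.
    // Assumes the id contains no characters that need escaping in a selector.
    static Element labelFor(Document doc, Element radio) {
        return doc.select("label[for=" + radio.id() + "]").first();
    }

    public static void main(String[] args) {
        // Shortened version of the markup from the question.
        String html = "<ul><li><input type=\"radio\" id=\"NOLINK\" name=\"pmtmthd\" value=\"NOLINK\">"
                + "<label for=\"NOLINK\"><img src=\"https://example.com/example1.gif\" alt=\"Credit Card\">"
                + "<div>Processed</div></label></input></li></ul>";

        // Default HTML parser: the input is treated as a void element,
        // so the label is not kept as one of its children.
        Document doc = Jsoup.parse(html);

        for (Element radio : doc.select("input[type=radio]")) {
            Element label = labelFor(doc, radio);
            System.out.println(radio.outerHtml());
            if (label != null) {
                // The label still carries the img and div from the source.
                System.out.println(label.outerHtml());
            }
        }
    }
}
```

This keeps Jsoup's normal HTML handling for the rest of the document, at the cost of relying on every radio having a matching label with a for attribute.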