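I am trying to read a website (HTML) with Java's DocumentBuilder. It reads fine, but when the page contains an HTML entity such as &pound; or &ldquo; (or any other HTML special character), it stops reading anything after the special character and returns null instead. Many others have asked similar questions, but none of them received a constructive answer. If anyone knows a way around this, please let me know. My HTML and code are below.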
<html>
<body>
<p>
It has increased from &pound;488 to &pound;600</p>
<p>
Ronals said: &ldquo;The schools in this area are going downhill&rdquo;</p>
</body>
</html>
To read this, I wrote the code below.
private String extractTheTitle(String responseBody) throws Exception {
    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    ByteArrayInputStream encXML = new ByteArrayInputStream(responseBody.getBytes("UTF8"));
    Document embeddedDoc = builder.parse(encXML);
    NodeList titleNodes = embeddedDoc.getElementsByTagName("p");
    if (titleNodes != null && titleNodes.getLength() > 0) {
        for (int i = 0; i < titleNodes.getLength(); i++) {
            Element aTitleElement = (Element) titleNodes.item(i);
            aTitleElement.normalize();
            // Only the first child node of each <p> is read here
            Node titleContent = aTitleElement.getFirstChild();
            String nodeText = titleContent.getNodeValue();
            // myArrlist is a list field of the enclosing class (declaration not shown)
            myArrlist.add(i, "<p>" + nodeText + "</p>");
        }
    }
}
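The code above outputs nothing after &pound; or &ldquo;. I have tried many approaches, but nothing worked. If anyone knows an answer, please let me know. I got help from the site below, but it did not solve the problem. I do not want to strip the HTML special characters, because I am reading these p tags and using them to rebuild my own HTML page.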
http://www.developerfeed.com/xml/common/issues/xml-parsing-failing-due-encoding-not-being-utf-8
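Answer 0 (score: 1)
Each aTitleElement (<p>...</p>) contains several Nodes, one of which is an entity reference. So instead of calling getFirstChild you have to iterate over all the children; normalize does not help there.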
StringBuilder pText = new StringBuilder();
NodeList children = aTitleElement.getChildNodes();
for (int j = 0; j < children.getLength(); ++j) {
    Node child = children.item(j);
    if (child.getNodeType() == Node.ENTITY_REFERENCE_NODE) {
        ...
    }
    pText.append(child.getNodeValue());
}
nodeText = pText.toString();
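As an alternative to the manual loop, the standard DOM method Node.getTextContent() concatenates the text of all descendants of a node, so nothing after the first child is dropped. The following is only a minimal sketch under stated assumptions: the class name ParagraphText, the SAMPLE string, and the internal DTD subset that declares pound, ldquo and rdquo are all made up for illustration, so that the example parses without any network access.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original question or answer.
class ParagraphText {

    // Assumed sample input: the entities are declared in an internal DTD subset
    // so the parser can resolve them without fetching an external DTD.
    static final String SAMPLE =
        "<!DOCTYPE html ["
        + "<!ENTITY pound \"&#163;\">"
        + "<!ENTITY ldquo \"&#8220;\">"
        + "<!ENTITY rdquo \"&#8221;\">"
        + "]>"
        + "<html><body>"
        + "<p>It went from &pound;488 to &pound;600</p>"
        + "<p>Ronals said: &ldquo;Schools here are going downhill&rdquo;</p>"
        + "</body></html>";

    // Returns the full text of every <p> element, entity text included.
    static List<String> paragraphs(String xhtml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        NodeList pNodes = doc.getElementsByTagName("p");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < pNodes.getLength(); i++) {
            // getTextContent() walks all child nodes, not just the first one
            result.add(pNodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        for (String p : paragraphs(SAMPLE)) {
            System.out.println("<p>" + p + "</p>");
        }
    }
}
```

Note that getTextContent() also flattens nested markup such as <b> inside a paragraph, so it only fits when plain text per paragraph is enough.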
Test file
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title></title>
</head>
<body>
<p>Saluton,&pound;&ldquo; mondo!</p>
</body></html>
My code
DocumentBuilder builder =
        DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document embeddedDoc = builder.parse(new File("/home/joop/test.html"));
NodeList pNodes = embeddedDoc.getElementsByTagName("p");
StringBuilder pText = new StringBuilder();
for (int i = 0; i < pNodes.getLength(); ++i) {
    Element pElement = (Element) pNodes.item(i);
    NodeList children = pElement.getChildNodes();
    for (int j = 0; j < children.getLength(); ++j) {
        Node child = children.item(j);
        String value = child.getNodeValue();
        if (value == null) {
            System.out.println("node name=" + child.getNodeName()
                    + ": " + child.getNodeType());
        }
        pText.append(value);
    }
    pText.append("\n");
}
String text = pText.toString();
System.out.println("FOUND TEXT:");
System.out.println(text);
Result
FOUND TEXT:
Saluton,£“ mondo!
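The test file names the XHTML 1.0 Strict DTD in its DOCTYPE, and that DTD is where named entities such as pound and ldquo are declared, so a parser that loads external DTDs needs to reach it. If depending on the network is undesirable, one option is an EntityResolver that hands the parser a small local stand-in. This is a sketch under assumptions, not the answer's method: the class name LocalEntities and the MINI_DTD string are hypothetical, and the stand-in declares only the three entities listed, so any other named entity would still fail to parse.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original answer.
class LocalEntities {

    // Assumed stand-in for the external DTD: entity declarations only.
    static final String MINI_DTD =
        "<!ENTITY pound \"&#163;\">\n"
        + "<!ENTITY ldquo \"&#8220;\">\n"
        + "<!ENTITY rdquo \"&#8221;\">\n";

    static Document parse(String xhtml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Answer every external entity request (here: the DTD) from memory
        // instead of letting the parser open the system identifier URL.
        builder.setEntityResolver(
            (publicId, systemId) -> new InputSource(new StringReader(MINI_DTD)));
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        String page =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
            + "<p>Saluton,&pound;&ldquo; mondo!</p>"
            + "</body></html>";
        Document doc = parse(page);
        System.out.println(doc.getElementsByTagName("p").item(0).getTextContent());
    }
}
```

A fuller version would return a local copy of the real XHTML DTD and its entity files, chosen by publicId, rather than one fixed string.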
Answer 1 (score: 0)
Here is code that extracts the data. Fill in the URL of the website.
* New code
public void process() {
    HttpGet getMethod = new HttpGet("URL OF THE WEB SITE GOES HERE");
    try {
        ResponseHandler<String> responseHandler = new BasicResponseHandler();
        // client is an HttpClient field of the enclosing class (not shown)
        String websiteBody = client.execute(getMethod, responseHandler);
        String title = extractBody(websiteBody);
    } catch (Exception e) {
        // Report download or parse failures
        e.printStackTrace();
    }
}

private String extractBody(String responseBody) throws Exception {
    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    // Parse from a Reader so the String is not re-encoded to bytes first
    Document embeddedDoc = builder.parse(new InputSource(new StringReader(responseBody)));
    //ByteArrayInputStream encXML = new ByteArrayInputStream(responseBody.getBytes("UTF8"));
    //Document embeddedDoc = builder.parse(encXML);
    //Document embeddedDoc = builder.parse(new File("/home/joop/test.html"));
    NodeList pNodes = embeddedDoc.getElementsByTagName("p");
    StringBuilder pText = new StringBuilder();
    for (int i = 0; i < pNodes.getLength(); ++i) {
        Element pElement = (Element) pNodes.item(i);
        NodeList children = pElement.getChildNodes();
        for (int j = 0; j < children.getLength(); ++j) {
            Node child = children.item(j);
            String value = child.getNodeValue();
            if (value == null) {
                System.out.println("node name=" + child.getNodeName()
                        + ": " + child.getNodeType());
                // convert(...) is a helper that is not shown; it maps the node name to text.
                // Assigning its result directly avoids prepending the string "null".
                value = convert(child.getNodeName());
            }
            System.out.println(value);
            pText.append(value);
        }
        pText.append("\n");
    }
    String text = pText.toString();
    System.out.println("FOUND TEXT:");
    System.out.println(text);
    return text;
}
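The convert helper called above is not shown in the answer. One plausible shape for it, purely as an assumption, is a lookup from entity name to character. In the sketch below the class name EntityNames, the table contents, and the fallback behaviour are all hypothetical; the table covers only a handful of entities and would need extending for real pages.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the answer's unshown convert(...) helper.
class EntityNames {

    private static final Map<String, String> KNOWN = new HashMap<>();
    static {
        KNOWN.put("pound", "\u00a3"); // pound sign
        KNOWN.put("ldquo", "\u201c"); // left double quotation mark
        KNOWN.put("rdquo", "\u201d"); // right double quotation mark
        KNOWN.put("nbsp", "\u00a0");  // no-break space
        KNOWN.put("amp", "&");
    }

    // Maps an entity name such as "pound" to its character.
    // Unknown names are written back as an entity reference ("&name;")
    // so the text is preserved rather than silently dropped.
    static String convert(String entityName) {
        String replacement = KNOWN.get(entityName);
        return replacement != null ? replacement : "&" + entityName + ";";
    }

    public static void main(String[] args) {
        System.out.println(convert("pound") + "488");
        System.out.println(convert("hellip"));
    }
}
```

Because the loop above calls convert for every child whose value is null, a nested element such as <b> would also reach it; checking child.getNodeType() == Node.ENTITY_REFERENCE_NODE before converting would keep element names out of the lookup.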