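I'm trying to fetch some HTML source from an internal forum. To stay independent of the forum software, we are using Node.js, Express and similar tools.
When I open the page directly, I get the following HTML: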
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="content-type" content="text/html; charset=us-ascii" />
<meta name="description" content="myForum" />
<meta name="viewport" content="width=320; user-scalable=no" />
<title>myForum</title>
</head>
<body>
<table>
<tr>
<td align="left" valign="top" width="100%">
<center>
<h1><img class="banner" src=
"./img/myForum.jpg" width="730"
height="117" border="0" alt="myForum" /></h1>
</center>
<hr />
<center>
[ <a href="answer.php?id=975710">Antworten</a> ] [
<a href="index.php">Forum</a> ] [ <a href=
"newEntries.php">Neue Beiträge</a> ]
</center>
<hr />
<h1>sCHween</h1>geschrieben von <font color=
"#FFFFFF">User1</font> am 18.06.2014 um 21:26:15
<hr />
This is my text! It could contain images and links!
<img src="http://images.google.ch/intl/en_ALL/images/srpr/logo11w.png" /><br />
<a href="http://www.google.com/">Google</a>
<br />
<hr />
<b>Antworten:</b><br />
<a href="thread.php?id=9752">Re:
sCHween</a> - <b><font color=
"#FFFFFF">User2</font></b> - 18.06.2014 22:56:27<br />
<a href="showentry.php?id=9756">Re:
sCHween</a> - <b><font color=
"#FFFFFF">User2</font></b> - 18.06.2014 23:14:44<br />
<a href="showentry.php?id=9753">Re:
sCHween</a> - <b><font color=
"#FFFFFF">User1</font></b> - 18.06.2014 23:02:21<br />
<a href="showentry.php?id=975713">Re:
sCHween</a> - <b><font color=
"#FFFFFF">User1</font></b> - 18.06.2014 21:46:13<br />
<a href="showentry.php?id=9720">Re:
sCHween</a> - <b><font color=
"#FFFFFF">User3</font></b> - 18.06.2014 22:22:25<br />
<a href="showentry.php?id=9755">Re:
sCHween</a> - <b><font color=
"#FFFFFF">User4</font></b> - 18.06.2014 21:52:51<br />
<hr />
<span>
<a href="answer.php?id=975">Antworten</a><br />
<a href="recent.php">Neue Beiträge</a><br />
</span>
<hr />
</td>
</tr>
</table>
</body>
</html>
What we want is the HTML source of everything between two of the hr tags:
This is my text! It could contain images and links!
<img src="http://images.google.ch/intl/en_ALL/images/srpr/logo11w.png" /><br />
<a href="http://www.google.com/">Google</a>
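Is there a simple way to get the source between two hr tags, or what is the cleanest way to extract this content?
Answer 0 (score: 0)
jsdom is a great tool for DOM parsing in Node. Since you want both text nodes and regular elements converted to strings, we have to distinguish between the two: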
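    var jsdom = require("jsdom");

    // jsdom.env() is the callback-style API of older jsdom releases (removed in v10).
    jsdom.env(
        'http://example.com',                 // placeholder for the forum page URL
        ['http://code.jquery.com/jquery.js'], // inject jQuery into the page
        function (errors, window) {
            var $hr = window.$('hr'),
                node = $hr.get(2).nextSibling, // first node after the third <hr>
                endNode = $hr.get(3),          // stop at the fourth <hr>
                html = '';

            while (node && node !== endNode) {
                if (node.nodeType === 3) {
                    // text node: it has no outerHTML, so take its text
                    html += node.textContent;
                } else if (node.nodeType === 1) {
                    // element node: serialize the element including its tag
                    html += node.outerHTML;
                }
                node = node.nextSibling;
            }

            console.log(html);
        }
    );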
Now html has the following value:
This is my text! It could contain images and links!
<img src="http://images.google.ch/intl/en_ALL/images/srpr/logo11w.png"><br>
<a href="http://www.google.com/">Google</a>
<br>
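If the walk is needed in more than one place, it can be pulled out into a small function. The following is only a sketch: htmlBetween is a hypothetical helper name, and the nodes at the bottom are plain stand-in objects (not real DOM nodes) so that the snippet runs without jsdom. Unlike the loop above, it skips comment nodes explicitly.

```javascript
// Node type constants as defined by the DOM specification.
var ELEMENT_NODE = 1;
var TEXT_NODE = 3;

// Hypothetical helper: serialize every sibling after startNode up to,
// but not including, endNode.
function htmlBetween(startNode, endNode) {
    var html = '';
    var node = startNode.nextSibling;
    while (node && node !== endNode) {
        if (node.nodeType === TEXT_NODE) {
            html += node.textContent;
        } else if (node.nodeType === ELEMENT_NODE) {
            html += node.outerHTML;
        }
        // Comments and other node types are skipped.
        node = node.nextSibling;
    }
    return html;
}

// Stand-in nodes (plain objects, NOT real DOM nodes) to exercise the helper.
function link(nodes) {
    nodes.forEach(function (n, i) { n.nextSibling = nodes[i + 1] || null; });
    return nodes;
}
var nodes = link([
    { nodeType: 1, outerHTML: '<hr>' },
    { nodeType: 3, textContent: 'This is my text! ' },
    { nodeType: 8, textContent: ' a comment ' },
    { nodeType: 1, outerHTML: '<a href="http://www.google.com/">Google</a>' },
    { nodeType: 1, outerHTML: '<hr>' }
]);
var between = htmlBetween(nodes[0], nodes[4]);
console.log(between);
```

Inside the jsdom callback this would be called as htmlBetween($hr.get(2), $hr.get(3)).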
Answer 1 (score: 0)
Not sure if this is what you want:
jQuery:
    // contents() returns text nodes as well as element children
    var allContent = $("td").contents();
    var hrCount = 0;
    var addContent = false;
    var result = "";

    allContent.each(function () {
        if ($(this).prop('tagName') === "HR") {
            hrCount++;
            if (hrCount === 3) {
                addContent = true;   // start collecting after the third <hr>
            }
            if (hrCount === 4) {
                addContent = false;  // stop at the fourth <hr>
            }
        } else if (addContent) {
            if (this.nodeType === 1) {
                // element: keep its full markup
                result += this.outerHTML;
            } else {
                // text node: keep its text
                result += $(this).text();
            }
        }
    });

    alert(result);
There must be a cleaner solution...
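One dependency-free possibility, if the page source is already available as a string in Node, is to split it on the hr tags. This is only a sketch and not a real HTML parser: betweenHr is a hypothetical helper, the sample string is a shortened stand-in for the forum page, and the approach assumes that `<hr` never shows up inside comments, scripts or attribute values.

```javascript
// Hypothetical helper: return the markup between the n-th and (n+1)-th <hr>
// (n is 1-based), or null if there is no such pair.
function betweenHr(html, n) {
    // Matches <hr>, <hr/>, <hr /> and <hr ...attributes>, case-insensitively.
    var parts = html.split(/<hr\b[^>]*>/i);
    // parts[0] precedes the first <hr>, parts[n] follows the n-th one.
    if (n < 1 || n >= parts.length - 1) {
        return null; // the segment has no closing <hr>
    }
    return parts[n].trim();
}

// Shortened stand-in for the forum page (an assumption, not the real page).
var sample = [
    '<h1>sCHween</h1>',
    '<hr />',
    'This is my text!',
    '<a href="http://www.google.com/">Google</a>',
    '<br />',
    '<hr />',
    '<b>Antworten:</b>'
].join('\n');

var extracted = betweenHr(sample, 1);
console.log(extracted);
```

For the page in the question, the wanted block sits between the third and fourth hr, so the call would be betweenHr(pageHtml, 3), where pageHtml is whatever variable holds the fetched source. Unlike the DOM-based versions, this returns the markup exactly as written in the source (for example `<br />` rather than `<br>`).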