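I would like to use cheerio to run some logic over all of the HTML in a page's source. I can't work out how to do this, because I can't find a method in the cheerio library that does it. The closest thing I have found is the .each() method, but it seems I would need to match a tag first. I want to run the logic on every piece of HTML.
This is what I have so far: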
let $ = cheerio.load(pageSource);
This is the function I want to apply to every tag and its content:
summarizeContent(content) {
  let contentLength = content.length;
  let middle = " ....... ";
  if ((contentLength > this.contentSummarizeMinLength) && (contentLength < this.contentSummarizeMaxLength)) {
    let chunkIndex = 0;
    const increment = 5;
    while (contentLength > this.contentSummarizeMinLength) {
      chunkIndex += increment;
      contentLength -= 2 * increment;
    }
    content = content.substring(0, chunkIndex) + middle + content.substr(-chunkIndex);
  } else if (contentLength >= this.contentSummarizeMaxLength) {
    const chunk = 20;
    content = content.substring(0, chunk) + middle + content.substr(-chunk);
  }
  return content;
}
So, if this is my input, I should get the output below once the HTML has been transformed:
<ul id="fruits">
<li class="apple">AppleAppleAppleAppleAppleAppleAppleAppleApple</li>
<li class="orange">OrangeOrangeOrangeOrangeOrangeOrangeOrange</li>
<li class="pear">PearPearPearPearPearPearPearPearPearPearPearPear</li>
</ul>
Output:
<ul id="fruits">
<li class="apple">AppleApple ...... AppleApple</li>
<li class="orange">Orange ...... Orange</li>
<li class="pear">PearPear ...... PearPear</li>
</ul>
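The content inside the <li> tags doesn't matter. I just want to be able to apply a function to whatever sits between the HTML tags. It doesn't have to be cheerio; that was just the first tool I found that looked like it could do the job.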
Answer 0 (score: 2)
An HTML page is organized as a tree of nodes. You need to walk that tree, summarizing every text node you find and descending into every element node to process its text nodes in turn. A recursive function does this: for an element node it calls itself again, and for a text node it calls the summarizeContent function you wrote.
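// You can find a list of node types here: https://developer.mozilla.org/en/docs/Web/API/Node/nodeType
var TEXT_NODE_TYPE = 3,
    ELEMENT_NODE_TYPE = 1;
function summarizeElementNode(node) {
  node.contents().each(function(ix, el) {
    // nodeType lives on the raw node (el), not on the cheerio wrapper $(el).
    switch (el.nodeType) {
      case TEXT_NODE_TYPE:
        // A text node keeps its string in .data; overwrite it with the summary.
        // Assumes summarizeContent is reachable here as a plain function.
        el.data = summarizeContent(el.data);
        break;
      case ELEMENT_NODE_TYPE:
        // Recurse into the element to reach its own text nodes.
        summarizeElementNode($(el));
        break;
    }
  });
}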
Then you only need to call summarizeElementNode on the document root. Afterwards, $.html() returns the transformed markup.
var $ = cheerio.load(pageSource);
summarizeElementNode($.root());
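To see the recursion on its own, without cheerio, here is a minimal sketch that runs the same walk over a hand-built tree of plain objects. The object shape (nodeType, data, children), the name transformTextNodes, and the toy shorten function are all assumptions made for illustration; they are not part of cheerio's API, and shorten is not the summarizeContent from the question.

```javascript
// Node type constants, matching the DOM's Node.nodeType values.
const TEXT_NODE_TYPE = 3;
const ELEMENT_NODE_TYPE = 1;

// Recursively visit every descendant of `node`, rewriting text nodes in place.
// `transform` is any string -> string function (hypothetical parameter).
function transformTextNodes(node, transform) {
  for (const child of node.children || []) {
    if (child.nodeType === TEXT_NODE_TYPE) {
      child.data = transform(child.data);
    } else if (child.nodeType === ELEMENT_NODE_TYPE) {
      transformTextNodes(child, transform);
    }
  }
}

// Hand-built stand-in for a parsed <ul><li>...</li><li>...</li></ul> fragment.
const tree = {
  nodeType: ELEMENT_NODE_TYPE,
  name: "ul",
  children: [
    {
      nodeType: ELEMENT_NODE_TYPE,
      name: "li",
      children: [{ nodeType: TEXT_NODE_TYPE, data: "AppleAppleAppleApple" }],
    },
    {
      nodeType: ELEMENT_NODE_TYPE,
      name: "li",
      children: [{ nodeType: TEXT_NODE_TYPE, data: "Pear" }],
    },
  ],
};

// Toy transform: strings longer than 12 characters keep only their
// first and last 5 characters; shorter strings are returned unchanged.
const shorten = (s) =>
  s.length > 12 ? s.slice(0, 5) + " ... " + s.slice(-5) : s;

transformTextNodes(tree, shorten);
console.log(tree.children[0].children[0].data); // "Apple ... Apple"
console.log(tree.children[1].children[0].data); // "Pear"
```

Passing the transform in as an argument keeps the walk independent of how the text is shortened, so the same function could be handed a standalone version of summarizeContent instead.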