I have an HTML string from which I need to extract HTML substrings (summary, keywords, etc.). The string looks like this:
const content = `<p>
<strong>Summary</strong><br />Some text with <strong>HTML</strong> tags...<br /><br />
<strong>Keywords</strong> keyword1, keyword2,...<br /><br />
...
</p>`;
The goal is to get:
summary = "<br />Some text with <strong>HTML</strong> tags...<br /><br />"
keywords = "keyword1, keyword2,..."
For parsing I use the Cheerio library, which lets you use jQuery methods on the parsed HTML. I have tried the following approaches, but neither works:
Plain nextUntil():
const $ = cheerio.load(content);
console.log($("strong:contains('Summary')").nextUntil( "strong:contains('Keywords')" ).html());
// Returns: "Summary"
nextUntil() with a loop:
const $ = cheerio.load(content);
let container = $('<container/>');
for (let i = 0; i < $("strong:contains('Summary')").nextUntil("strong:contains('Keywords')").length; i++) {
  container.append($("strong:contains('Summary')").nextUntil("strong:contains('Keywords')")[i]);
}
console.log('container: ', container.html());
// Returns: "<strong>Summary</strong>"
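Answer 0 (score: 2)
The nextUntil() approach does not work because the sibling elements of the given <strong> element do not carry the content you want. nextUntil() returns only element siblings, while the plain text sits in text nodes that belong directly to the parent <p> element.
We will have to resort to some kind of regular-expression matching instead, as shown below (note that if the Summary and Keywords sections occur more than once, only the last occurrence of each is kept).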
// jQuery's .html() is used here to extract the innerHTML of the outer <p> element
const content = $(`<p>
<strong>Summary</strong><br />Some text with <strong>HTML</strong> tags...<br /><br />
<strong>Keywords</strong> keyword1, keyword2,...<br /><br />
...
</p>`).html();
// Splitting on a regex with a capture group keeps the captured heading names in the result
const arr = content.split(/<strong>(Summary|Keywords)<\/strong>/);
// Odd indices hold a heading name; the index after it holds that section's HTML
for (var i = 1; i < arr.length; i += 2) window[arr[i]] = arr[i + 1];
console.log('\nsummary:', Summary, '\nkeywords:', Keywords);
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
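The same split idea can be sketched without jQuery and without writing to window, which also makes it usable in Node. This is only a sketch: splitSections is a hypothetical helper name, the input is assumed to be the inner HTML of the <p> element, and the heading names are assumed to contain no regex metacharacters.

```javascript
// Hypothetical helper (not part of jQuery or Cheerio): splits the HTML on the
// given <strong> headings and returns an object keyed by heading name.
function splitSections(html, headings) {
  // Assumption: heading names contain no regex metacharacters.
  const pattern = new RegExp(`<strong>(${headings.join("|")})</strong>`);
  // Because the pattern has a capture group, split() keeps the heading names.
  const parts = html.split(pattern);
  const sections = {};
  // Odd indices are heading names, the following index is that section's HTML.
  for (let i = 1; i < parts.length; i += 2) {
    sections[parts[i]] = parts[i + 1].trim();
  }
  return sections;
}

// Assumed input: the inner HTML of the outer <p> element from the question.
const innerHtml = `
<strong>Summary</strong><br />Some text with <strong>HTML</strong> tags...<br /><br />
<strong>Keywords</strong> keyword1, keyword2,...<br /><br />
...
`;

const sections = splitSections(innerHtml, ["Summary", "Keywords"]);
console.log(sections.Summary);
console.log(sections.Keywords);
```

As with the snippet above, a heading that occurs more than once overwrites its earlier entry, and the Keywords entry still carries everything that follows it in the paragraph, including the trailing <br /> tags.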
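Answer 1 (score: 1)
I think the problem is that the "Summary" and "Keywords" text are not siblings of their respective headings.
You can parse the HTML string with a regular expression instead.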
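One way the regular-expression idea could look is sketched below. It assumes the markup has exactly the shape shown in the question (a Summary heading followed later by a Keywords heading, with the keywords ending at the first <br>); capture is a hypothetical helper name, not a library function.

```javascript
const content = `<p>
<strong>Summary</strong><br />Some text with <strong>HTML</strong> tags...<br /><br />
<strong>Keywords</strong> keyword1, keyword2,...<br /><br />
...
</p>`;

// Hypothetical helper: returns the first capture group, trimmed,
// or null when the pattern does not match.
function capture(html, pattern) {
  const match = html.match(pattern);
  return match ? match[1].trim() : null;
}

// Summary: everything between the Summary heading and the Keywords heading.
const summary = capture(
  content,
  /<strong>Summary<\/strong>([\s\S]*?)(?=<strong>Keywords<\/strong>)/
);

// Keywords: everything after the Keywords heading, up to the first <br>.
const keywords = capture(content, /<strong>Keywords<\/strong>([\s\S]*?)<br\s*\/?>/);

console.log(summary);  // <br />Some text with <strong>HTML</strong> tags...<br /><br />
console.log(keywords); // keyword1, keyword2,...
```

Regular expressions are brittle against HTML in general, so this is only reasonable while the structure of the string stays fixed; attributes on the <strong> tags or a different section order would need a different pattern.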
Answer 2 (score: 1)
Here is another approach; it is hacky, but it works:
const content = `<p>
<strong>Summary</strong><br />Some text with <strong>HTML</strong> tags...<br /><br />
<strong>Keywords</strong> keyword1, keyword2,...<br /><br />
...
</p>`,
html = $(content);
const summary = getHtml(html.find("strong:contains(Summary)"));
const keywords = getHtml(html.find("strong:contains(Keywords)"));
console.log(summary);
console.log(keywords);
// Collects the HTML of the nodes that follow the given heading, up to the next double <br>
function getHtml(heading) {
  const parts = [];
  let currentEl = heading.prop("nextSibling");
  // Stop when the siblings run out, so a missing double <br> cannot throw or loop forever
  while (currentEl) {
    const next = currentEl.nextSibling;
    // If the current and next node are both <br>, the end is reached
    if (currentEl.tagName === "BR" && next && next.tagName === "BR") {
      // If this is "Keywords", don't add the trailing <br> elements
      if (heading.text().trim() !== "Keywords") {
        // parts.push("<br><br>") would also work here
        parts.push(currentEl.outerHTML, next.outerHTML);
      }
      break;
    }
    // nodeType 1 = element, nodeType 3 = text
    parts.push(currentEl.nodeType === 1 ? currentEl.outerHTML : currentEl.textContent);
    currentEl = next;
  }
  return parts.join("").trim();
}
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>