Get the content between two elements

Date: 2019-08-28 14:03:24

Tags: javascript jquery cheerio

I have an HTML string from which I need to extract HTML substrings (summary, keywords, etc.). The string looks like this:

// A template literal is used because a quoted string cannot span several lines
const content = `<p>
<strong>Summary</strong><br />Some text with <strong>HTML</strong> tags...<br /><br />
<strong>Keywords</strong> keyword1, keyword2,...<br /><br />
...
</p>`;

The goal is to get:

summary = "<br />Some text with <strong>HTML</strong> tags...<br /><br />"
keywords = "keyword1, keyword2,..."

For parsing I use the Cheerio library, which lets you call jQuery methods on the parsed HTML. I have tried the following approaches, but none of them work:

Plain nextUntil():

const $ = cheerio.load(content);
console.log($("strong:contains('Summary')").nextUntil( "strong:contains('Keywords')" ).html());
// Returns: "Summary" 

nextUntil() with a loop:

const $ = cheerio.load(content);
let container = $('<container/>');
for (let i = 0; i < $("strong:contains('Summary')").nextUntil( "strong:contains('Keywords')" ).length; i++) {
  container.append($("strong:contains('Summary')").nextUntil( "strong:contains('Keywords')" )[i]);
}
console.log('container: ', container.html());
// Returns: "<strong>Summary</strong>" 

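3 answers:

Answer 0 (score: 2)

The nextUntil() approach does not work because the given <strong> element has no sibling elements that wrap the content you want. nextUntil() only returns element siblings, whereas the text you are after exists only as text nodes of the parent <p> element.

We have to fall back on some kind of regular-expression matching, as shown below (note that if the Summary or Keywords section occurs more than once, only the last occurrence of each is kept).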

const content = $("<p>\n\
<strong>Summary</strong><br />Some text with <strong>HTML</strong> tags...<br /><br />\n\
<strong>Keywords</strong> keyword1, keyword2,...<br /><br />\n\
...\n\
</p>").html(); // I use jQuery's html() to extract the innerHTML of the outer <p> element


const arr=content.split(/<strong>(Summary|Keywords)<\/strong>/);
for (var i=1;i<arr.length;i+=2) window[arr[i]]=arr[i+1];

console.log('\nsummary:',Summary,'\nkeywords:',Keywords);  
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
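
The snippet above stores its results in `window`, which does not exist in Node, where Cheerio normally runs. As a minimal sketch of the same split technique without globals, the version below collects the chunks into a plain object. The helper name `splitSections` is hypothetical, and it assumes the heading names contain no regular-expression metacharacters.

```javascript
// Same sample markup as in the question.
const content = `<p>
<strong>Summary</strong><br />Some text with <strong>HTML</strong> tags...<br /><br />
<strong>Keywords</strong> keyword1, keyword2,...<br /><br />
...
</p>`;

// Hypothetical helper: split on the heading tags and pair each captured
// heading name with the chunk that follows it.
// Assumption: heading names contain no regex metacharacters.
function splitSections(html, headings) {
  const pattern = new RegExp(`<strong>(${headings.join("|")})</strong>`);
  // Because the pattern has a capture group, split() keeps the heading names:
  // parts = [before, name1, chunk1, name2, chunk2, ...]
  const parts = html.split(pattern);
  const sections = {};
  for (let i = 1; i < parts.length; i += 2) {
    sections[parts[i]] = parts[i + 1]; // a later duplicate overwrites an earlier one
  }
  return sections;
}

const sections = splitSections(content, ["Summary", "Keywords"]);
console.log("summary:", sections.Summary);
console.log("keywords:", sections.Keywords);
```

As in the original snippet, the last chunk runs to the end of the string, so `sections.Keywords` still contains the trailing `<br />` tags and everything after them.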

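Answer 1 (score: 1)

I think the problem is that the "Summary" and "Keywords" text are not element siblings of their respective headings.

You could parse the HTML string with a regular expression instead.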
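
One way this could look, as a sketch rather than a definitive implementation: capture everything after a heading up to the first pair of consecutive `<br />` tags. The helper name `sectionAfter` is hypothetical, and the pattern assumes every section ends with a double `<br />` and that the heading contains no regular-expression metacharacters.

```javascript
// Same sample markup as in the question.
const content = `<p>
<strong>Summary</strong><br />Some text with <strong>HTML</strong> tags...<br /><br />
<strong>Keywords</strong> keyword1, keyword2,...<br /><br />
...
</p>`;

// Hypothetical helper: return the HTML that follows <strong>heading</strong>,
// up to and including the first two consecutive <br> tags, or null if absent.
function sectionAfter(html, heading) {
  const pattern = new RegExp(
    `<strong>${heading}</strong>([\\s\\S]*?<br\\s*/?><br\\s*/?>)`
  );
  const match = html.match(pattern);
  return match ? match[1] : null;
}

const summary = sectionAfter(content, "Summary");

// The desired keywords value has no trailing <br /> tags, so strip them.
const keywords = (sectionAfter(content, "Keywords") || "")
  .replace(/(<br\s*\/?>)+$/, "")
  .trim();

console.log(summary);  // <br />Some text with <strong>HTML</strong> tags...<br /><br />
console.log(keywords); // keyword1, keyword2,...
```

Regular expressions are brittle against markup changes (extra attributes, different nesting), so this only suits input whose shape is as fixed as the sample.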


Answer 2 (score: 1)

Here is another approach; it is hacky, but it works:

const content = `<p>
    <strong>Summary</strong><br />Some text with <strong>HTML</strong> tags...<br /><br />
    <strong>Keywords</strong> keyword1, keyword2,...<br /><br />
    ...
    </p>`,
    html = $(content);

const summary  = getHtml(html.find("strong:contains(Summary)"));
const keywords = getHtml(html.find("strong:contains(Keywords)"));

console.log(summary);
console.log(keywords);

function getHtml(html) {
    const summary = [];
    let currentEl = html.prop("nextSibling");

    while (true) {
        // If the current and next element are both <br>, the end is reached
        if (currentEl.tagName === "BR" && currentEl.nextSibling.tagName === "BR") {

            // If this is "Keywords", don't add the trailing <br> elements
            if (html.text().trim() !== "Keywords") {
                // summary.push("<br><br>") would also work here
                summary.push(currentEl.outerHTML, currentEl.nextSibling.outerHTML);
            }

            return summary.join("").trim();
        } else {
            // nodeType 1 = element
            // nodeType 3 = text
            const content = currentEl.nodeType === 1 ? currentEl.outerHTML : currentEl.textContent;

            // Push HTML string and continue
            summary.push(content);
            currentEl = currentEl.nextSibling;
        }
    }
}
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>