Question

I have some text stored in a database, like this:

let text = "<p>Some people live so much in the future they they lose touch with reality.</p><p>They don't just <strong>lose touch</strong> with reality, they get obsessed with the future.</p>"

The text can contain many paragraphs and HTML tags.

I also have a phrase:

let phrase = 'lose touch'

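What I want to do is search for phrase in text, then return the complete sentence that contains phrase inside a strong tag.

In the example above, even though the first paragraph also contains the phrase 'lose touch', it should return the second sentence, because that is where the phrase sits inside a strong tag. The result would be: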

They don't just <strong>lose touch</strong> with reality, they get obsessed with the future.

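On the client side, I could build a DOM tree from this HTML text, convert it to an array, and search each item in the array. But document is not available in NodeJS, so this is basically just plain text with HTML tags in it. How do I find the right sentence in this blob of text?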

Answer 1

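I think this might help you.

If I understand the question correctly, there is no need to involve the DOM.

This solution works even if the p or strong tags have attributes.

If you want to search inside tags other than p, just update the regular expression for them.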

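const search_phrase = "lose touch";
// Backslashes are doubled because these patterns are built from strings;
// a bare "\s" inside a string or template literal collapses to a plain "s".
// strong_regex has no "g" flag: with "g", test() keeps lastIndex between calls
// and can miss a match when the same regex is reused on another string.
const strong_regex = new RegExp(`<\\s*strong[^>]*>${search_phrase}<\\s*/\\s*strong>`);
const paragraph_regex = new RegExp("<\\s*p\\b[^>]*>(.*?)<\\s*/\\s*p>", "g");
const text = "<p>Some people live so much in the future they they lose touch with reality.</p><p>They don't just <strong>lose touch</strong> with reality, they get obsessed with the future.</p>";

// With the "g" flag, match() returns every full <p>...</p> match, or null if there is none.
const paragraphs = text.match(paragraph_regex);

if (paragraphs && paragraphs.length) {
    const paragraphs_with_strong_text = paragraphs.filter(paragraph => {
        return strong_regex.test(paragraph);
    });
    console.log(paragraphs_with_strong_text);
    // logs an array with one element, the second paragraph:
    // <p>They don't just <strong>lose touch</strong> with reality, they get obsessed with the future.</p>
}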

P.S. The code is not optimized; you can change it to fit the requirements of your application.
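
The question asks for the sentence rather than the whole paragraph, so here is a minimal sketch of one way to narrow the result down. It is not part of the code above: the names escapeRegExp and findSentencesWithStrongPhrase are made up for illustration, and the sentence split is a naive assumption (a sentence ends at ".", "!" or "?" followed by whitespace), so abbreviations or punctuation inside tags will trip it up.

```javascript
// Hypothetical helper: escape regex metacharacters so the phrase is matched literally.
function escapeRegExp(s) {
    return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: return every sentence in which `phrase` is wrapped in a strong tag.
function findSentencesWithStrongPhrase(html, phrase) {
    // No "g" flag here, so test() carries no state from one call to the next.
    const strongRegex = new RegExp(
        "<\\s*strong\\b[^>]*>\\s*" + escapeRegExp(phrase) + "\\s*<\\s*/\\s*strong\\s*>",
        "i"
    );
    // [\s\S] is used instead of "." so paragraphs that span several lines still match.
    const paragraphRegex = /<\s*p\b[^>]*>([\s\S]*?)<\s*\/\s*p\s*>/gi;
    const results = [];
    for (const match of html.matchAll(paragraphRegex)) {
        // Assumption: a sentence ends at ".", "!" or "?" followed by whitespace.
        const sentences = match[1].split(/(?<=[.!?])\s+/);
        for (const sentence of sentences) {
            if (strongRegex.test(sentence)) {
                results.push(sentence.trim());
            }
        }
    }
    return results;
}

const sampleText = "<p>Some people live so much in the future they they lose touch with reality.</p><p>They don't just <strong>lose touch</strong> with reality, they get obsessed with the future.</p>";
console.log(findSentencesWithStrongPhrase(sampleText, "lose touch"));
```

On the sample text this returns only the second sentence, without the surrounding p tags. For HTML that is not this regular, a real parser is probably the safer route.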

Answer 2

There is cheerio, which is like jQuery for the server side. With it you can take the page as text, build a DOM from it, and search inside that.

Answer 3

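First, you can do var arr = text.split("<p>") so that you can work with each paragraph separately.

Then you can loop over the array and search for the phrase inside strong tags:

for (var i = 0; i < arr.length; i++) {
    // indexOf does a literal substring search, so the phrase is not interpreted as a regex
    if (arr[i].indexOf("<strong>" + phrase + "</strong>") != -1) {
        // arr[i] is the entire paragraph containing phrase inside strong tags, minus the leading "<p>"
        console.log("<p>" + arr[i]);
    }
}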

NodeJS: Extract a sentence from HTML text based on a phrase
