Question

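I am currently developing a Chrome extension that extracts data from a website's HTML document, but I need to build a filter so that I only get the data I actually want.

In this attempt, the extension takes the page's HTML and converts it to a string so that it can be manipulated easily: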

// Takes a string and counts how many times
// the word you are looking for occurs in it
function countWordInAString(string, word) {
    return (string.match(new RegExp(word, "g")) || []).length;
}

function getOutlookData(html) {
    var unreaded = countWordInAString(html, 'no leídos');
    var readed = countWordInAString(html, 'leídos');
    var totalMails = countWordInAString(html, 'id="AQAAA1thnTQBAAAEA7R1mgAAAAA="');
    var message = totalMails + 'Mails loaded! \n Mails readed: ' + readed + '\n Mails unreaded: ' + unreaded;

    return message + '\n' + "HTML:\n" + html;
}

It works in some specific cases, but for obfuscated sites (such as Outlook in this example) the results are wrong. What can I do to improve it?

Answer 1

Your "word" may contain special characters. Before passing it to the regular expression, escape them with backslashes, i.e.:

const encodeForReg = str => str.replace(/([^\s\w])/g, '\\$1');
function countWordInAString(string, word) {
    const encodedWord = encodeForReg(word);
    return (string.match(new RegExp(encodedWord, "g")) || []).length;
}
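
The effect of the escaping can be checked with a small sketch. The sample string and the words searched for below are hypothetical, chosen only because they contain regular-expression metacharacters; they are not taken from the Outlook page.

```javascript
// Escape every character that is neither whitespace nor a word character.
const encodeForReg = str => str.replace(/([^\s\w])/g, '\\$1');

function countWordInAString(string, word) {
    return (string.match(new RegExp(encodeForReg(word), "g")) || []).length;
}

// Hypothetical sample input, not real Outlook markup.
const sampleHtml = '<b>a.b</b> <b>axb</b> <i>(3)</i>';

// Unescaped, "." matches any character, so "axb" is counted as well.
const unescapedCount = (sampleHtml.match(new RegExp('a.b', 'g')) || []).length;

// Escaped, only the literal text "a.b" is counted.
const escapedCount = countWordInAString(sampleHtml, 'a.b');

// Unescaped, "(3)" would be a capture group matching just "3";
// escaped, the parentheses are matched literally.
const parenCount = countWordInAString(sampleHtml, '(3)');

console.log(unescapedCount, escapedCount, parenCount); // 2 1 1
```

Characters such as `=` and `"` have no special meaning in a JavaScript regular expression, so escaping them is harmless but does not change what is matched.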

With this escaping, the search string

id="AQAAA1thnTQBAAAEA7R1mgAAAAA="

becomes

id\=\"AQAAA1thnTQBAAAEA7R1mgAAAAA\=\"
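
Escaping alone may not explain the wrong totals, because none of the three search strings in the question contains a regular-expression metacharacter. A further possible cause, assuming the page text really contains both phrases, is that 'leídos' is a substring of 'no leídos', so the "read" count also includes every "unread" match. The sketch below shows the overlap on a hypothetical fragment (not real Outlook markup) and corrects it by subtraction; the variable names are illustrative only.

```javascript
// Escape every character that is neither whitespace nor a word character.
const encodeForReg = str => str.replace(/([^\s\w])/g, '\\$1');

function countWordInAString(string, word) {
    return (string.match(new RegExp(encodeForReg(word), "g")) || []).length;
}

// Hypothetical fragment with two unread mails and one read mail.
const fragment = '<span>no leídos</span><span>no leídos</span><span>leídos</span>';

const unread = countWordInAString(fragment, 'no leídos');

// 'leídos' also matches inside each 'no leídos', so this counts all three spans.
const readOrUnread = countWordInAString(fragment, 'leídos');

// Subtract the unread matches to get the read-only count.
const read = readOrUnread - unread;

console.log({ unread, readOrUnread, read }); // { unread: 2, readOrUnread: 3, read: 1 }
```

Subtracting is the smallest change to the original approach. It only holds if every occurrence of 'no leídos' should be treated as an unread mail, which is an assumption about the page.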
