Question

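I am building a service that parses a PDF to text. Once I have that text, I have to match it against an array of words. Every time there is a match, a counter is incremented. So far so good. The difficulty is that, once the PDF has been parsed to text, I cannot tell which PDF page I am on. I have noticed that, in the parsed text, two consecutive newlines (\n\n) always indicate a page change.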

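What I would like to do is detect when the page has changed and, besides counting the total number of times a word was found, also report which page it was found on.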

Example

let data =  `resignations / resignations. adm. mancom .: berenguer llinares
appointments. adm. unique: calvo valenzuela. other concepts: change of the administrative body:
joint administrators to sole administrator. change of registered office. ptda colomer, 6

Official Gazette of the Commercial Registry
no. 182 Friday, September 18, 2020 p. 33755
cve: borme-a-2020-182-03 verifiable in
sarria). registry data. t 2257, f 100, s 8, h a 54815, i / a 4 (10.09.20) .`



let wordsToSearch = ['resignations', "administrators"]

    wordsToSearch.forEach((word) => {
        // inside here I would also like to keep track of the page
        let stringArray = data.split(' ');
        let count = 0;
        let result = ""
        for (var i = 0; i < stringArray.length; i++) {
            let wordText = stringArray[i];
            if (new RegExp(word).test(wordText)) {
                count++
            }
        }
        // the expected result would be: word has appeared count times, on pages ...
        result += `${word} has appeared ${count} times\n`
        console.log(result)
        /*
        resignations has appeared 2 times

        administrators has appeared 1 times
        */
    })

If someone comes up with another approach, that would be great too.

Answer 1

You can split the text at the double newlines and then analyse each page separately. This is how I would handle it:

let data = `resignations / Friday resignations. adm. mancom .: berenguer llinares
            appointments. adm. unique: calvo Friday valenzuela. other concepts: change of the administrative body:
            joint administrators to sole administrator. change of registered office. ptda colomer, 6, Friday

            Official Gazette of the Commercial Registry
            no. 182 Friday, September 18, 2020 p. 33755
            cve: borme-a-2020-182-03 verifiable in
            sarria). registry data. t 2257, f 100, s 8, h a 54815, i / a 4 (10.09.20) .`


function analyseText(text, wordsToFind) {
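    // Split on the blank line that marks a page change; use the `text` parameter, not the global `data`
    const pages = text.split("\n\n");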
    const result = {};
    for (let pageIndex = 0; pageIndex < pages.length; pageIndex++) {
        analysePage({
            pageIndex,
            pageText: pages[pageIndex]
        }, wordsToFind, result);
    }
    return Object.keys(result).map(k => result[k]);
}

function analysePage(page, wordsToFind, result) {
    const {
        pageText,
        pageIndex
    } = page;
    wordsToFind.forEach(word => {
        const count = (pageText.match(new RegExp(word, 'g')) || []).length;
        if (count > 0) {
            if (!result[word]) {
                result[word] = {
                    name: word,
                    pageIndices: [],
                    count: 0
                };
            }
            result[word].pageIndices.push(pageIndex);
            result[word].count += count;
        }
    });

}

const result = analyseText(data, ['resignations', "administrators", "Friday"]);
console.log(result);

In this example I only print the results per page, but you could of course build up a result object that stores the results for each page.
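
As a rough sketch of such a result object, the variant below stores the count for each page alongside the total. It assumes, as the question does, that "\n\n" always marks a page change. The function name countWordsPerPage and the fields total and countsByPage are made up for illustration, and the search words are escaped so that they are matched literally rather than as regular expressions.

```javascript
// Hypothetical helper: counts each word per page, assuming pages are
// separated by a blank line ("\n\n") as described in the question.
function countWordsPerPage(text, wordsToFind) {
    const pages = text.split("\n\n");
    const result = {};
    for (const word of wordsToFind) {
        // Escape regex metacharacters so the word is matched literally.
        const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
        const pattern = new RegExp(escaped, "g");
        const entry = { word, total: 0, countsByPage: {} };
        pages.forEach((pageText, pageIndex) => {
            const count = (pageText.match(pattern) || []).length;
            if (count > 0) {
                // Store a 1-based page number so it reads like a PDF page.
                entry.countsByPage[pageIndex + 1] = count;
                entry.total += count;
            }
        });
        result[word] = entry;
    }
    return result;
}

const sample = "resignations / resignations. joint administrators\n\nGazette p. 33755, resignations";
console.log(countWordsPerPage(sample, ["resignations", "administrators"]));
```

Note that this still counts substring matches, just like the code above, so "administrator" would also be found inside "administrators". If only whole words should count, wrapping the escaped word in \b word boundaries is one option.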
