Question

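I have a large JSON dataset A (180,000 records) containing complete user records, and another JSON dataset B (a subset of A, about 1,500 records) that contains only the unique ID and name of some users. I need to get the full records from dataset A for the users in dataset B.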

Here is what I have tried so far:

let detailedSponsoreApplicants = [];
let j;
for (j = 0; j < allApplicants.length; j++) {
    let a = allApplicants[j];

    let i;
    for (i = 0; i < sponsoredApplicants.length; i++) {
        let s = sponsoredApplicants[i];
        if (s && s.number === a.applicationNumber) {
            detailedSponsoreApplicants.push(a);
        } else {
            if (s) {
                logger.warn(`${s.number} not found in master list`);
            }
        }
    }
}

The problem with the code above is that at some point I get the error FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory

So how can I accomplish this task efficiently and without errors?

Edit - JSON samples

Dataset A
{
  "applicationNumber": "3434343",
  "firstName": "dcds",
  "otherNames": "sdcs",
  "surname": "sdcs",
  "phone": "dscd",
  .
  .
  .
  "stateOfOrigin": "dcsd"
}

Dataset B
{
    "number": "3434343",
    "fullName": "dcds sdcs sdcs"
}

Answer 1

Try giving Node more memory with one of the following:

node --max-old-space-size=1024 index.js #increase to 1gb
node --max-old-space-size=2048 index.js #increase to 2gb
node --max-old-space-size=3072 index.js #increase to 3gb
node --max-old-space-size=4096 index.js #increase to 4gb
node --max-old-space-size=5120 index.js #increase to 5gb
node --max-old-space-size=6144 index.js #increase to 6gb
node --max-old-space-size=7168 index.js #increase to 7gb
node --max-old-space-size=8192 index.js #increase to 8gb
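
To check that the flag was actually picked up, you can print the heap limit V8 reports at startup. This is only a small sketch using Node's built-in v8 module; heapLimitMiB is a helper name I made up, and the reported figure should be roughly the value you passed rather than an exact match.

```javascript
// Sketch: report the V8 heap limit of the current Node process.
const v8 = require('v8');

// Hypothetical helper: heap_size_limit is reported in bytes, so convert to MiB.
function heapLimitMiB() {
  return Math.round(v8.getHeapStatistics().heap_size_limit / (1024 * 1024));
}

console.log(`heap limit: ${heapLimitMiB()} MiB`);
```

Run it with the same --max-old-space-size flag as index.js and compare the printed number with what you asked for.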

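Also, your script will probably take a long time to run. If you want to improve performance, consider using a Map, or convert the large array into an object for fast lookups: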

const obj = a.reduce((obj, current) => {
  obj[current.applicationNumber] = current;
  return obj;
}, {});

Then you can look up the full details in constant time:

const fullDetailsOfFirstObject = obj[B[0].number];
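
Put together with a Map, the whole filter might look roughly like the sketch below. The two small arrays are made-up stand-ins shaped like the samples in the question, and findDetailedApplicants is a name I chose for illustration. Note that the original nested loop runs 180,000 × 1,500 = 270,000,000 iterations and logs a warning on every non-matching pair, which is probably where the time goes and possibly the memory as well; here each ID in B is reported at most once.

```javascript
// Hypothetical stand-ins for dataset A and dataset B.
const allApplicants = [
  { applicationNumber: '3434343', firstName: 'dcds', surname: 'sdcs' },
  { applicationNumber: '1111111', firstName: 'aaaa', surname: 'bbbb' },
];
const sponsoredApplicants = [
  { number: '3434343', fullName: 'dcds sdcs sdcs' },
  { number: '9999999', fullName: 'not in master list' },
];

function findDetailedApplicants(all, sponsored) {
  // One pass over A: index every full record by its application number.
  const byNumber = new Map(all.map((a) => [a.applicationNumber, a]));

  const found = [];
  const missing = [];
  // One pass over B: each lookup is a single Map.get call.
  for (const s of sponsored) {
    if (!s) continue; // skip empty entries, as the original code does
    const full = byNumber.get(s.number);
    if (full) {
      found.push(full);
    } else {
      missing.push(s.number);
    }
  }
  return { found, missing };
}

const { found, missing } = findDetailedApplicants(allApplicants, sponsoredApplicants);
console.log(found.length, missing);
```

You can then log the missing numbers once after the loop instead of inside it.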

Answer 2

Perhaps not the most efficient approach, but one that works:

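1) Import dataset A (the huge dataset) into a database, e.g. sqlite or whichever database you are familiar with.

2) Add an index on the applicationNumber field.

3) Query the database for each element in dataset B, or try batch queries (selecting several at once).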

I have done this before in a similar use case and it worked, but in your case there may well be a better way.
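
For the batch variant of step 3, a rough sketch of how the queries could be built is below. The table name applicants, the index name, and the helpers chunk and buildBatchQueries are all assumptions for illustration; the code only builds SQL strings and parameter lists, which you would then hand to whichever database driver you use. The batch size of 500 is a guess chosen to stay under SQLite's cap on bound parameters per statement (999 by default in older builds).

```javascript
// Hypothetical table and index names; adjust to your actual schema.
const CREATE_INDEX_SQL =
  'CREATE INDEX IF NOT EXISTS idx_applicants_number ON applicants (applicationNumber)';

// Split the IDs from dataset B into chunks of at most `size` items.
function chunk(items, size) {
  const out = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Build one parameterized SELECT per chunk, with one "?" placeholder per ID.
function buildBatchQueries(numbers, batchSize = 500) {
  return chunk(numbers, batchSize).map((batch) => ({
    sql: `SELECT * FROM applicants WHERE applicationNumber IN (${batch
      .map(() => '?')
      .join(', ')})`,
    params: batch,
  }));
}

const queries = buildBatchQueries(['3434343', '1111111', '2222222'], 2);
console.log(CREATE_INDEX_SQL);
console.log(queries);
```

With about 1,500 IDs in dataset B and the default batch size, this comes to three queries instead of 1,500.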

Filtering a large JSON dataset based on another dataset
