pdf2json:如何自定义输出json文件?

时间:2014-02-24 05:07:32

标签: json parsing pdf

我可以自定义pdf2json命令行实用程序的输出,以便输出json文件具有特定的结构吗?

我正在尝试从pdf中提取数据(请参见下图)并将其存储为json文件。

fig 1

我试过了pdf2json -f [input directory or pdf file]。该命令输出一个包含我需要的信息的json文件,但它也包含了许多我不需要的信息:

  

{"formImage":{"Transcoder":"pdf2json@0.6.6","Agency":"","Id":{"AgencyId":"","Name":"","MC":false,"Max":1,"Parent":""},"Pages":[{"Height":49.5,"HLines":[{"x":13.111828125000002,"y":4.678418750000001,"w":0.44775000000000004,"l":78.96384375000001},{"x":13.111828125000002,"y":44.074375,"w":0.44775000000000004,"l":78.96384375000001}],"VLines":[],"Fills":[{"x":0,"y":0,"w":0,"h":0,"clr":1}],"Texts":[{"x":13.632429687500002,"y":4.382312499999998,"w":4.163000000000001,"clr":0,"A":"left","R":[{"T":"abundant","S":-1,"TS":[0,13.9091,0,0]}]},{"x":25.021517303398443,"y":4.382312499999998,"w":4.139000000000001,"clr":0,"A":"left","R":[{"T":"positive%3A1","S":-1,"TS":[0,13.9091,0,0]}]},{"x":32.38324218816407,"y":4.382312499999998,"w":4.412000000000001,"clr":0,"A":"left","R":[{"T":"negative%3A0","S":-1,"TS":[0,13.9091,0,0]}]},{"x":40.12887364285157,"y":4.382312499999998,"w":3.1670000000000003,"clr":0,"A":"left","R":[{"T":"anger%3A0","S":-1,"TS":[0,13.9091,0,0]}]},{"x":46.1237223885547,"y":4.382312499999998,"w":5.993,"clr":0,"A":"left","R":[{"T":"anticipation%3A0","S":-1,"TS":[0,13.9091,0,0]}]},{"x":56.09123069480469,"y":4.382312499999998,"w":3.8400000000000003,"clr":0,"A":"left","R":[{"T":"disgust%3A0","S":-1,"TS":[0,13.9091,0,0]}]},{"x":63.0324864791797,"y":4.382312499999998,"w":2.4170000000000003,"clr":0,"A":"left","R":[{"T":"fear%3A0","S":-1,"TS":[0,13.9091,0,0]}]},{"x":67.97264684597657,"y":4.382312499999998,"w":2.109,"clr":0,"A":"left","R":[{"T":"joy%3A1","S":-1,"TS":[0,13.9091,0,0]}]},{"x":72.47968185183595,"y":4.382312499999998,"w":4.013,"clr":0,"A":"left","R":[{"T":"sadness%3A0","S":-1,"TS":[0,13.9091,0,0]}]},{"x":79.66421908894532,"y":4.382312499999998,"w":4.178000000000001,"clr":0,"A":"left","R":[{"T":"surprise%3A0","S":-1,"TS":[0,13.9091,0,0]}]},{"x":87.08078776941407,"y":4.382312499999998,"w":2.8930000000000002,"clr":0,"A":"left","R":[{"T":"trust%3A0","S":-1,"TS":[0,13.9091,0,0]}]},{"x":13.632429687500002,"y":5.017468750000002,"w":2.4480000000000004,"clr":0,"A":"left","R":

我只需要pdf文件中的文字。我不需要任何有关格式的信息。所以我需要这样的东西:

{"data":
    {
    "abundant": {
        "positive":1,
        "negative":0,
        "anger":0,
        ...
        },
    "abuse": {...},
    "abutment": {...},
    ...
    }
}

1 个答案:

答案 0 :(得分:0)

我构建了一个Node.js模块,该模块使用pdf2json和一些简单的数学运算来从PDF中提取表格数据。输出是一个行数组。

https://www.npmjs.com/package/pdf2table