Parsing a large JSON file

Time: 2014-12-23 14:38:24

Tags: java json swing

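I have a question about converting JSON to CSV, specifically a memory problem (at least I think it is one). I wrote some functions that should handle this, and they work for small JSON files. With large JSON files, the JFrame freezes and nothing happens for several minutes (I killed the process with Task Manager after about 5 minutes). The source JSON file has about 30,000 lines.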

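What I want to do:

  • Read the (large) JSON file.
  • Correct it (some values are not valid JSON, e.g. "actor" : "ObjectId("12345")" should be corrected to "actor" : "12345").
  • Split the large JSON file into smaller files.
  • Do the processing on the small JSON files.

What I have so far: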

public void mongoExportAndSplitFilter() {
    ReadFileAndSave reader = new ReadFileAndSave();
    String jsonFilePath = this.converterView.sourceTextField.getText();
    //String targetFilePath = this.converterView.targetTextField.getText();
    File jsonFile = new File(jsonFilePath);
    Scanner scanner = new Scanner(reader.readFileAndCorrectOutput(jsonFile));
    int j = 0;
    StringBuffer sb = new StringBuffer();
    reader.readPartOfFileAndSave("src/main/resources", scanner, j, sb);
    //System.out.println("STEP 1: INPUT FILE (" + jsonFilePath + ") HAS BEEN CORRECTED!");
    //System.out.println("STEP 2: INPUT FILE (" + jsonFilePath + ") HAS BEEN SPLITTED WHILE PARSING!");
    this.filterView.setVisible(false);
    this.filterView.dispose();
    this.filterFlag = 1;
}

/**
 * Utility function to correct the MongoExport-JSON-Output.
 *
 * @param file The file which should be corrected.
 * @return Returns the correct JSON-String.
 */
public String readFileAndCorrectOutput(File file) {
    String jsonStringCorrected = "";
    StringBuffer sb = new StringBuffer();
    try {
        Scanner scanner = new Scanner(file);
        while (scanner.hasNext()) {
            String next = scanner.next();

            if (next.contains("ObjectId") || next.contains("ISODate")) {
                Matcher m = Pattern.compile(this.regEx)
                        .matcher(next);

                if (m.find()) {
                    next = next.replaceAll(this.regEx, this.innerString);
                }
            }
            //jsonStringCorrected += next;
            sb.append(next);
        }
        scanner.close();

        jsonStringCorrected = sb.toString();
        JSONObject jsonObject = new JSONObject(jsonStringCorrected);
        jsonStringCorrected = jsonObject.toString(2);
    } catch (FileNotFoundException ex) {
        Logger.getLogger(ReadFileAndSave.class.getName()).log(Level.SEVERE, null, ex);
    }
    return jsonStringCorrected;
}

/*
 * Utility-function to read a json file part by part and save the parts to a separate json file.
 * @param   scanner     The scanner which contains the file and which returns the lines from the file.
 * @param   j               The counter of the file. As the file should change whenever the counter changes.
 * @return  jsonString  The content of the jsonString.
 */
public String readPartOfFileAndSave(String filepath, Scanner scanner, int j, StringBuffer sb) {


    String jsonString = "";
    int i = 0;
    ++j;
    while (scanner.hasNext()) {
        String token = scanner.next();

        //jsonString += token;
        sb.append(token);
        if (token.contains("{")) {
            i++;
        }
        if (token.contains("}")) {
            i--;
        }
        if (i == 0) {
            jsonString = sb.toString();
            JSONObject jsonObject = new JSONObject(jsonString);
            jsonString = jsonObject.toString(2);
            saveFile(filepath, "actor", j, jsonString);
            jsonString = readPartOfFileAndSave(filepath, scanner, j);
        }
    }
    return "";
}

Does anyone know how to solve this problem?

Edit

Here is a snippet of the file (the first 3 lines):

{ "verb" : "access", "target" : { "id" : "5485a7050ac61b1339a4da0e", "inquiryPhase" : "Orientation", "displayName" : "Orientation", "objectType" : "phase" }, "generator" : { "id" : "5485a7050ac61b1339a4da09", "displayName" : "LochemC", "objectType" : "ils", "url" : "http://graasp.eu/spaces/5485a7050ac61b1339a4da09" }, "provider" : { "id" : "5485a7050ac61b1339a4da09", "inquiryPhase" : "ils", "displayName" : "LochemC", "objectType" : "ils", "url" : "http://graasp.eu/spaces/5485a7050ac61b1339a4da09" }, "object" : { "id" : "5485a7050ac61b1339a4da09", "displayName" : "LochemC", "objectType" : "ils" }, "actor" : { "id" : "Bas Kollöffel (UT)@5485a7050ac61b1339a4da09", "displayName" : "Bas Kollöffel (UT)", "objectType" : "person" }, "published" : "2014-12-08T13:40:45.409Z", "publishedClient" : "2014-12-08T13:40:45.409Z", "publishedServer" : { "$date" : 1418046045490 }, "_id" : { "$oid" : "5485aa5dc372cdbb21daea33" } }
{ "verb" : "access", "target" : { "id" : "5485a7050ac61b1339a4da13", "inquiryPhase" : "Conceptualisation", "displayName" : "Conceptualisation", "objectType" : "phase" }, "generator" : { "id" : "5485a7050ac61b1339a4da09", "displayName" : "LochemC", "objectType" : "ils", "url" : "http://graasp.eu/spaces/5485a7050ac61b1339a4da09" }, "provider" : { "id" : "5485a7050ac61b1339a4da09", "inquiryPhase" : "ils", "displayName" : "LochemC", "objectType" : "ils", "url" : "http://graasp.eu/spaces/5485a7050ac61b1339a4da09" }, "object" : { "id" : "5485a7050ac61b1339a4da13", "inquiryPhase" : "Conceptualisation", "displayName" : "Conceptualisation", "objectType" : "phase" }, "actor" : { "id" : "Bas Kollöffel (UT)@5485a7050ac61b1339a4da09", "displayName" : "Bas Kollöffel (UT)", "objectType" : "person" }, "published" : "2014-12-08T13:40:46.867Z", "publishedClient" : "2014-12-08T13:40:46.867Z", "publishedServer" : { "$date" : 1418046046952 }, "_id" : { "$oid" : "5485aa5ec372cdbb21daea34" } }
{ "verb" : "access", "target" : { "id" : "5485a7050ac61b1339a4da1e", "inquiryPhase" : "Investigation", "displayName" : "Investigation", "objectType" : "phase" }, "generator" : { "id" : "5485a7050ac61b1339a4da09", "displayName" : "LochemC", "objectType" : "ils", "url" : "http://graasp.eu/spaces/5485a7050ac61b1339a4da09" }, "provider" : { "id" : "5485a7050ac61b1339a4da09", "inquiryPhase" : "ils", "displayName" : "LochemC", "objectType" : "ils", "url" : "http://graasp.eu/spaces/5485a7050ac61b1339a4da09" }, "object" : { "id" : "5485a7050ac61b1339a4da1e", "inquiryPhase" : "Investigation", "displayName" : "Investigation", "objectType" : "phase" }, "actor" : { "id" : "Bas Kollöffel (UT)@5485a7050ac61b1339a4da09", "displayName" : "Bas Kollöffel (UT)", "objectType" : "person" }, "published" : "2014-12-08T13:40:48.582Z", "publishedClient" : "2014-12-08T13:40:48.582Z", "publishedServer" : { "$date" : 1418046048662 }, "_id" : { "$oid" : "5485aa60c372cdbb21daea35" } }

1 answer:

Answer 0 (score: 0)

Don't read the whole file at once. Read it line by line, apply the corrections, and write the output as you go.
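A minimal sketch of that streaming approach, assuming the input holds one JSON document per line, as the snippet in the question suggests. The class name StreamingCorrector and the fix parameter are hypothetical and not part of the asker's code; fix stands for whatever per-line correction is needed.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.function.UnaryOperator;

/** Hypothetical helper: copies text line by line, correcting each line on the way. */
final class StreamingCorrector {

    /**
     * Reads one line at a time, applies the correction, and writes the result
     * immediately, so only the current line is held in memory.
     * Returns the number of lines copied.
     */
    static long correct(Reader in, Writer out, UnaryOperator<String> fix) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        long count = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(fix.apply(line));
            writer.write('\n');
            count++;
        }
        writer.flush(); // the caller owns the underlying streams and closes them
        return count;
    }
}
```

With real files, the reader and writer could come from Files.newBufferedReader and Files.newBufferedWriter, opened in a try-with-resources block so both are closed even if a line fails.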

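Also, it doesn't look like you need to parse and re-create the JSON here. You should be able to do all the processing you need at the raw text level.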
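As a sketch of what a text-level correction could look like: the regular expression below is an assumption, since the question does not show the contents of this.regEx. It assumes the wrappers appear in the form ObjectId("...") or ISODate("...") and keeps only the quoted value. The class name MongoShellFix is hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: rewrites ObjectId("x") / ISODate("x") to plain "x". */
final class MongoShellFix {

    // Assumed form: the wrapper name, an opening parenthesis, a quoted value,
    // and a closing parenthesis. Group 1 captures the value without quotes.
    private static final Pattern WRAPPER =
            Pattern.compile("(?:ObjectId|ISODate)\\(\\s*\"([^\"]*)\"\\s*\\)");

    /** Returns the line with every wrapper replaced by its quoted inner value. */
    static String fixLine(String line) {
        return WRAPPER.matcher(line).replaceAll("\"$1\"");
    }
}
```

The pattern is compiled once and reused for every line, rather than being recompiled per token. Note that a purely textual replacement would also rewrite a matching sequence that happens to occur inside a string value, so it is only safe if the data never contains such text.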

And I also don't think you need the recursion in readPartOfFileAndSave(); you can do everything in an outer loop.
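A sketch of the splitting step as a single loop, again assuming one JSON document per line so that no brace counting is needed. The class name LineSplitter, the linesPerFile parameter, and the output naming scheme (prefix plus a running number) are assumptions; the question does not show how saveFile names its files.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: splits a line-delimited JSON file into smaller files. */
final class LineSplitter {

    /**
     * Writes at most linesPerFile lines to each output file, opening a new
     * file whenever the current one is full. Returns the number of files written.
     */
    static int split(Path input, Path outDir, String prefix, int linesPerFile) throws IOException {
        if (linesPerFile < 1) {
            throw new IllegalArgumentException("linesPerFile must be >= 1");
        }
        int fileCount = 0;
        int linesInCurrent = 0;
        BufferedWriter writer = null;
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines between documents
                }
                if (writer == null) { // start the next output file lazily
                    fileCount++;
                    writer = Files.newBufferedWriter(
                            outDir.resolve(prefix + fileCount + ".json"), StandardCharsets.UTF_8);
                }
                writer.write(line);
                writer.write('\n');
                if (++linesInCurrent == linesPerFile) { // current file is full
                    writer.close();
                    writer = null;
                    linesInCurrent = 0;
                }
            }
        } finally {
            if (writer != null) {
                writer.close();
            }
        }
        return fileCount;
    }
}
```

A call such as LineSplitter.split(Paths.get("export.json"), Paths.get("src/main/resources"), "actor", 1000) would produce actor1.json, actor2.json, and so on, each holding one document per line. Because the loop does not recurse, the stack depth stays constant regardless of how many documents the file contains.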