用于根据节点名称拆分大型JSON的通用代码

时间:2019-01-23 12:14:12

标签: c# arrays json json.net

我有一个非常大的JSON文件,现在下面的car数组可以多达100,000,000条记录。文件的总大小可以从500mb到10 GB不等。我正在使用Newtonsoft json.net

输入

{
"name": "John",
"age": "30",
"cars": [{
    "brand": "ABC",
    "models": ["Alhambra", "Altea", "AlteaXL", "Arosa", "Cordoba", "CordobaVario", "Exeo", "Ibiza", "IbizaST", "ExeoST", "Leon", "LeonST", "Inca", "Mii", "Toledo"],
    "year": "2019",
    "month": "1",
    "day": "1"
}, {
    "brand": "XYZ",
    "models": ["Alhambra", "Altea", "AlteaXL", "Arosa", "Cordoba", "CordobaVario", "Exeo", "Ibiza", "IbizaST", "ExeoST", "Leon", "LeonST", "Inca", "Mii", "Toledo"],
    "year": "2019",
    "month": "10",
    "day": "01"
}],
"TestCity": "TestCityValue",
"TestCity1": "TestCityValue1"}

所需的输出 文件1杰森

   {
    "name": "John",
    "age": "30",
    "cars": {
        "brand": "ABC",
        "models": ["Alhambra", "Altea", "AlteaXL", "Arosa", "Cordoba", "CordobaVario", "Exeo", "Ibiza", "IbizaST", "ExeoST", "Leon", "LeonST", "Inca", "Mii", "Toledo"],
        "year": "2019",
        "month": "1",
        "day": "1"
    },
    "TestCity": "TestCityValue",
    "TestCity1": "TestCityValue1"
}

文件2杰森

{
    "name": "John",
    "age": "30",
    "cars": {
        "brand": "XYZ",
        "models": ["Alhambra", "Altea", "AlteaXL", "Arosa", "Cordoba", "CordobaVario", "Exeo", "Ibiza", "IbizaST", "ExeoST", "Leon", "LeonST", "Inca", "Mii", "Toledo"],
        "year": "2019",
        "month": "10",
        "day": "01"
    },
    "TestCity": "TestCityValue",
    "TestCity1": "TestCityValue1"
}

所以我想出了下面的代码,

 public static void SplitJson(Uri objUri, string splitbyProperty)
    {
        try
        {
            bool readinside = false;
            HttpClient client = new HttpClient();
            using (Stream stream = client.GetStreamAsync(objUri).Result)
            using (StreamReader streamReader = new StreamReader(stream))
            using (JsonTextReader reader = new JsonTextReader(streamReader))
            {
                Node objnode = new Node();
                while (reader.Read())
                {
                    JObject obj = new JObject(reader);


                    if (reader.TokenType == JsonToken.String && reader.Path.ToString().Contains("name") && !reader.Value.ToString().Equals(reader.Path.ToString()))
                    {
                        objnode.name = reader.Value.ToString();
                    }

                    if (reader.TokenType == JsonToken.Integer && reader.Path.ToString().Contains("age") && !reader.Value.ToString().Equals(reader.Path.ToString()))
                    {
                        objnode.age = reader.Value.ToString();

                    }

                    if (reader.Path.ToString().Contains(splitbyProperty) && reader.TokenType == JsonToken.StartArray)
                    {
                        int counter = 0;
                        while (reader.Read())
                        {
                            if (reader.TokenType == JsonToken.StartObject)
                            {
                                counter = counter + 1;
                                var item = JsonSerializer.Create().Deserialize<Car>(reader);
                                objnode.cars = new List<Car>();
                                objnode.cars.Add(item);
                                insertIntoFileSystem(objnode, counter);
                            }

                            if (reader.TokenType == JsonToken.EndArray)
                                break;
                        }
                    }

                }

            }

        }
        catch (Exception)
        {

            throw;
        }
    }
    public static void insertIntoFileSystem(Node objNode, int counter)
    {

        string fileName = @"C:\Temp\output_" + objNode.name + "_" + objNode.age + "_" + counter + ".json";
        var serialiser = new JsonSerializer();
        using (TextWriter tw = new StreamWriter(fileName))
        {
            using (StringWriter textWriter = new StringWriter())
            {
                serialiser.Serialize(textWriter, objNode);
                tw.WriteLine(textWriter);
            }
        }
    }

问题

  1. 如果文件很大,则不捕获数组后的任何字段。有没有一种方法可以跳过或并行处理json中大型数组的读取器。简而言之,我无法使用我的代码捕获以下部分

    “ TestCity”:“ TestCityValue”,     “ TestCity1”:“ TestCityValue1”}

1 个答案:

答案 0 :(得分:3)

您将需要分两步处理大型JSON文件,才能获得所需的结果。

在第一遍中,将文件分为两部分:创建一个仅包含巨大数组的文件,以及一个包含所有其他信息的第二个文件,这些文件将用作您最终想要的各个JSON文件的模板创建。

在第二遍中,将模板文件读入内存(我假设JSON的这一部分相对较小,因此这应该不成问题),然后使用读取器一次处理数组文件一项。对于每一项,请将其与模板结合在一起,然后将其写入单独的文件中。

最后,您可以删除临时阵列和模板文件。

这是代码中的样子:

using System.IO;
using System.Text;
using System.Net.Http;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static void SplitJson(Uri objUri, string arrayPropertyName)
{
    string templateFileName = @"C:\Temp\template.json";
    string arrayFileName = @"C:\Temp\array.json";

    // Split the original JSON stream into two temporary files:
    // one that has the huge array and one that has everything else
    HttpClient client = new HttpClient();
    using (Stream stream = client.GetStreamAsync(objUri).Result)
    using (JsonReader reader = new JsonTextReader(new StreamReader(inputStream)))
    using (JsonWriter templateWriter = new JsonTextWriter(new StreamWriter(templateFileName)))
    using (JsonWriter arrayWriter = new JsonTextWriter(new StreamWriter(arrayFileName)))
    {
        if (reader.Read() && reader.TokenType == JsonToken.StartObject)
        {
            templateWriter.WriteStartObject();
            while (reader.Read() && reader.TokenType != JsonToken.EndObject)
            {
                string propertyName = (string)reader.Value;
                reader.Read();
                templateWriter.WritePropertyName(propertyName);
                if (propertyName == arrayPropertyName)
                {
                    arrayWriter.WriteToken(reader);
                    templateWriter.WriteStartObject();  // empty placeholder object
                    templateWriter.WriteEndObject();
                }
                else if (reader.TokenType == JsonToken.StartObject ||
                         reader.TokenType == JsonToken.StartArray)
                {
                    templateWriter.WriteToken(reader);
                }
                else
                {
                    templateWriter.WriteValue(reader.Value);
                }
            }
            templateWriter.WriteEndObject();
        }
    }

    // Now read the huge array file and combine each item in the array
    // with the template to make new files
    JObject template = JObject.Parse(File.ReadAllText(templateFileName));
    using (JsonReader arrayReader = new JsonTextReader(new StreamReader(arrayFileName)))
    {
        int counter = 0;
        while (arrayReader.Read())
        {
            if (arrayReader.TokenType == JsonToken.StartObject)
            {
                counter++;
                JObject item = JObject.Load(arrayReader);
                template[arrayPropertyName] = item;
                string fileName = string.Format(@"C:\Temp\output_{0}_{1}_{2}.json",
                                                template["name"], template["age"], counter);

                File.WriteAllText(fileName, template.ToString());
            }
        }
    }

    // Clean up temporary files
    File.Delete(templateFileName);
    File.Delete(arrayFileName);
}

请注意,由于存在临时文件,上述方法在处理期间将需要将原始JSON的磁盘空间增加一倍。如果存在问题,则可以修改代码以改为两次下载文件(尽管这可能会增加处理时间)。在第一次下载中,创建模板JSON并忽略该数组;在第二次下载中,前进到数组并像以前一样使用模板对其进行处理,以创建输出文件。