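How to filter objects in a JSON array while reading them from a large file

Time: 2019-02-07 18:00:15

Tags: c# json.net

Using JSON.NET, I am reading JSON objects in an array from a large file. As each JSON object is read, it is conditionally converted to a destination class and returned as an item in an IEnumerable.

I use an IEnumerable so that I can "pull" objects from the file and process them as they are read, which avoids reading all of the objects into memory.

I use a similar technique when reading rows from a CSV file, where CsvHelper's ShouldSkipRecord() conditionally processes the rows in the file.

I have not found a way to filter the JSON objects as they are read from the array, so I ended up using a LINQ Where to filter the objects before they are converted and added to the IEnumerable. The problem is that the Where clause reads all of the objects into memory, which defeats the purpose of using an IEnumerable.

I know I could read each object by hand and then process it, but I am looking for a more elegant approach: some form of callback that lets me pull the records I want and filter out the ones I do not.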

For example, this is how I filter rows in the CSV file:

internal static bool ShouldSkipRecord(string[] fields)
{
    // Skip rows with incomplete data
    // 2019-01-24 20:46:57 UTC,63165,4.43,6.23,6.80,189,-18,81.00,16.00,6.23
    // 2019-01-24 20:47:40 UTC,63166,4.93,5.73,5.73,0,-20,,,5.73
    if (fields.Length < 10)
        return true;

    // Temperature and humidity is optional, air quality is required
    if (string.IsNullOrEmpty(fields[9]))
        return true;

    return false;
}

For example, this is how I filter the JSON objects:

internal static PurpleAirData Convert(Feed jsonData)
{
    PurpleAirData data = new PurpleAirData()
    {
        TimeStamp = jsonData.CreatedAt.DateTime,
        AirQuality = Double.Parse(jsonData.Field8)
    };

    // Temperature and humidity is optional
    if (double.TryParse(jsonData.Field6, out double val))
        data.Temperature = val;
    if (double.TryParse(jsonData.Field7, out val))
        data.Humidity = val;

    return data;
}

internal static IEnumerable<PurpleAirData> Load(JsonTextReader jsonReader)
{
    // Deserialize objects in parts
    jsonReader.SupportMultipleContent = true;
    JsonSerializer serializer = new JsonSerializer();

    // Read Channel
    // TODO : Add format checking
    jsonReader.Read();
    jsonReader.Read();
    jsonReader.Read();
    Channel channel = serializer.Deserialize<Channel>(jsonReader);

    // Read the Feeds
    jsonReader.Read();
    jsonReader.Read();
    // TODO : The Where results in a full in-memory iteration defeating the purpose of the streaming iteration
    return serializer.Deserialize<List<Feed>>(jsonReader).Where(feed => !string.IsNullOrEmpty(feed.Field8)).Select(Convert);
}

Sample JSON:

{
   "channel":{
      "id":622370,
      "name":"AirMonitor_e81a",
      "latitude":"0.0",
      "longitude":"0.0",
      "field1":"PM1.0 (ATM)",
      "field2":"PM2.5 (ATM)",
      "field3":"PM10.0 (ATM)",
      "field4":"Uptime",
      "field5":"RSSI",
      "field6":"Temperature",
      "field7":"Humidity",
      "field8":"PM2.5 (CF=1)",
      "created_at":"2018-11-09T00:35:34Z",
      "updated_at":"2018-11-09T00:35:35Z",
      "last_entry_id":65435
   },
   "feeds":[
      {
         "created_at":"2019-01-10T23:56:09Z",
         "entry_id":56401,
         "field1":"1.00",
         "field2":"1.80",
         "field3":"1.80",
         "field4":"369",
         "field5":"-30",
         "field6":"66.00",
         "field7":"59.00",
         "field8":"1.80"
      },
      {
         "created_at":"2019-01-10T23:57:29Z",
         "entry_id":56402,
         "field1":"1.08",
         "field2":"2.44",
         "field3":"3.33",
         "field4":"371",
         "field5":"-32",
         "field6":"66.00",
         "field7":"59.00",
         "field8":"2.44"
      },
      {
         "created_at":"2019-01-26T00:14:04Z",
         "entry_id":64400,
         "field1":"0.27",
         "field2":"0.95",
         "field3":"1.25",
         "field4":"213",
         "field5":"-27",
         "field6":"72.00",
         "field7":"40.00",
         "field8":"0.95"
      }
   ]
}

A second sample JSON, where the root container is an array:

[
{
    "monthlyrainin": 0.01,
    "humidityin": 42,
    "eventrainin": 0,
    "humidity": 29,
    "maxdailygust": 20.13,
    "dateutc": 1549476900000,
    "battout": "1",
    "lastRain": "2019-02-05T19:21:00.000Z",
    "dailyrainin": 0,
    "tempf": 52.2,
    "winddir": 286,
    "totalrainin": 0.01,
    "dewPoint": 20.92,
    "baromabsin": 29.95,
    "hourlyrainin": 0,
    "feelsLike": 52.2,
    "yearlyrainin": 0.01,
    "uv": 1,
    "weeklyrainin": 0.01,
    "solarradiation": 157.72,
    "windspeedmph": 0,
    "tempinf": 73.8,
    "windgustmph": 0,
    "battin": "1",
    "baromrelin": 30.12,
    "date": "2019-02-06T18:15:00.000Z"
},
{
    "dewPoint": 20.92,
    "tempf": 52.2,
    "maxdailygust": 20.13,
    "humidityin": 42,
    "windspeedmph": 4.03,
    "eventrainin": 0,
    "tempinf": 73.6,
    "feelsLike": 52.2,
    "dateutc": 1549476600000,
    "windgustmph": 4.92,
    "hourlyrainin": 0,
    "monthlyrainin": 0.01,
    "battin": "1",
    "humidity": 29,
    "totalrainin": 0.01,
    "baromrelin": 30.12,
    "winddir": 314,
    "lastRain": "2019-02-05T19:21:00.000Z",
    "yearlyrainin": 0.01,
    "baromabsin": 29.94,
    "dailyrainin": 0,
    "battout": "1",
    "uv": 1,
    "solarradiation": 151.86,
    "weeklyrainin": 0.01,
    "date": "2019-02-06T18:10:00.000Z"
}]

Is there a way in JSON.NET to filter objects as they are being read?

1 answer:

Answer 0: (score: 1)

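What you can do is adopt the basic approach of "Issues parsing a 1GB json file using JSON.NET" and "Deserialize json array stream one item at a time", which is to stream through the array and yield return each item. In addition, apply a where expression to filter out incomplete items, or a select clause to transform some intermediate deserialized object (such as a JObject or a DTO) into your final data model. Because the where clause is applied during streaming, unwanted objects are never added to the list being deserialized, so they become eligible for garbage collection while streaming is still in progress. Filtering array contents while streaming can be done at the root level, when the root JSON container is an array, or inside a custom JsonConverter for List<T>, when the array to be deserialized is nested within some outer JSON.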
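It may help to see that LINQ's Where is itself lazy: it pulls one item at a time from its source. The memory cost in the question comes from Deserialize<List<Feed>>, which builds the whole list before Where ever runs. The following is a minimal sketch using only the base class library; LazyWhereDemo, Produce and Run are illustrative names, not part of JSON.NET.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class LazyWhereDemo
{
    // Yields 1..4 and records each value at the moment it is produced.
    static IEnumerable<int> Produce(List<string> log)
    {
        for (int i = 1; i <= 4; i++)
        {
            log.Add("produce " + i);
            yield return i;
        }
    }

    // Filters the lazy source and records each value at the moment it is consumed.
    public static List<string> Run()
    {
        var log = new List<string>();
        foreach (int even in Produce(log).Where(n => n % 2 == 0))
            log.Add("consume " + even);
        return log;
    }
}
```

The returned log interleaves the two kinds of entries (produce 1, produce 2, consume 2, produce 3, produce 4, consume 4), which shows that Where never buffers its source. As long as the source yields items while reading, filtering stays streaming.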

As a concrete example, consider your first JSON sample. You would like to deserialize it to a data model that looks like this:

public class PurpleAirData
{
    public PurpleAirData(DateTime createdAt, double airQuality)
    {
        this.CreatedAt = createdAt;
        this.AirQuality = airQuality;
    }
    // Required properties
    public DateTime CreatedAt { get; set; }
    public double AirQuality { get; set; }

    // Optional properties, thus nullable
    public double? Temperature { get; set; }
    public double? Humidity { get; set; }
}

public class RootObject
{
    public Channel channel { get; set; } // Define this using http://json2csharp.com/
    public List<PurpleAirData> feeds { get; set; }
}

To do this, first introduce the following extension methods:

using System.Collections.Generic;
using Newtonsoft.Json;

public static partial class JsonExtensions
{
    public static IEnumerable<T> DeserializeArrayItems<T>(this JsonSerializer serializer, JsonReader reader)
    {
        if (reader.MoveToContent().TokenType == JsonToken.Null)
            yield break;
        if (reader.TokenType != JsonToken.StartArray)
            throw new JsonSerializationException(string.Format("Current token {0} is not an array at path {1}", reader.TokenType, reader.Path));
        // Process the collection items
        while (reader.Read())
        {
            switch (reader.TokenType)
            {
                case JsonToken.EndArray:
                    yield break;

                case JsonToken.Comment:
                    break;

                default:
                    yield return serializer.Deserialize<T>(reader);
                    break;
            }
        }
        // Should not come here.
        throw new JsonReaderException(string.Format("Unclosed array at path {0}", reader.Path));
    }

    public static JsonReader MoveToContent(this JsonReader reader)
    {
        if (reader.TokenType == JsonToken.None)
            reader.Read();
        while (reader.TokenType == JsonToken.Comment && reader.Read())
            ;
        return reader;
    }
}

Next, introduce the following JsonConverter for List<PurpleAirData>:

class PurpleAirListConverter : JsonConverter
{
    class PurpleAirDataDTO
    {
        // Required properties
        [JsonProperty("created_at")]
        public DateTime? CreatedAt { get; set; }
        [JsonProperty("field8")]
        public double? AirQuality { get; set; }

        // Optional properties
        [JsonProperty("field6")]
        public double? Temperature { get; set; }
        [JsonProperty("field7")]
        public double? Humidity { get; set; }
    }

    public override bool CanConvert(Type objectType)
    {
        return objectType == typeof(List<PurpleAirData>);
    }

    public override object ReadJson(JsonReader reader, Type objectType, object existingValue, JsonSerializer serializer)
    {
        if (reader.MoveToContent().TokenType == JsonToken.Null)
            return null;
        var list = existingValue as List<PurpleAirData> ?? new List<PurpleAirData>();

        var query = from dto in serializer.DeserializeArrayItems<PurpleAirDataDTO>(reader)
                    where dto != null && dto.CreatedAt != null && dto.AirQuality != null
                    select new PurpleAirData(dto.CreatedAt.Value, dto.AirQuality.Value) { Humidity = dto.Humidity, Temperature = dto.Temperature };

        list.AddRange(query);

        return list;
    }

    public override void WriteJson(JsonWriter writer, object value, JsonSerializer serializer)
    {
        throw new NotImplementedException();
    }
}

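The purpose of this converter is to stream through the "feeds" array, deserialize each JSON item to an intermediate PurpleAirDataDTO, check that the required members are present, and then convert the DTO to the final model.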
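A JObject can serve as the intermediate object instead of a DTO, as mentioned at the start of this answer. The sketch below is one possible way to do that, not part of the converter above: JObjectFeedFilter and StreamValidFeeds are hypothetical names, and the rule "field8 must parse as a number" is an assumption based on the question's filter. It loads one array item at a time and only yields the items that pass.

```csharp
using System.Collections.Generic;
using System.Globalization;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class JObjectFeedFilter
{
    // Streams the array the reader is positioned on, one JObject at a time,
    // and yields only items whose "field8" parses as a number (assumed rule).
    public static IEnumerable<JObject> StreamValidFeeds(JsonReader reader)
    {
        // Advance to the first token if the reader has not been started yet.
        if (reader.TokenType == JsonToken.None)
            reader.Read();
        if (reader.TokenType != JsonToken.StartArray)
            throw new JsonSerializationException("Expected an array at path " + reader.Path);

        while (reader.Read() && reader.TokenType != JsonToken.EndArray)
        {
            if (reader.TokenType != JsonToken.StartObject)
            {
                reader.Skip(); // ignore comments, primitives and nested arrays
                continue;
            }

            // Only this one item is held in memory; rejected items are simply dropped.
            JObject item = JObject.Load(reader);
            string airQuality = (string)item["field8"];
            if (double.TryParse(airQuality, NumberStyles.Float, CultureInfo.InvariantCulture, out _))
                yield return item;
        }
    }
}
```

A JObject avoids declaring a DTO class, at the cost of looking members up by string key. For a fixed, known schema the DTO approach above is probably easier to maintain.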

Finally, deserialize the entire file as follows:

static RootObject DeserializePurpleAirDataFile(TextReader textReader)
{
    var settings = new JsonSerializerSettings
    {
        Converters = { new PurpleAirListConverter() },
        NullValueHandling = NullValueHandling.Ignore,
    };
    var serializer = JsonSerializer.CreateDefault(settings);
    using (var reader = new JsonTextReader(textReader) { CloseInput = false })
    {
        return serializer.Deserialize<RootObject>(reader);
    }
}

Demo fiddle here.

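When the array to be filtered is the root container in the JSON file, the extension method JsonExtensions.DeserializeArrayItems() can be used directly, for example as follows: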

static bool IsValid(WeatherData data)
{
    // Return false if certain fields are missing

    // Otherwise return true;
    return true;
}

static List<WeatherData> DeserializeFilteredWeatherData(TextReader textReader)
{
    var serializer = JsonSerializer.CreateDefault();
    using (var reader = new JsonTextReader(textReader) { CloseInput = false })
    {
        var query = from data in serializer.DeserializeArrayItems<WeatherData>(reader)
                    where IsValid(data)
                    select data;

        return query.ToList();
    }
}
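If even the filtered results are too large to hold in a List, the same loop can be written as an iterator so that the caller pulls one item at a time, which is closer to what the question asks for. This is a sketch under stated assumptions: the WeatherData members shown here, LazyWeatherReader and StreamFilteredWeatherData are hypothetical, and the read loop is inlined so the block stands on its own.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

// Hypothetical model covering two members of the second sample JSON.
public class WeatherData
{
    [JsonProperty("tempf")]
    public double? TempF { get; set; }
    [JsonProperty("humidity")]
    public double? Humidity { get; set; }
}

public static class LazyWeatherReader
{
    // Lazily yields the items of a root JSON array that satisfy isValid.
    // Nothing is read until the caller starts enumerating.
    public static IEnumerable<WeatherData> StreamFilteredWeatherData(
        TextReader textReader, Func<WeatherData, bool> isValid)
    {
        var serializer = JsonSerializer.CreateDefault();
        using (var reader = new JsonTextReader(textReader) { CloseInput = false })
        {
            if (!reader.Read() || reader.TokenType != JsonToken.StartArray)
                throw new JsonSerializationException("Expected a root JSON array");

            while (reader.Read() && reader.TokenType != JsonToken.EndArray)
            {
                if (reader.TokenType == JsonToken.Comment)
                    continue;

                // Deserialize exactly one array item, then decide whether to keep it.
                var item = serializer.Deserialize<WeatherData>(reader);
                if (item != null && isValid(item))
                    yield return item;
            }
        }
    }
}
```

Because this is an iterator, the underlying TextReader has to stay open until enumeration finishes, and the JsonTextReader is disposed when the caller's foreach completes or exits early.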

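Notes:

  • Nullable types can be used to track whether a value-type member was actually encountered during deserialization.

  • Here the conversion from the DTO to the final data model is done by hand, but for more complicated models an object-to-object mapping library could be used instead.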
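The first note can be illustrated with a small sketch. ReadingDto and NullableDemo are hypothetical names introduced only for this example. The nullable member stays null when the key is absent, whereas the non-nullable member silently falls back to 0, which cannot be told apart from a real reading of zero.

```csharp
using Newtonsoft.Json;

public class ReadingDto
{
    // Nullable: null means "field8" never appeared in the JSON.
    [JsonProperty("field8")]
    public double? AirQuality { get; set; }

    // Non-nullable: a missing "field6" leaves the default value 0.
    [JsonProperty("field6")]
    public double Temperature { get; set; }
}

public static class NullableDemo
{
    public static ReadingDto Parse(string json)
    {
        return JsonConvert.DeserializeObject<ReadingDto>(json);
    }
}
```

For instance, NullableDemo.Parse("{\"field6\":66.0}").AirQuality is null, so a where clause such as dto.AirQuality != null can reject the item.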