How do I extract values from a JSON file? Is regex a solution?

Time: 2018-02-15 22:04:44

Tags: json file parsing

I have a very large file with entries like the following:

{
  "_id": {
    "$oid": "572a5b93ae5174d3c4177da3"
  },
  "email": "removed@gmail.com",
  "gender": "F",
  "zip": "32934",
  "state": "FL",
  "city": "EAU GALLIE",
  "address1": "removed",
  "last_name": "removed",
  "first_name": "removed",
  "updatedAt": {
    "$date": "2016-05-04T20:29:02.061Z"
  },
  "__v": 0,
  "createdAt": {
    "$date": "2016-05-04T20:28:54.948Z"
  }
}
{
  "_id": {
    "$oid": "57a49bed913aebc7257145b9"
  },
  "email": "removed@gmail.com",
  "dob": "11/06/1996",
  "gender": "F",
  "zip": "SN14 8BZ",
  "address1": "removed",
  "last_name": "removed",
  "first_name": "removed",
  "updatedAt": {
    "$date": "2016-08-16T23:53:30.161Z"
  },
  "__v": 0,
  "createdAt": {
    "$date": "2016-08-05T14:00:13.130Z"
  }
}
{
  "_id": {
    "$oid": "57a49bed913aebc7257145d3"
  },
  "email": "removed@netzero.net",
  "zip": "NULL",
  "state": "NULL",
  "city": "NULL",
  "address1": "NULL",
  "last_name": "removed",
  "first_name": "removed",
  "updatedAt": {
    "$date": "2016-08-05T14:00:13.467Z"
  },
  "__v": 0,
  "createdAt": {
    "$date": "2016-08-05T14:00:13.467Z"
  }
}
{
  "_id": {
    "$oid": "57ab71379f7474b50eef976d"
  },
  "updatedAt": {
    "$date": "2016-08-16T23:40:55.851Z"
  },
  "createdAt": {
    "$date": "2016-08-10T18:23:51.177Z"
  },
  "email": "removed@hotmail.co.uk",
  "ip": "0.0.0.0",
  "first_name": "removed",
  "last_name": "removed",
  "address1": "removed",
  "city": "",
  "state": "",
  "zip": "removed",
  "gender": "F",
  "__v": 0,
  "dob": "03/01/1973"
}
{
  "_id": {
    "$oid": "57ab7137913aebc725194a20"
  },
  "email": "removed@gmail.com",
  "job": "DeliveryDriver",
  "zip": "24401",
  "state": "VA",
  "city": "FISHERSVILLE",
  "updatedAt": {
    "$date": "2016-09-16T12:45:50.984Z"
  },
  "__v": 0,
  "createdAt": {
    "$date": "2016-08-10T18:23:50.813Z"
  },
  "gender": "M",
  "last_name": "removed",
  "first_name": "removed"
}

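The fields are in no particular order, and I have obviously removed the names, addresses, IPs, and emails for privacy. The lines are all over the place, though, and there are more than 20M of them.

How do I parse this properly? I want to extract only the email, IP, phone number, name (first and last), and address (zip, address1, address2, city).

Some entries have only an email and IP; some have email, IP, and name; others have email, name, address, and so on, and some have all of these. (They all carry some junk data as well, such as the OID, the created and updated dates, gender, etc.)

What is the best way to parse this? I have been trying for a while, and I know it has been done before. Thanks!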

1 answer:

Answer 0: (score: 0)

Don't try to parse this with a regular expression; try jq instead.

It's cross-platform.

Example; adjust the command to your needs:

$ jq '(.email, .first_name, .last_name)' file.json

Output:

"removed@gmail.com"
"removed"
"removed"
"removed@gmail.com"
"removed"
"removed"
"removed@netzero.net"
"removed"
"removed"
"removed@hotmail.co.uk"
"removed"
"removed"
"removed@gmail.com"
"removed"
"removed"

See https://stedolan.github.io/jq/

Or you can do it in code.
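
As a rough sketch of the "in code" route (not something the answer above spells out), here is one way it could look in Python using only the standard library. The file in the question is a stream of pretty-printed objects placed back to back rather than a single JSON array, so json.load would fail on it; the sketch instead reads the file in chunks and pulls out one object at a time with json.JSONDecoder.raw_decode. The names iter_objects, extract, and WANTED are made up for this example, and the keys "phone" and "address2" are guesses based on the question's wording, since neither appears in the sample records.

```python
import json
from typing import Iterator, TextIO

# Keys to keep. "phone" and "address2" are assumed names: the question
# mentions them, but the sample records do not show them.
WANTED = ("email", "ip", "phone", "first_name", "last_name",
          "zip", "address1", "address2", "city")


def iter_objects(stream: TextIO, chunk_size: int = 1 << 16) -> Iterator[dict]:
    """Yield each top-level JSON object from a stream of concatenated objects.

    Assumes every top-level value is an object ({...}), as in the question.
    """
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        buffer += chunk
        pos = 0
        while True:
            # raw_decode does not skip leading whitespace, so do it here.
            while pos < len(buffer) and buffer[pos].isspace():
                pos += 1
            if pos >= len(buffer):
                break
            try:
                obj, end = decoder.raw_decode(buffer, pos)
            except json.JSONDecodeError:
                if not chunk:
                    raise  # end of input and still not parseable
                break      # object is incomplete; read more data
            yield obj
            pos = end
        buffer = buffer[pos:]  # keep only the unparsed tail
        if not chunk:
            break


def extract(record: dict) -> dict:
    """Keep only the wanted keys that are actually present in the record."""
    return {key: record[key] for key in WANTED if key in record}


# Example use (file name taken from the jq command above):
#
#     with open("file.json", encoding="utf-8") as f:
#         for record in iter_objects(f):
#             print(json.dumps(extract(record)))
```

Unlike the jq command above, which prints one value per line, this keeps each record's fields together and simply leaves out keys a record does not have. Placeholder values such as "NULL" or "" are passed through untouched; filtering them would be an extra step.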