使用Apache Drill搜索Firebase JSON

时间:2015-09-03 11:47:29

标签: json firebase apache-drill

我从https://domain.firebaseio.com/users/

导出了部分数据

{
  "3": {
    "company": "",
    "d_year": "",
    "email": "mario.giambanco@domain.com",
    "facebook": "",
    "fullname": "Mario Test",
    "google": "",
    "igoto": "",
    "image": "",
    "notifications": {
      "-Jx6fpaJHvKPHc8CylPd": {
        "from": "System",
        "image": "/img/system_icon.jpg",
        "msg": "System:",
        "param": "3",
        "posteddate": 1440016723546,
        "type": "system"
      }
    },
    "school": "",
    "school_year": "",
    "tags": {
      "-JxWuEPs183UEwsI-XNb": {
        "title": "Anesthesia"
      },
      "-JxWuZ-ePcx0XqYRmzc6": {
        "title": "Bridges"
      }
    },
    "twitter": ""
  },
  "4": {
    "company": "",
    "d_year": "",
    "email": "mariogiambanco@domain.com",
    "fullname": "mario test",
    "igoto": "",
    "image": "img/a0.jpg",
    "notifications": {
      "-JxAQpWGzY-gOzej7Xis": {
        "from": "System",
        "image": "/img/system_icon.jpg",
        "msg": "System:",
        "param": "4",
        "posteddate": 1440079641420,
        "type": "system"
      }
    },
    "school": "",
    "school_year": ""
  }
}

执行: SELECT * FROM dfs。/Users/me/Desktop/users.json

工作(或者,至少我得到一个结果) enter image description here

但是如何将列映射为行中的值。从关系数据库世界来看,屏幕截图中的列标题是唯一ID(3,4) - 它们应该是行的一部分,而不是列标题。使用push({})

生成的唯一生成密钥也是如此

目标当然是选择Where(选择*来自fullname =“Mario Test”的数据),例如

在使用Drill搜索之前,我应该对JSON进行某种预处理吗?

2 个答案:

答案 0 :(得分:2)

可能有另一种方法可以做到这一点,但我会说是的,您可能需要稍微转换数据,以便使用Drill进行查询。

这看起来像你想要使用KVGEN的情况。 KVGEN会给你一些Chris Matta所描述的专栏,但是KVGEN在一个专栏上运行,在这种情况下,实际上没有专栏可供使用:

0: jdbc:drill:zk=local> select t.* from dfs.`/Users/vince/data/stackoverflow/users.json` t;
+---+---+
| 3 | 4 |
+---+---+
| {"company":"","d_year":"","email":"mario.giambanco@domain.com","facebook":"","fullname":"Mario Test","google":"","igoto":"","image":"","notifications":{"-Jx6fpaJHvKPHc8CylPd":{"from":"System","image":"/img/system_icon.jpg","msg":"System:","param":"3","posteddate":1440016723546,"type":"system"}},"school":"","school_year":"","tags":{"-JxWuEPs183UEwsI-XNb":{"title":"Anesthesia"},"-JxWuZ-ePcx0XqYRmzc6":{"title":"Bridges"}},"twitter":""} | {"company":"","d_year":"","email":"mariogiambanco@domain.com","fullname":"mario test","igoto":"","image":"img/a0.jpg","notifications":{"-JxAQpWGzY-gOzej7Xis":{"from":"System","image":"/img/system_icon.jpg","msg":"System:","param":"4","posteddate":1440079641420,"type":"system"}},"school":"","school_year":""} |
+---+---+
1 row selected (0.133 seconds)

由于这些列是JSON对象的“顶级”的动态AND,因此您无法在此处使用KVGEN。但是如果你只是稍微改变数据,你可以使用KVGEN。我使用了这个最优秀的工具jq调用按摩数据到KVGEN可以使用的格式:

$ jq '.| { "user": . }' < users.json > users_kv.json

这将获取输入,并将JSON对象包装在另一个映射中,这将为我们提供执行以下操作所需的“静态”列:

0: jdbc:drill:zk=local> select kvgen(t.`user`) from dfs.`/Users/vince/data/stackoverflow/users_kv.json` t;
+--------+
| EXPR$0 |
+--------+
| [{"key":"3","value":{"company":"","d_year":"","email":"mario.giambanco@domain.com","facebook":"","fullname":"Mario Test","google":"","igoto":"","image":"","notifications":{"-Jx6fpaJHvKPHc8CylPd":{"from":"System","image":"/img/system_icon.jpg","msg":"System:","param":"3","posteddate":1440016723546,"type":"system"},"-JxAQpWGzY-gOzej7Xis":{}},"school":"","school_year":"","tags":{"-JxWuEPs183UEwsI-XNb":{"title":"Anesthesia"},"-JxWuZ-ePcx0XqYRmzc6":{"title":"Bridges"}},"twitter":""}},{"key":"4","value":{"company":"","d_year":"","email":"mariogiambanco@domain.com","fullname":"mario test","igoto":"","image":"img/a0.jpg","notifications":{"-Jx6fpaJHvKPHc8CylPd":{},"-JxAQpWGzY-gOzej7Xis":{"from":"System","image":"/img/system_icon.jpg","msg":"System:","param":"4","posteddate":1440079641420,"type":"system"}},"school":"","school_year":"","tags":{"-JxWuEPs183UEwsI-XNb":{},"-JxWuZ-ePcx0XqYRmzc6":{}}}}] |
+--------+
1 row selected (1.774 seconds)

由于我在列中有列表,因此仍然无法以您想要的方式查询。所以使用FLATTEN:

0: jdbc:drill:zk=local> select flatten(kvgen(t.`user`)) as `user` from dfs.`/Users/vince/data/stackoverflow/users_kv.json` t;
+------+
| user |
+------+
| {"key":"3","value":{"company":"","d_year":"","email":"mario.giambanco@domain.com","facebook":"","fullname":"Mario Test","google":"","igoto":"","image":"","notifications":{"-Jx6fpaJHvKPHc8CylPd":{"from":"System","image":"/img/system_icon.jpg","msg":"System:","param":"3","posteddate":1440016723546,"type":"system"},"-JxAQpWGzY-gOzej7Xis":{}},"school":"","school_year":"","tags":{"-JxWuEPs183UEwsI-XNb":{"title":"Anesthesia"},"-JxWuZ-ePcx0XqYRmzc6":{"title":"Bridges"}},"twitter":""}} |
| {"key":"4","value":{"company":"","d_year":"","email":"mariogiambanco@domain.com","fullname":"mario test","igoto":"","image":"img/a0.jpg","notifications":{"-Jx6fpaJHvKPHc8CylPd":{},"-JxAQpWGzY-gOzej7Xis":{"from":"System","image":"/img/system_icon.jpg","msg":"System:","param":"4","posteddate":1440079641420,"type":"system"}},"school":"","school_year":"","tags":{"-JxWuEPs183UEwsI-XNb":{},"-JxWuZ-ePcx0XqYRmzc6":{}}}} |
+------+
2 rows selected (0.257 seconds)

两行 - 好多了。现在你已经准备好一直做你想做的事了(注意子查询以及保留字用户和值的反引号:

    0: jdbc:drill:zk=local> select u.`user`.`key` as userid, u.`user`.`value`.fullname as fullname, u.`user`.`value`.email as email from (select flatten(kvgen(t.`user`)) as `user` from dfs.`/Users/vince/data/stackoverflow/users_kv.json` t) u where u.`user`.`value`.fullname = 'Mario Test';
+---------+-------------+-----------------------------+
| userid  |  fullname   |            email            |
+---------+-------------+-----------------------------+
| 3       | Mario Test  | mario.giambanco@domain.com  |
+---------+-------------+-----------------------------+
1 row selected (0.22 seconds)

答案 1 :(得分:1)

如果实际上键“3”和“4”是id,则键不应该保持值。为Drill格式化这个JSON的更好方法是使用这些值的实际键(还注意每个文件可以有多个记录,Drill可以解析它们):

{ "id": 3,
  "data": {
       ...
  }
}
{ "id": 4,
   "data": {
        ...
   }
}

这样你可以做这样的查询:

> select t.`id`, t.`data`.`fullname` as `fullname` from `firebase.json` t;
+-----+-------------+
| id  |  fullname   |
+-----+-------------+
| 3   | Mario Test  |
| 4   | mario test  |
+-----+-------------+
2 rows selected (0.269 seconds)