How to filter jsonb using multiple criteria?

Asked: 2015-06-19 02:04:26

Tags: postgresql postgresql-9.4 jsonb

I have the following table structure:

CREATE TABLE mytable (
  id   serial PRIMARY KEY,
  data jsonb
);

And the following data (partial, for brevity... note that the years are random and that sales and expenses do not line up with each other):

INSERT INTO mytable (data)
VALUES
('{"employee": "Jim Romo", 
 "sales": [{"value": 10, "yr": "2012"}, {"value": 5, "yr": "2013"}, {"value": 40, "yr": "2014"}],
 "expenses": [{"value": 2, "yr": "2007"}, {"value": 1, "yr": "2013"}, {"value": 3, "yr": "2014"}], 
 "product": "tv", "customer": "1", "updated": "20150501"
}'),
('{"employee": "Jim Romo", 
 "sales": [{"value": 10, "yr": "2012"}, {"value": 5, "yr": "2013"}, {"value": 41, "yr": "2014"}],
 "expenses": [{"value": 2, "yr": "2009"}, {"value": 3, "yr": "2013"}, {"value": 3, "yr": "2014"}], 
 "product": "tv", "customer": "2", "updated": "20150312"
}'),
('{"employee": "Jim Romo", 
 "sales": [{"value": 20, "yr": "2012"}, {"value": 25, "yr": "2013"}, {"value": 33, "yr": "2014"}],
 "expenses": [{"value": 8, "yr": "2012"}, {"value": 12, "yr": "2014"}, {"value": 5, "yr": "2009"}], 
 "product": "radio", "customer": "2", "updated": "20150311"
}'),
('{"employee": "Bill Baker", 
 "sales": [{"value": 1, "yr": "2010"}, {"value": 2, "yr": "2009"}, {"value": 3, "yr": "2014"}],
 "expenses": [{"value": 3, "yr": "2011"}, {"value": 1, "yr": "2012"}, {"value": 7, "yr": "2013"}], 
 "product": "tv", "customer": "1", "updated": "20150205"
}'),
('{"employee": "Bill Baker", 
 "sales": [{"value": 10, "yr": "2010"}, {"value": 12, "yr": "2011"}, {"value": 3, "yr": "2014"}],
 "expenses": [{"value": 4, "yr": "2011"}, {"value": 7, "yr": "2009"}, {"value": 4, "yr": "2013"}], 
 "product": "radio", "customer": "1", "updated": "20150204"
}'),
('{"employee": "Jim Romo",
 "sales": [{"value": 22, "yr": "2009"}, {"value": 17, "yr": "2013"}, {"value": 35, "yr": "2014"}],
 "expenses": [{"value": 14, "yr": "2011"}, {"value": 13, "yr": "2014"}, {"value": 8, "yr": "2013"}], 
 "product": "tv", "customer": "3", "updated": "20150118"
}');

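For each employee, I need to evaluate the most recently updated row and find the employees whose 2014 tv sales are greater than 30. From there I need to filter further for employees whose average tv expenses are below 5. For the average I just need to take all of their tv expenses, not only those in the most recent row.

My expected output is 1 row: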

employee    | customer | 2014 tv sales   |  2013 avg tv expenses
------------+----------+-----------------+----------------------
Jim Romo    |    1     |   40            |  4

I can (sort of) do one or the other, but not both:

a. Get 2014 sales > 30 (but I can't get their most recent "tv" sales ;(

SELECT * FROM mytable WHERE (SELECT (a->>'value')::float FROM
    (SELECT jsonb_array_elements(data->'sales') as a) as b 
    WHERE a @> json_object(ARRAY['yr', '2014'])::jsonb) > 30

b. Get the average 2013 expenses (this needs to be the average of the tv expenses)

SELECT avg((a->>'value')::numeric) FROM  
  (SELECT jsonb_array_elements(data->'expenses') as a FROM mytable) as b
  WHERE a @> json_object(ARRAY['yr', '2013'])::jsonb

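Edit: This could be a very large table, so any comments on performance and indexing needs would be appreciated, as I am new to postgresql and jsonb.

Edit #2: I have tried both answers, and neither seems efficient for a large table ;(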

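2 answers:

Answer 0 (score: 1)

This is a (rather verbose) answer to your question. The comments in the query should explain the different parts. The basic ideas I followed are: 1) keep each operation simple, trying first to produce the correct result and optimizing afterwards; 2) transform the json structure, as much as possible (but not too much), into a more "relational-like" structure, since relations have more powerful operators than json data in postgres. Of course there is room to simplify the query and perhaps to produce a more efficient version, but at least this is a starting point.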

with mytable1 as   -- transform the table in a more "relational-like" structure (just for clarity)
  (select id, data->>'employee' as employee, data->>'product' as product, 
      (data->>'updated')::integer as updated, (data->>'customer')::integer as customer,
          data->'sales' as sales, data->'expenses' as expenses 
   from mytable),
avg_exp_for_2013_tv as -- find the average expenses for tv in 2013 for each employee
   (select employee, avg(expenses.value) as avg2013_expenses
    from mytable1 , jsonb_to_recordset(expenses) as expenses(yr text, value float)
    where product = 'tv' and expenses.yr = '2013'
    group by employee),
most_recent_updates_employees as  -- find the most recent updates for each employee 
   (select employee, max(updated) as updated
    from mytable1 t1
    group by employee),
most_recent_updated_rows as   -- find the rows with the most recent updates
   (select t1.*
    from mytable1 t1, most_recent_updates_employees m
    where t1.employee = m.employee and t1.updated = m.updated),
employees_with_2014_tv_sales_gt_30 as   -- keep most recent rows that are tv rows with 2014 sales > 30
   (select employee, customer, sales.value as sales_value
    from most_recent_updated_rows m, jsonb_to_recordset(m.sales) as sales(yr text, value float)
    where m.product = 'tv' and sales.yr = '2014' and sales.value > 30)
select e1.employee, e1.customer, e1.sales_value as "2014 tv sales", e2.avg2013_expenses as "2013 avg tv expenses"
from employees_with_2014_tv_sales_gt_30 e1, avg_exp_for_2013_tv e2
where e1.employee = e2.employee and avg2013_expenses < 5;
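
To see in isolation what the jsonb_to_recordset calls in the query above do, here is a minimal sketch that runs the function on a literal array instead of a table column. The alias s and the literal values are only illustrative; the column list (yr text, value float) is the same one the query uses.

```sql
-- Turn a jsonb array of objects into ordinary rows with typed columns.
-- Each object becomes one row; keys are matched to the column names
-- given in the AS clause.
select s.yr, s.value
from jsonb_to_recordset(
    '[{"value": 10, "yr": "2012"}, {"value": 40, "yr": "2014"}]'::jsonb
) as s(yr text, value float)
where s.yr = '2014';   -- regular SQL predicates now apply to the elements
```

Once the array elements are rows, filters such as value > 30 and aggregates such as avg(value) can be written as in any relational query, which is the point of idea 2) above.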

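Answer 1 (score: 0)

The best way to unpack multi-level json is to build records step by step, for each level and each array, selecting the values you need along the way. This gives you a nicely layered, logical query. In the case described in the question you need two joined queries, because one of them has to compute an average over values selected under a different condition than the other.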
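
As a minimal illustration of a single unpacking step, assuming a literal document instead of the real table (the aliases c and elem are made up for this sketch), the two building blocks look like this:

```sql
-- Step 1: top-level keys become typed columns; nested arrays stay jsonb.
select c.employee, c.product, c.sales
from jsonb_to_record(
    '{"employee": "Jim Romo", "product": "tv",
      "sales": [{"value": 5, "yr": "2013"}, {"value": 40, "yr": "2014"}]}'::jsonb
) as c(employee text, product text, sales jsonb);

-- Step 2: expand an array into one row per element,
-- then extract the values of interest from each element.
select elem->>'yr' as yr, (elem->>'value')::int as value
from jsonb_array_elements(
    '[{"value": 5, "yr": "2013"}, {"value": 40, "yr": "2014"}]'::jsonb
) as elem;
```

The full query nests these steps, one subquery per level.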

select distinct on (employee) employee, customer, sales_2014, avg_expenses_2013::numeric(20,2)
from (
    select s.employee, customer, updated, sales_2014, avg_expenses_2013
    from (
        select employee, customer, updated, (sales->>'value')::int sales_2014
        from (
            select employee, customer, updated, jsonb_array_elements(sales) sales
            from (
                select c.*
                from
                    mytable,
                    jsonb_to_record(data) 
                        as c(employee text, product text, customer text, updated text, sales jsonb)
                ) alias
                where product = 'tv'
            ) alias
        where sales->>'yr' = '2014'
    ) s
    join (
        select employee, avg((expenses->>'value')::numeric) avg_expenses_2013
        from (
            select employee, jsonb_array_elements(expenses) expenses
            from (
                select c.*
                from
                    mytable,
                    jsonb_to_record(data) 
                        as c(employee text, product text, expenses jsonb)
                ) alias
                where product = 'tv'
            ) alias
        where expenses->>'yr' = '2013'
        group by 1
    ) e
    on s.employee = e.employee
    where sales_2014 > 30
) alias
order by employee, updated desc;

  employee  | customer | sales_2014 | avg_expenses_2013
------------+----------+------------+-------------------
 Jim Romo   | 1        | 40         | 4.00
(1 row) 
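
Since the question asks about indexing, here is a rough sketch of the kinds of indexes PostgreSQL 9.4 offers for filters on this table. The index names are made up for illustration, and whether the planner actually uses any of them depends on the data and on how the query is written: an expression index is only considered when the query contains the same expression, for example data->>'product' = 'tv'.

```sql
-- Expression index for equality filters on a top-level key.
create index mytable_product_idx on mytable ((data->>'product'));

-- Expression index that can support "latest row per employee" lookups.
create index mytable_employee_updated_idx
    on mytable ((data->>'employee'), (data->>'updated') desc);

-- GIN index for containment (@>) queries on the whole document.
create index mytable_data_gin_idx on mytable using gin (data jsonb_path_ops);

-- A containment query of the kind the GIN index can serve:
-- tv rows that have a sales entry for 2014.
select id
from mytable
where data @> '{"product": "tv", "sales": [{"yr": "2014"}]}';
```

None of these helps with the value > 30 comparison or with the average inside the arrays. GIN indexes on jsonb support containment and key-existence operators only, so the array elements still have to be expanded row by row.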

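Even with optimizations such as indexes, the performance of this query on a large table will be very disappointing. If you have to do this kind of analysis on this data, you should reconsider the data model, which is poorly designed for the purpose. I am missing some key information needed to propose appropriate changes responsibly:

  • Is the table part of a larger model?
  • Is there more data about employees and customers in the model?
  • In what form and how often is the table updated?
  • What other analyses are performed on this table?
  • Can out-of-date entries be removed from the table?

One thing seems certain, however. You should unpack the json data into normalized tables with columns of regular types. The model could look like this: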

create table employees (
    employee_id serial primary key,
    employee_name text);

create table reports (    -- stipulated name of rows of your table
    report_id serial primary key,
    employee_id int references employees,
    product_name text,   -- product_id references products?
    customer_no int,     -- customer_id references customers?
    updated_at date);

create table sales (
    sale_id serial primary key,
    report_id int references reports,
    year_no int,
    total int);     -- numeric?

create table expenses (
    expense_id serial primary key,
    report_id int references reports,
    year_no int,
    total int);
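
As a sketch of what the question's analysis could look like against such a model, assuming the tables above are populated from the json documents (the CTE names latest and avg_exp are made up here, and the customer column is left out of the select list for brevity):

```sql
with latest as (      -- most recent report per employee
    select distinct on (employee_id) report_id, employee_id, product_name
    from reports
    order by employee_id, updated_at desc
),
avg_exp as (          -- average 2013 expenses over all tv reports of an employee
    select r.employee_id, avg(x.total) as avg_expenses_2013
    from reports r
    join expenses x on x.report_id = r.report_id
    where r.product_name = 'tv' and x.year_no = 2013
    group by r.employee_id
)
select e.employee_name, s.total as sales_2014, a.avg_expenses_2013
from latest l
join employees e on e.employee_id = l.employee_id
join sales s on s.report_id = l.report_id
join avg_exp a on a.employee_id = l.employee_id
where l.product_name = 'tv'
  and s.year_no = 2014
  and s.total > 30
  and a.avg_expenses_2013 < 5;
```

With regular columns, every filter in this query can be backed by an ordinary btree index, for example on reports (employee_id, updated_at) or on sales (report_id, year_no), which is not possible for values buried inside jsonb arrays.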