Inserting large amounts of data into PostgreSQL

Date: 2016-10-21 08:36:59

Tags: json postgresql

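I am running into performance problems when inserting millions of rows into a PostgreSQL database.

I am sending a JSON object that contains an array with millions of rows.

For each row, I create a record in the database table. I have also tried multi-row inserts, but the problem remains.

I am not sure how to handle this. I have read that the COPY command is the fastest option.

How can I improve the performance?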

My JSON object, with log as an array (the log array has millions of rows):

{"type":"monitoring","log":[
["2016-10-12T20:33:21","0.00","0.00","0.00","0.00","0.0","24.00","1.83","-0.00","1","1","-100.00"],
["2016-10-12T20:33:23","0.00","0.00","0.00","0.00","0.0","24.00","1.52","-0.61","1","1","-100.00"]]}

My current code (I am building a dynamic statement so that I can insert multiple rows in one execution):

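  IF (NOT b_first_line) THEN
    s_insert_query_values = right(s_insert_query_values, -1); -- remove the leading comma

    -- run one multi-row INSERT for all buffered value tuples
    EXECUTE format('INSERT INTO log_rlda
            (record_node_id, log_line, log_value, timestamp, record_log_id)
        VALUES %s;', s_insert_query_values);

    -- reset the buffer for the next batch
    s_insert_query_values = '';
    i_num_lines_buffered = 0;
  END IF;
END IF; -- closes an enclosing IF that is not shown in this excerpt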

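s_insert_query_values contains:

Each value inside an array within "log" needs to be inserted into its own row (in the column log_value). This is what the INSERT looks like (with s_insert_query_values substituted):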

INSERT INTO log_rlda
                  (record_node_id, log_line, log_value, timestamp, record_log_id)
          VALUES  
     (806, 1, 0.00, '2016-10-12 20:33:21', 386),
     (807, 1, 0.00, '2016-10-12 20:33:21', 386),
     (808, 1, 0.00, '2016-10-12 20:33:21', 386),
     (809, 1, 0.00, '2016-10-12 20:33:21', 386),
     (810, 1, 0.0, '2016-10-12 20:33:21', 386),
     (811, 1, 24.00, '2016-10-12 20:33:21', 386),
     (768, 1, 1.83, '2016-10-12 20:33:21', 386),
     (769, 1, 0.00, '2016-10-12 20:33:21', 386),
     (728, 1, 1, '2016-10-12 20:33:21', 386),
     (771, 1, 1, '2016-10-12 20:33:21', 386),
     (729, 1, -100.00, '2016-10-12 20:33:21', 386),
     (806, 2, 0.00, '2016-10-12 20:33:23', 386),
     (807, 2, 0.00, '2016-10-12 20:33:23', 386),
     (808, 2, 0.00, '2016-10-12 20:33:23', 386),
     (809, 2, 0.00, '2016-10-12 20:33:23', 386),
     (810, 2, 0.0, '2016-10-12 20:33:23', 386),
     (811, 2, 24.00, '2016-10-12 20:33:23', 386),
     (768, 2, 1.52, '2016-10-12 20:33:23', 386),
     (769, 2, -0.61, '2016-10-12 20:33:23', 386),
     (728, 2, 1, '2016-10-12 20:33:23', 386),
     (771, 2, 1, '2016-10-12 20:33:23', 386),
     (729, 2, -100.00, '2016-10-12 20:33:23', 386)

Solution (i_node_id_list contains the IDs that I selected before this query):

SELECT i_node_id_list[log_value_index] AS record_node_id,
                    e.log_line-1 AS log_line,
                    items.log_value::double precision as log_value,
                    to_timestamp((e.line->>0)::text, 'YYYY-MM-DD HH24:MI:SS') as "timestamp",
                    i_log_id as record_log_id
              FROM (VALUES (log_data::json)) as data (doc),
                json_array_elements(doc->'log') with ordinality as e(line, log_line),
                json_array_elements_text(e.line)     with ordinality as items(log_value, log_value_index)
              WHERE  log_value_index > 1 -- skip the timestamp element (it must not be written as log_value)
              AND  log_line  > 1

1 answer:

Answer 0 (score: 1)

You need two levels of unnesting.

select e.log_line, items.log_value, e.line -> 0 as timestamp
from (
  values ('{"type":"monitoring","log":[
  ["2016-10-12T20:33:21","0.00","0.00","0.00","0.00","0.0","24.00","1.83","-0.00","1","1","-100.00"],
  ["2016-10-12T20:33:23","0.00","0.00","0.00","0.00","0.0","24.00","1.52","-0.61","1","1","-100.00"]]}'::json)
) as data (doc), 
  json_array_elements(doc->'log') with ordinality as e(line, log_line), 
  json_array_elements(e.line)   with ordinality as items(log_value, log_value_index)
where log_value_index > 1;

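The first call to json_array_elements() extracts all array elements from the log attribute. with ordinality lets us identify each row in that array. The second call then gets each element of that row, and again with ordinality lets us find out its position within the array.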
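To see the two ordinality columns in isolation, here is a minimal sketch on a made-up two-by-two array (the literal and the aliases row_no, pos and val are illustrative only, not from the question). It uses json_array_elements_text() for the inner level, which returns plain text instead of quoted JSON values.

```sql
-- Illustrative data only: a tiny nested array instead of the real log.
select outer_elem.row_no,   -- position of the inner array within the outer array
       inner_elem.pos,      -- position of the value within its inner array
       inner_elem.val       -- the value itself, as text
from json_array_elements('[["a","b"],["c","d"]]'::json)
       with ordinality as outer_elem(arr, row_no),
     json_array_elements_text(outer_elem.arr)
       with ordinality as inner_elem(val, pos)
order by outer_elem.row_no, inner_elem.pos;

--  row_no | pos | val
-- --------+-----+-----
--       1 |   1 | a
--       1 |   2 | b
--       2 |   1 | c
--       2 |   2 | d
```

In the query of this answer, row_no corresponds to log_line and pos to log_value_index. Both functions and with ordinality require PostgreSQL 9.4 or later.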

For the sample JSON, the query at the start of this answer returns:

log_line | log_value | timestamp            
---------+-----------+----------------------
       1 | "0.00"    | "2016-10-12T20:33:21"
       1 | "0.00"    | "2016-10-12T20:33:21"
       1 | "0.00"    | "2016-10-12T20:33:21"
       1 | "0.00"    | "2016-10-12T20:33:21"
       1 | "0.0"     | "2016-10-12T20:33:21"
       1 | "24.00"   | "2016-10-12T20:33:21"
       1 | "1.83"    | "2016-10-12T20:33:21"
       1 | "-0.00"   | "2016-10-12T20:33:21"
       1 | "1"       | "2016-10-12T20:33:21"
       1 | "1"       | "2016-10-12T20:33:21"
       1 | "-100.00" | "2016-10-12T20:33:21"
       2 | "0.00"    | "2016-10-12T20:33:23"
       2 | "0.00"    | "2016-10-12T20:33:23"
       2 | "0.00"    | "2016-10-12T20:33:23"
       2 | "0.00"    | "2016-10-12T20:33:23"
       2 | "0.0"     | "2016-10-12T20:33:23"
       2 | "24.00"   | "2016-10-12T20:33:23"
       2 | "1.52"    | "2016-10-12T20:33:23"
       2 | "-0.61"   | "2016-10-12T20:33:23"
       2 | "1"       | "2016-10-12T20:33:23"
       2 | "1"       | "2016-10-12T20:33:23"
       2 | "-100.00" | "2016-10-12T20:33:23"

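The result of this SELECT statement can then be used to insert the data directly, without looping over it. That should be much faster than running many individual inserts.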
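A sketch of what such a set-based insert might look like. The staging table log_flat and its column log_time are hypothetical names chosen to keep the example self-contained, and the JSON is a shortened version of the sample from the question.

```sql
-- Hypothetical target table; the real table in the question is log_rlda.
create temporary table log_flat (
    log_line  bigint,
    log_value double precision,
    log_time  timestamp
);

-- One INSERT ... SELECT replaces the row-by-row loop.
insert into log_flat (log_line, log_value, log_time)
select e.log_line,
       items.log_value::double precision,   -- text -> number
       (e.line ->> 0)::timestamp            -- first element is the ISO 8601 timestamp
from (
  values ('{"type":"monitoring","log":[
  ["2016-10-12T20:33:21","0.00","24.00"],
  ["2016-10-12T20:33:23","0.00","1.52"]]}'::json)
) as data (doc),
  json_array_elements(data.doc -> 'log') with ordinality as e(line, log_line),
  json_array_elements_text(e.line) with ordinality as items(log_value, log_value_index)
where items.log_value_index > 1;            -- skip the timestamp element
```

With the shortened sample this inserts four rows, two per log line. In a function, the JSON literal would be replaced by the parameter that carries the document.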

I am not sure, however, how to integrate the correct record_node_id and record_log_id into that result.
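One possible way, along the lines of the asker's own solution shown earlier on this page, is to keep the node IDs in a PostgreSQL array and index it with the ordinality position. The sketch below assumes that the array holds one ID per value column, in the same order as the values; the IDs and the log ID 386 are copied from the example INSERT in the question, while the aliases ids, node_ids and log_id are made up for illustration.

```sql
select ids.node_ids[(items.log_value_index - 1)::int] as record_node_id, -- position 2 in the line maps to ID 1
       e.log_line,
       items.log_value::double precision as log_value,
       (e.line ->> 0)::timestamp as "timestamp",
       ids.log_id as record_log_id
from (
  -- assumption: one node ID per value column, plus a fixed log ID
  values (array[806, 807, 808, 809, 810, 811, 768, 769, 728, 771, 729], 386)
) as ids (node_ids, log_id),
(
  values ('{"type":"monitoring","log":[
  ["2016-10-12T20:33:21","0.00","0.00","0.00","0.00","0.0","24.00","1.83","-0.00","1","1","-100.00"],
  ["2016-10-12T20:33:23","0.00","0.00","0.00","0.00","0.0","24.00","1.52","-0.61","1","1","-100.00"]]}'::json)
) as data (doc),
  json_array_elements(data.doc -> 'log') with ordinality as e(line, log_line),
  json_array_elements_text(e.line) with ordinality as items(log_value, log_value_index)
where items.log_value_index > 1;  -- skip the timestamp element
```

The subtraction is needed because the timestamp occupies position 1 of every line. If the ID array also has an entry at position 1, as the asker's i_node_id_list appears to, the index can be used without the offset.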