Can jq perform aggregation across files?

Date: 2015-11-05 01:31:41

Tags: json matlab csv elasticsearch jq

I am trying to identify a program/software that will let me efficiently take a large number of CSV files (40+ GB in total) and output a single JSON file in the specific format needed for import into Elasticsearch (ES).

Can jq efficiently take data like this:

file1:
id,age,gender,wave
1,49,M,1
2,72,F,0

file2:
id,time,event1
1,4/20/2095,V39
1,4/21/2095,T21
2,5/17/2094,V39

and aggregate it by id (so that the JSON documents for CSV rows from multiple files all fall under a single id entry), producing output like the following:

{"index":{"_index":"forum_mat","_type":"subject","_id":"1"}}
{"id":"1","file1":[{"filen":"file1","id":"1","age":"49","gender":"M","wave":"1"}],"file2":[{"filen":"file2","id":"1","time":"4/20/2095","event1":"V39"},{"filen":"file2","id":"1","time":"4/21/2095","event1":"T21"}]}
{"index":{"_index":"forum_mat","_type":"subject","_id":"2"}}
{"id":"2","file1":[{"filen":"file1","id":"2","age":"72","gender":"F","wave":"0"}],"file2":[{"filen":"file2","id":"2","time":"5/17/2094","event1":"V39"}]}

I wrote a script in Matlab, but I worry that it will be much too slow - it could take me months to process all 40+ GB of data. I have been told that Logstash (the preferred data-ingestion tool for ES) is not good at this type of aggregation.

4 answers:

Answer 0 (score: 0)

The following does, I believe, what you want, but I don't entirely understand the connection between the input files and the output you included. Hopefully this will at least set you on the right track.

The program assumes that all your data fits into memory. It uses a JSON object as a dictionary for fast lookups, so it should be quite efficient.

The approach taken here separates the CSV-to-JSON conversion from the aggregation, since there may be better ways to accomplish the former. (See, for example, the jq Cookbook entry on convert-a-csv-file-with-headers-to-json.)

The first file (scsv2json.jq) converts simple CSV to JSON. The second file (aggregate.jq) performs the aggregation. With these in place:

$ (jq -R -s -f scsv2json.jq file1.csv ;\
   jq -R -s -f scsv2json.jq file2.csv) |\
  jq -s -c -f aggregate.jq
[{"id":"1","file1":{"age":"49","gender":"M","wave":"1"},"file2":{"time":"4/21/2095","event1":"T21"}},{"id":"2","file1":{"age":"72","gender":"F","wave":"0"},"file2":{"time":"5/17/2094","event1":"V39"}}]

Note that "id" has been removed from the inner objects in the output.

aggregate.jq:

# Input: an array of objects, each with an "id" field
# such that (.id|tostring) can be used as a key.
# Output: a dictionary keyed by the id field.
def todictionary:
  reduce .[] as $row ( {}; . + { ($row.id | tostring): $row } );

def aggregate:
  .[0] as $file1
  | .[1] as $file2
  | ($file1 | todictionary) as $d1
  | ($file2 | todictionary) as $d2
  | ( [$file1[].id] + [$file2[].id] | unique ) as $keys
  | reduce ($keys[] | tostring) as $k
      ( [];
        . + [{"id": $k, 
              "file1": ($d1[$k] | del(.id)),
              "file2": ($d2[$k] | del(.id)) }] );

aggregate

scsv2json.jq:

def objectify(headers):
  . as $in
  | reduce range(0; headers|length) as $i
      ({}; .[headers[$i]] = ($in[$i]) );

def csv2table:
  def trim: sub("^ +";"") |  sub(" +$";"");
  split("\n") | map( split(",") | map(trim) );

def csv2json:
  csv2table
  | .[0] as $headers
  | reduce (.[1:][] | select(length > 0) ) as $row
      ( []; . + [ $row|objectify($headers) ]);

csv2json

The above assumes a version of jq with regex support. If your jq has no regex support, simply omit the trimming.
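For illustration only (a sketch, not part of the original answer), a regex-free variant of csv2table could simply skip the trimming, at the cost of leaving any surrounding whitespace in the field values:

def csv2table:
  # no trim here, so fields are taken verbatim
  split("\n") | map( split(",") );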

Answer 1 (score: 0)

Here is a less memory-intensive approach. It only requires file1 to be held in memory: the second file is processed one line at a time.

It is invoked as follows:

$ jq -n -R --argfile file1 <(jq -R -s -f scsv2json.jq file1.csv)\
     -f aggregate.jq file2.csv

where scsv2json.jq is as shown in the previous answer. It is not repeated here, mainly because (as noted elsewhere) some other program that converts the CSV to JSON in the same way might be just as suitable.
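As a usage note (a sketch using the same hypothetical file names), an equivalent two-step invocation that avoids process substitution would be:

$ jq -R -s -f scsv2json.jq file1.csv > file1.json
$ jq -n -R --argfile file1 file1.json -f aggregate.jq file2.csv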

aggregate.jq:

def objectify(headers):
  . as $in
  | reduce range(0; headers|length) as $i
      ({}; .[headers[$i]] = ($in[$i]) );

def csv2table:
  def trim: sub("^ +";"") |  sub(" +$";"");
  split("\n") | map( split(",") | map(trim) );

# Input: an array of objects, each with an "id" field
# such that (.id|tostring) can be used as a key.
# Output: a dictionary keyed by the id field.
def todictionary:
  reduce .[] as $row ( {}; . + { ($row.id | tostring): $row } );

# input: {"id": ID } + OBJECT2
# dict: {ID: OBJECT1, ...}
# output: {id: ID, "file1": OBJECT1, "file2": OBJECT2}
def aggregate(dict):
  .id as $id
  | (dict[$id] | del(.id)) as $o1
  | {"id": $id,
     "file1": $o1,
     "file2":  del(.id) };

# $file1 is the JSON version of file1.csv -- an array of objects
(input | csv2table[0]) as $headers
| inputs
| csv2table[0]
| objectify($headers) 
| ($file1 | todictionary) as $d1
| aggregate($d1)
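
If this filter behaves as intended, running it against the sample files from the question should yield one object per data row of file2.csv, roughly as follows (shown here in compact form):

{"id":"1","file1":{"age":"49","gender":"M","wave":"1"},"file2":{"time":"4/20/2095","event1":"V39"}}
{"id":"1","file1":{"age":"49","gender":"M","wave":"1"},"file2":{"time":"4/21/2095","event1":"T21"}}
{"id":"2","file1":{"age":"72","gender":"F","wave":"0"},"file2":{"time":"5/17/2094","event1":"V39"}}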

Answer 2 (score: 0)

Here is an approach whose jq memory requirements are very small. It assumes that you have already been able to combine all the .csv files into a single stream (or file) of JSON arrays of the form:

[id, sourceFile, baggage]

where the id values appear in sorted order. The stream might look like this:

 [1,"file1", {"a":1}]
 [1,"file2", {"b":1}]
 [1,"file3", {"c":1}]
 [2,"file1", {"d":1}]
 [2,"file2", {"e":1}]
 [3,"file1", {"f":1}]

This preliminary step requires a global sort, so you may want to choose your sort utility with some care.

You can have as many source files as you like; each array does not need to fit on a single line; and the id values need not be integers - they could, for example, be strings. One way to build such a stream is sketched below.
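As an illustration only (a sketch that is not part of the original answer, reusing the scsv2json.jq filter from above and the CSV file names from the question), each converted row can be tagged with its source file and the results sorted externally:

$ for f in file1 file2; do
    jq -R -s -f scsv2json.jq $f.csv |
    jq -c --arg fn "$f" '.[] | [.id, $fn, del(.id)]'
  done | sort > combined.json

A plain lexical sort is only adequate if the id strings already sort in the desired order; numeric ids may require sorting on a numeric key.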

Suppose the file above is named combined.json and that aggregate.jq has the contents shown below. Then the invocation:

$ jq -c -n -f aggregate.jq combined.json

will produce:

{"id":1,"file1":{"a":1},"file2":{"b":1},"file3":{"c":1}}
{"id":2,"file1":{"d":1},"file2":{"e":1}}
{"id":3,"file1":{"f":1}}

aggregate.jq (corrected):

foreach (inputs,null) as $row
  # At each iteration, if .emit then emit it
  ( {"emit": null, "current": null};

    if $row == null
    then {emit: .current, current: null}          # signal EOF
    else  {id: $row[0], ($row[1]) : $row[2] } as $this
    | if .current == null
      then {emit: null, current: $this}
      elif $row[0] == .current.id
      then .emit = null | .current += $this
      else {emit: .current, current: $this}
      end
    end;
    if .emit then .emit else empty end
  )

Answer 3 (score: 0)

As suggested in one of the comments, I ended up using SQL to export the JSON in the format I needed. Another thread was a great help. In the end, I chose to output each SQL table to its own JSON file rather than combining them (the combined file size became unmanageable). Here is the structure of the code that does this, generating the Bulk API command line and the JSON data line for each record:

create or replace function format_data_line(command text, data_str text)
returns setof text language plpgsql as $$
begin
    return next command;
    return next             
        replace(
            regexp_replace(data_str,
                '(\d\d\d\d-\d\d-\d\d)T', '\1 ', 'g'),
            e' \n ', '');
end $$;

COPY (
    with f_1 as(
       SELECT id, json_agg(fileX.*) AS tag
       FROM forum.fileX
       GROUP BY id
    )
    SELECT 
        format_data_line(
            format('{"update":{"_index":"forum2","_type":"subject","_id":%s}}',a.id),
            format('{"doc":{"id":%s,"fileX":%s}}', 
                a.id, a.tag))
    FROM f_1 a 
) TO '/path/to/json/fileX.json';
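
Given the format strings above, each exported row should yield a pair of lines roughly of the following shape (shown only as an illustration; the fileX array is whatever json_agg produced for that id):

{"update":{"_index":"forum2","_type":"subject","_id":1}}
{"doc":{"id":1,"fileX":[ ...aggregated rows for this id... ]}}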

Importing the larger files with the Bulk API also proved problematic (out-of-memory Java errors), so a script was needed to send subsets of the data to curl (for indexing in Elasticsearch) at any one time. The basic structure of that script is:

#!/bin/bash

FILE=$1
INC=100    # number of lines sent per bulk request
numline=`wc -l $FILE | awk '{print $1}'`
rm -f output/$FILE.txt
for i in `seq 1 $INC $numline`; do
    TIME=`date +%H:%M:%S`
    echo "[$TIME] Processing lines from $i to $((i + INC -1))"
    rm -f intermediates/interm_file_$i.json
    sed -n $i,$((i +INC - 1))p $FILE >> intermediates/interm_file_$i.json
    curl -s -XPOST localhost:9200/_bulk --data-binary @intermediates/interm_file_$i.json >> output/$FILE.txt
done

The "intermediates" and "output" directories should be created under the script's directory beforehand. The script can be saved as "ESscript" and run from the command line:

./ESscript fileX.json