Question

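I will have a possibly very large JSON file, and I want to stream from it instead of loading it all into memory. Based on the following statement from the JSON::XS documentation (emphasis mine), I believe it won't suit my needs. Is there a Perl 5 JSON module that will stream the results from disk?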

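"In some cases, there is the need for incremental parsing of JSON texts. While this module always has to keep both JSON text and resulting Perl data structure in memory at one time, it does allow you to parse a JSON stream incrementally. It does so by accumulating text until it has a full JSON object, which it then can decode. This process is similar to using decode_prefix to see if a full JSON object is available, but is much more efficient (and can be implemented with a minimum of method calls)."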

To clarify, the JSON will contain an array of objects. I want to read one object at a time from the file.

Answer 1

In terms of ease of use and speed, JSON::SL seems to be the winner:

#!/usr/bin/perl

use strict;
use warnings;

use JSON::SL;

my $p = JSON::SL->new;

#look for everything past the first level (i.e. everything in the array)
$p->set_jsonpointer(["/^"]);

local $/ = \5; #read only 5 bytes at a time
while (my $buf = <DATA>) {
    $p->feed($buf); #parse what you can
    #fetch anything that completed the parse and matches the JSON Pointer
    while (my $obj = $p->fetch) {
        print "$obj->{Value}{n}: $obj->{Value}{s}\n";
    }
}

__DATA__
[
    { "n": 0, "s": "zero" },
    { "n": 1, "s": "one"  },
    { "n": 2, "s": "two"  }
]

JSON::Streaming::Reader works, but it is slower and its interface is too verbose (every one of these coderefs is required, even though many of them do nothing):

#!/usr/bin/perl

use strict;
use warnings;

use JSON::Streaming::Reader;

my $p = JSON::Streaming::Reader->for_stream(\*DATA);

my $obj;
my $attr;
$p->process_tokens(
    start_array    => sub {}, #who cares?
    end_array      => sub {}, #who cares?
    end_property   => sub {}, #who cares?
    start_object   => sub { $obj = {}; },     #clear the current object
    start_property => sub { $attr = shift; }, #get the name of the attribute
    #add the value of the attribute to the object
    add_string     => sub { $obj->{$attr} = shift; },
    add_number     => sub { $obj->{$attr} = shift; },
    #object has finished parsing, it can be used now
    end_object     => sub { print "$obj->{n}: $obj->{s}\n"; },
);

__DATA__
[
    { "n": 0, "s": "zero" },
    { "n": 1, "s": "one"  },
    { "n": 2, "s": "two"  }
]

Parsing 1,000 records took JSON::SL 0.2 seconds and JSON::Streaming::Reader 3.6 seconds (note that JSON::SL was fed 4k at a time; I had no control over JSON::Streaming::Reader's buffer size).

Answer 2

Have you looked at JSON::Streaming::Reader, which shows up first when searching for "JSON Stream" on search.cpan.org?

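Alternatively, there is JSON::SL, which I found by searching for "JSON SAX". Those are not quite as obvious as search terms, but what you describe sounds like what a SAX parser does for XML.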

Answer 3

"It does so by accumulating text until it has a full JSON object, which it then can decode."

This is what trips you up. A JSON document is one object.

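You need to define more clearly what you want from incremental parsing. Are you looking for one element of a large mapping? What are you trying to do with the information that you read or write?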

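I don't know of any library that will incrementally parse JSON data by reading one element out of an array at a time. However, this is quite simple to implement yourself using a finite state automaton (basically your file has the format \s*\[\s*([^,]+,)*([^,]+)?\s*\]\s*, except that you need to handle commas inside strings and nested values correctly).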
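The state-machine idea above can be sketched roughly as follows. This is only an illustration, not code from any of the modules mentioned in this thread: make_array_splitter is a hypothetical helper name, and the sketch assumes the input is a single top-level JSON array fed in as raw bytes. It tracks just three things (nesting depth, whether it is inside a string, and whether the previous character was a backslash) and hands each completed top-level element to the core JSON::PP module for decoding.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical helper: returns a closure. Each call takes the next chunk of
# text and returns the array elements that were completed by that chunk.
sub make_array_splitter {
    my ($depth, $in_string, $escaped, $current) = (0, 0, 0, '');
    my $json = JSON::PP->new->utf8->allow_nonref;

    return sub {
        my ($chunk) = @_;
        my @done;
        for my $ch (split //, $chunk) {
            if ($in_string) {
                # Inside a string: brackets and commas are plain data.
                $current .= $ch;
                if    ($escaped)    { $escaped   = 0 }
                elsif ($ch eq '\\') { $escaped   = 1 }
                elsif ($ch eq '"')  { $in_string = 0 }
                next;
            }
            if ($depth == 0) {
                # Outside the outer array: wait for its opening "[".
                $depth = 1 if $ch eq '[';
                next;
            }
            if ($depth == 1 && ($ch eq ',' || $ch eq ']')) {
                # A top-level "," or "]" ends the current element.
                push @done, $json->decode($current) if $current =~ /\S/;
                $current = '';
                $depth   = 0 if $ch eq ']';
                next;
            }
            $in_string = 1 if $ch eq '"';
            $depth++ if $ch eq '[' || $ch eq '{';
            $depth-- if $ch eq ']' || $ch eq '}';
            $current .= $ch;
        }
        return @done;
    };
}

# Usage: feed five bytes at a time from an in-memory filehandle.
my $feed = make_array_splitter();
my $text = '[ {"n": 0, "s": "zero, [0]"}, {"n": 1, "s": "one"} ]';
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
local $/ = \5;
while (my $buf = <$fh>) {
    print "$_->{n}: $_->{s}\n" for $feed->($buf);
}
```

Only one element's text is held in memory at a time, at the cost of scanning character by character in Perl, so a C-based parser is likely to be considerably faster on large inputs.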

Answer 4

Did you try skipping the opening bracket [ first, and then the commas , between elements:

$json->incr_text =~ s/^ \s* \[ //x;
...
$json->incr_text =~ s/^ \s* , //x;
...
$json->incr_text =~ s/^ \s* \] //x;

As in the third example at: http://search.cpan.org/dist/JSON-XS/XS.pm#EXAMPLES
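A condensed sketch of that approach, loosely following the pattern in the JSON::XS documentation, might look like this. The function name each_array_element, the callback interface, and the chunk size are illustrative assumptions rather than part of JSON::XS. The sketch also assumes a non-empty top-level array whose elements are objects or arrays. It relies on incr_parse only accumulating text when called in void context, and on incr_text being usable as an lvalue while the parser is not in the middle of an element.

```perl
use strict;
use warnings;
use JSON::XS;

# Hypothetical helper: call $cb->($element) for each element of the
# top-level JSON array read from $fh, reading $chunk_size bytes at a time.
sub each_array_element {
    my ($fh, $cb, $chunk_size) = @_;
    $chunk_size ||= 65536;
    my $json = JSON::XS->new;

    # Append the next chunk to the parser's buffer. In void context
    # incr_parse only stores the text and does not try to parse it.
    my $more = sub {
        my $n = read $fh, my $buf, $chunk_size;
        die "read error: $!"          unless defined $n;
        die "unexpected end of input" unless $n;
        $json->incr_parse($buf);
        return;
    };

    # Read until the opening "[" of the outer array is found, and remove it.
    do { $more->() } until $json->incr_text =~ s/^ \s* \[ //x;

    while (1) {
        # Read until one complete element has been parsed.
        my $obj;
        $more->() until defined($obj = $json->incr_parse);
        $cb->($obj);

        # Skip the "," before the next element, or stop at the closing "]".
        while (1) {
            $json->incr_text =~ s/^\s+//;
            return if $json->incr_text =~ s/^\]//;
            last   if $json->incr_text =~ s/^,//;
            die "parse error near: " . $json->incr_text
                if length $json->incr_text;
            $more->();
        }
    }
}

# Usage: an in-memory filehandle stands in for the large file.
my $data = '[ {"n": 0, "s": "zero"}, {"n": 1, "s": "one"} ]';
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
each_array_element($fh, sub { print "$_[0]{n}: $_[0]{s}\n" }, 5);
```

The parser still holds the text of the element it is currently reading plus the decoded result, but never the whole array, which is what the question asks for.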

Answer 5

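If you have control over how the JSON is generated, then I suggest turning pretty-printing off and printing one object per line. That makes parsing simple, like so: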

use Data::Dumper;
use JSON::Parse 'json_to_perl';
use JSON;
use JSON::SL;
my $json_sl = JSON::SL->new();
use JSON::XS;
my $json_xs = JSON::XS->new();
$json_xs = $json_xs->pretty(0);
#$json_xs = $json_xs->utf8(1);
#$json_xs = $json_xs->ascii(0);
#$json_xs = $json_xs->allow_unknown(1);

my ($file) = @ARGV;
unless( defined $file && -f $file )
{
  print STDERR "usage: $0 FILE\n";
  exit 1;
}


# CMD and ARGS are placeholders for whatever command emits the JSON, one object per line
my @cmd = ( qw( CMD ARGS ), $file );
open my $JSON, '-|', @cmd or die "Failed to exec @cmd: $!";

# local $/ = \4096; #read 4k at a time
while( my $line = <$JSON> )
{
  if( my $obj = json($line) )
  {
     print Dumper($obj);
  }
  else
  {
     die "error: failed to parse line - $line";
  }
  exit if( $. == 5 );
}

exit 0;

sub json
{
  my ($data) = @_;

  return decode_json($data);
}

sub json_parse
{
  my ($data) = @_;

  return json_to_perl($data);
}

sub json_xs
{
  my ($data) = @_;

  return $json_xs->decode($data);
}

sub json_xs_incremental
{
  my ($data) = @_;
  my $result = [];

  $json_xs->incr_parse($data);  # void context, so no parsing
  push( @$result, $_ ) for( $json_xs->incr_parse );

  return $result;
}

sub json_sl_incremental
{
  my ($data) = @_;
  my $result = [];

  $json_sl->feed($data);
  push( @$result, $_ ) for( $json_sl->fetch );
  # ? error: JSON::SL - Got error CANT_INSERT at position 552 at json_to_perl.pl line 82, <$JSON> line 2.

  return $result;
}
