Question

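I'm trying to parse a GitHub archive file with yajl-py. I believe the basic format of the file is a stream of JSON objects, so the file itself is not valid JSON, but the objects it contains are.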

To test this, I installed yajl-py and then used their example parser (from https://github.com/pykler/yajl-py/blob/master/examples/yajl_py_example.py) to try to parse a file:

python yajl_py_example.py < 2012-03-12-0.json

where 2012-03-12-0.json is one of the GitHub archive files, already decompressed.

Judging by their reference implementation in Ruby, this sort of thing should work. Can the Python packages not handle JSON streams?

By the way, here's the error I get:

yajl.yajl_common.YajlError: parse error: trailing garbage
          9478bbc3","type":"PushEvent"}{"repository":{"url":"https://g
                     (right here) ------^
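
For illustration, the same situation can be reproduced with Python's standard json module. This is a minimal sketch using made-up sample data (not actual archive contents), and is_valid_json is a hypothetical helper name: the concatenated stream is rejected as a whole, while each object on its own parses fine.

```python
import json

# Two valid JSON objects written back to back (made-up sample data).
stream = '{"type": "PushEvent"}{"type": "WatchEvent"}'


def is_valid_json(text):
    """Return True if text is a single valid JSON document."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


whole_ok = is_valid_json(stream)                    # the stream as one document
first_ok = is_valid_json('{"type": "PushEvent"}')   # a single object from it
print(whole_ok, first_ok)
```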

Answer 1

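You need to use a stream parser to read the data. Yajl supports stream parsing, which lets you read one object at a time from a file or stream. That said, it doesn't look like Python has working bindings for Yajl.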

py-yajl has iterload commented out, and I'm not sure why: https://github.com/rtyler/py-yajl/commit/a618f66005e9798af848c15d9aa35c60331e6687#L1R264

This isn't a Python solution, but you can use the Ruby bindings to read the data and emit it in whatever format you need:

# gem install yajl-ruby

require 'open-uri'
require 'zlib'
require 'yajl'

gz = URI.open('http://data.githubarchive.org/2012-03-11-12.json.gz')
js = Zlib::GzipReader.new(gz).read

Yajl::Parser.parse(js) do |event|
  print event
end

Answer 2

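The example does not enable any of Yajl's extra features, so you need to enable the allow_multiple_values flag on the parser. Here is what you need to change in the basic example to make it parse your file.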

--- a/examples/yajl_py_example.py
+++ b/examples/yajl_py_example.py
@@ -37,6 +37,7 @@ class ContentHandler(YajlContentHandler):

 def main(args):
     parser = YajlParser(ContentHandler())
+    parser.allow_multiple_values = True
     if args:
         for fn in args:
             f = open(fn)

Yajl-Py is a thin wrapper around yajl, so you can use all the features Yajl provides. Here are all of the flags that yajl provides that you can enable:

yajl_allow_comments
yajl_dont_validate_strings
yajl_allow_trailing_garbage
yajl_allow_multiple_values
yajl_allow_partial_values

To turn them on in yajl-py, do the following:

parser = YajlParser(ContentHandler())
# enabling these features, note that to make it more pythonic, the prefix `yajl_` was removed
parser.allow_comments = True
parser.dont_validate_strings = True
parser.allow_trailing_garbage = True
parser.allow_multiple_values = True
parser.allow_partial_values = True
# then go ahead and parse
parser.parse()

Answer 3

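I know this has already been answered, but I prefer the following approach, which doesn't use any packages. For some reason the GitHub dictionaries are all on a single line, so you can't assume one dictionary per line. The data looks like this: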

{"json-key":"json-val", "sub-dict":{"sub-key":"sub-val"}}{"json-key2":"json-val2", "sub-dict2":{"sub-key2":"sub-val2"}}

I decided to write a function that fetches one dictionary at a time. It returns the JSON as a string.

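def read_next_dictionary(f):
    """Return the next top-level JSON object in f as a string ('' at EOF)."""
    depth = 0
    in_string = False  # braces inside string values must not be counted
    escaped = False    # previous character was a backslash inside a string
    chars = []
    while True:
        c = f.read(1)
        if not c:
            break  # EOF
        if depth == 0 and c.isspace():
            continue  # skip whitespace between objects
        chars.append(c)
        if in_string:
            if escaped:
                escaped = False
            elif c == '\\':
                escaped = True
            elif c == '"':
                in_string = False
        elif c == '"':
            in_string = True
        elif c == '{':
            depth += 1
        elif c == '}':
            depth -= 1

        if depth == 0 and not in_string:
            break

    return "".join(chars)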

I use this function in a while loop to walk through the GitHub archive:

import json
import pprint

arr_of_dicts = []
with open(file_path) as f:  # file_path: path to a gunzipped archive file
    while True:
        json_as_str = read_next_dictionary(f)
        try:
            json_dict = json.loads(json_as_str)
        except ValueError:
            break  # empty string at EOF (or malformed data) ends the loop
        arr_of_dicts.append(json_dict)

pprint.pprint(arr_of_dicts)

This works on the datasets posted here: http://www.githubarchive.org/ (after gunzip)
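
A variation on the same idea, offered only as a sketch: the standard library's json.JSONDecoder.raw_decode parses one value from the start of a string and reports where it ended, so it can do the brace matching itself, including braces inside strings. The name iter_json_objects and the chunk_size parameter are illustrative, and the code assumes every top-level value is an object, as in the archive files.

```python
import io
import json


def iter_json_objects(f, chunk_size=65536):
    """Yield JSON objects from a text stream of concatenated objects."""
    decoder = json.JSONDecoder()
    buf = ""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break  # EOF
        buf += chunk
        while True:
            buf = buf.lstrip()  # raw_decode does not skip leading whitespace
            if not buf:
                break
            try:
                obj, end = decoder.raw_decode(buf)
            except ValueError:
                break  # object is incomplete so far; read another chunk
            yield obj
            buf = buf[end:]
    if buf.strip():
        raise ValueError("stream ended with incomplete or invalid JSON")


# Usage with made-up sample data; a tiny chunk size exercises the buffering.
sample = io.StringIO('{"a": 1}{"b": "x}{"}\n{"c": [1, 2]}')
events = list(iter_json_objects(sample, chunk_size=4))
print(events)
```

Retrying the decode after each chunk is simple but repeats work when a single object is much larger than chunk_size, so a larger chunk size is the safer default.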

Answer 4

As a workaround, you can split the GitHub archive file into lines and then parse each line as JSON:

import json
with open('2013-05-31-10.json') as f:
    lines = f.read().splitlines()
    for line in lines:
        rec = json.loads(line)
        ...
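
If the file really does have one object per line, a sketch like the following avoids loading the whole file at once by iterating over the file lazily. The helper name iter_json_lines is hypothetical, and skipping blank lines is an assumption about the input rather than something the archive format guarantees.

```python
import io
import json


def iter_json_lines(f):
    """Yield one parsed JSON value per non-blank line of a text stream."""
    for line in f:  # file objects yield lines lazily
        line = line.strip()
        if line:
            yield json.loads(line)


# Usage with made-up sample data standing in for an opened archive file.
sample = io.StringIO('{"type": "PushEvent"}\n\n{"type": "WatchEvent"}\n')
records = list(iter_json_lines(sample))
print(records)
```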

Yajl parse error with the githubarchive.org JSON stream in Python
