OK, this has been another really frustrating day of being beaten by JSON. If this isn't awful to someone out there, then you're my new role model. Sorry, I don't even have a reasonable attempt to show. I have thousands of files with the structure below, and what follows is just a sample from one file, so imagine many more rows like it that I need to reformat as CSV so I can load it into a database and query it. Yes, each line is technically a JSON object, but every line has a variable structure: some have nested keys and others don't. If someone could point me in the right direction I would be very grateful. To make things worse, the number of lines in a given section of a file is never consistent, so when I tried writing a program that just reads the first 20 lines (because at least then I could handle the top section on its own), I ran into a problem with the line counts being off.
Here is the top section of a file:
{
"key":[
{"key":["val"],"key":{"key":"val","key":"val", "key":{"key":"val", "key":"val"}, "key":{"key":"val"}, "key":"val"}, "key":"val"},
{"key":["val","val","val","val"],"key":{"key":"val","key":"val"},"key":"val"},
{"key":["val"],"key":{"key":"val","key":"val", "key":{"key":"val", "key":"val"}, "key":{"key":"val"}, "key":"val"}, "key":"val"},
{"key":["val","val","val","val"],"key":{"key":"val","key":"val"},"key":"val"}
],
And here is what the bottom of the file looks like:
"key":[
{"key":"val","key":"val","key":["val", "val", "val", "val", "val", "val"]},
{"key":"val","key":"val","key":["val", "val", "val", "val", "val", "val"]},
{"key":"val","key":"val","key":["val", "val", "val", "val", "val", "val"]}
]
}
Answer 0 (score: 1)
The test data you've given is a bit nasty, because:
you've replaced every key with "key", which makes json.load() return single-item dictionaries with most of the data stomped on;
it doesn't actually match your description; it is a single, perfectly valid JSON object, not a JSON object every few lines.
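As a quick illustration of that first point (a minimal sketch; the literal below is made up, not taken from your data), Python's json module keeps only the last value when a key is repeated:

import json

# JSON text may repeat a key, but json.loads() keeps only the last value,
# so a file where every key is literally "key" loses most of its data.
sample = '{"key": "first", "key": "second", "key": "third"}'
print(json.loads(sample))  # prints {'key': 'third'}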
So I wrote the following test data instead:
{"a": 35, "c": 16, "b": 98,
"e": 47, "d": 98, "f": 82}
{"a": 41, "c": 18, "b": 32, "e": 76, "d": 66, "f": 92}
{"a": 43, "c": 79, "b": 62, "e": 55,
"d": 86, "f": 61}
{"a": 47, "c": 49, "b": 87,
"e": 85, "d": 14, "f": 46}
{"a": 60, "c": 17, "b": 36, "e": 55, "d": 25, "f": 84}
{"a": 61, "c": 38, "b": 93, "e": 26, "d": 12, "f": 82}
I then found the following:
import json

def iload_json(buff, decoder=None, _w=json.decoder.WHITESPACE.match):
    # found at http://www.benweaver.com/blog/decode-multiple-json-objects-in-python.html
    """Generate a sequence of top-level JSON values declared in the
    buffer.

    >>> list(iload_json('[1, 2] "a" { "c": 3 }'))
    [[1, 2], u'a', {u'c': 3}]
    """
    decoder = decoder or json._default_decoder
    idx = _w(buff, 0).end()
    end = len(buff)
    try:
        while idx != end:
            (val, idx) = decoder.raw_decode(buff, idx=idx)
            yield val
            idx = _w(buff, idx).end()
    except ValueError as exc:
        raise ValueError('%s (%r at position %d).' % (exc, buff[idx:], idx))
which can be used as:
import glob
from itertools import chain

def gen_json_from_file(fname):
    with open(fname) as inf:
        try:
            for obj in iload_json(inf.read()):
                yield obj
        except ValueError as e:
            print("Error parsing file '{}': {}".format(fname, e))

def gen_json_from_files(filespec):
    return chain(*(gen_json_from_file(fname) for fname in glob.glob(filespec)))

for obj in gen_json_from_files("*.json"):
    try:
        print(obj["a"])
    except KeyError:
        pass
Running this (against the test data above saved twice, as "a.json" and "b.json") results in
35
41
43
47
60
61
35
41
43
47
60
61
as expected.
Answer 1 (score: 0)
So - parsing this is not hard, but, judging by your sample, it is actually easier than what you describe.
If "each line is a JSON object", all you have to do is feed each line to the JSON parser and collect the resulting objects in a list:
import json
import os

path = <path_to_thousands_of_json_files>
for filename in os.listdir(path):
    data = []
    with open(os.path.join(path, filename)) as jsonfile:
        for line in jsonfile:
            if not line.strip():
                continue  # avoid crashing on empty lines and the trailing newline at end of file
            data.append(json.loads(line.strip()))
    # do your CSV output processing here.
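For the CSV step itself, one possible sketch (this is my addition, not part of the original answer; the helper name and the output file name are assumptions) is to take the union of the keys seen in the collected objects and let csv.DictWriter fill in the gaps:

import csv

def write_csv(rows, out_path):
    # Columns are the union of all keys seen across the row dicts;
    # DictWriter writes an empty cell for any key a row is missing.
    fieldnames = sorted({key for row in rows for key in row})
    with open(out_path, "w", newline="") as outfile:
        writer = csv.DictWriter(outfile, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)

# e.g., inside the loop above: write_csv(data, filename + ".csv")

Note that this only works cleanly for flat objects; nested values would need to be flattened first.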
However, in the sample above each line is not a complete JSON document - it looks more like the whole file is one valid JSON object. If that is the norm, then just doing:
import json
import os

path = <path_to_thousands_of_json_files>
for filename in os.listdir(path):
    with open(os.path.join(path, filename)) as jsonfile:
        data = json.load(jsonfile)
    # do CSV output
should do the job for you.
Now, that covers the parsing - if your question is only about that part, the above should be enough of an answer. I suspect that extracting the meaning of the data, and choosing which fields and headers to output in each generated CSV file, will be the bigger problem - but then, maybe you can work on that once you have the parsing working, and post further questions with more specific examples of what you want to get out of it.
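Since some of your lines have nested keys, they will need to be flattened before they can become CSV columns. A rough sketch of one common approach (dotted key paths and a "|"-joined list representation; the function name and separators are my own assumptions, not something taken from your data):

def flatten(obj, parent_key="", sep="."):
    # Recursively turns {"a": {"b": 1}, "c": [2, 3]} into
    # {"a.b": 1, "c": "2|3"} so every value fits in a single CSV cell.
    flat = {}
    for key, value in obj.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        elif isinstance(value, list):
            flat[full_key] = "|".join(str(v) for v in value)
        else:
            flat[full_key] = value
    return flat

Rows flattened this way can then be handed to csv.DictWriter as in the earlier sketch.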
Note that, in order to process thousands of files, it is wise to use Python's iterator "pattern", so that the parsing logic above stays separate from the part that processes the data and creates the output, with only a single parsed JSON file in memory at a time:
import json
import os

def get_json_data(path_to_files):
    for filename in os.listdir(path_to_files):
        with open(os.path.join(path_to_files, filename)) as jsonfile:
            yield json.load(jsonfile)

def main():
    for data in get_json_data(<path_to_files>):
        pass  # implement CSV logic here.