Question

I have a JSON file from Stack Overflow that looks like the sample below.
The first line gives the number of JSON objects, which is 2 here.

   2
    {"topic":"electronics","question":"What is the effective differencial effective of this circuit","excerpt":"I'm trying to work out, in general terms, the effective capacitance of this circuit (see diagram: http://i.stack.imgur.com/BS85b.png).  \n\nWhat is the effective capacitance of this circuit and will the ...\r\n        "}
    {"topic":"electronics","question":"Heat sensor with fan cooling","excerpt":"Can I know which component senses heat or acts as heat sensor in the following circuit?\nIn the given diagram, it is said that the 4148 diode acts as the sensor. But basically it is a zener diode and ...\r\n        "}

Each JSON object has the following fields:

    question (string) : The text in the title of the question.
    excerpt (string) : Excerpt of the question body.
    topic (string) : The topic under which the question was posted.

I am learning ML, and I want to parse the data into the following format:

data[i][0] = contains question
data[i][1] = contains string
data[i][2] = topic

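so that I can train my classifier. I am new to Python. Is there a better way to represent the data, given that I am using it as training data?
I wrote this code, but it gives me an error: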

import json

with open("ml.json") as t:
    data = json.load(t)  # fails: the file as a whole is not one JSON document
    print(data)

Answer 1

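Assuming every line after the first contains exactly one object (no object spans more than one line), this function will work. It returns a generator, which is memory-efficient.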

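import json

def loadJsonLines(filePath):
    """Yield one parsed JSON object per line, after the leading count line."""
    with open(filePath) as fp:
        # The first line holds the number of objects that follow.
        objCount = int(fp.readline().strip())
        for i in range(objCount):
            line = fp.readline()
            obj = json.loads(line)
            yield obj


if __name__ == '__main__':
    import sys
    from pprint import pprint

    # Stream the objects one at a time.
    for obj in loadJsonLines(sys.argv[1]):
        pprint(obj)

    # Or collect them all into a list.
    objList = list(loadJsonLines(sys.argv[1]))
    pprint(objList)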
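
To get the data[i][0], data[i][1], data[i][2] layout asked for in the question, the parsed objects can be flattened into rows. This is only a minimal sketch: it assumes data[i][1] is meant to hold the excerpt (the question just says "string"), `to_rows` is a hypothetical helper name, and the sample lines are shortened stand-ins for the real file.

```python
import json

def to_rows(lines):
    """Turn a count line followed by one JSON object per line into rows.

    Each row is [question, excerpt, topic]. Assumption: data[i][1] should
    be the excerpt; the question does not say so explicitly.
    """
    it = iter(lines)
    count = int(next(it).strip())  # first line: number of objects
    rows = []
    for _ in range(count):
        obj = json.loads(next(it))
        rows.append([obj["question"], obj["excerpt"], obj["topic"]])
    return rows

# Shortened stand-in for the file contents (not the real data).
sample = [
    "2\n",
    '{"topic":"electronics","question":"Heat sensor with fan cooling","excerpt":"Can I know ..."}\n',
    '{"topic":"electronics","question":"Q2","excerpt":"E2"}\n',
]
data = to_rows(sample)
print(data[0][0])  # question of the first object
print(data[1][2])  # topic of the second object
```

Because `to_rows` accepts any iterable of lines, an open file object can be passed in directly instead of the sample list.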

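Also note that your file is not a JSON file. Each line (except the first, which is an integer) contains JSON data, but the file as a whole is not valid JSON, so you should not give it a .json extension.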
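
The difference can be checked with a small sketch. The `content` string below is a shortened, made-up stand-in for the file: parsing it as a single document with `json.load` raises `json.JSONDecodeError`, while parsing it line by line after the count succeeds.

```python
import io
import json

# Shortened stand-in for the file: a count line, then one object per line.
content = '2\n{"topic":"electronics"}\n{"topic":"electronics"}\n'

try:
    json.load(io.StringIO(content))
    parsed_whole = True
except json.JSONDecodeError as err:
    # The parser reads the leading 2 as a complete value, then finds more text.
    parsed_whole = False
    print("not a single JSON document:", err.msg)

# Parsing each line separately (after the count line) works.
lines = content.splitlines()
objects = [json.loads(line) for line in lines[1:1 + int(lines[0])]]
print(len(objects))
```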
