How to parse a BIG JSON file in Python

Time: 2017-07-03 20:29:54

Tags: python json load

I am working with a very large dataset and have run into a problem I cannot find an answer to. I am trying to parse the data as JSON, and this is what I do for a small piece of the whole dataset:

import json

s = set()

with open("data.raw", "r") as f:
    for line in f:
        d = json.loads(line)

The confusing part is that when I apply this code to the main data (about 200 GB), it raises the following error (it does not run out of memory):

    d = json.loads(line)
  File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)

type(f) = TextIOWrapper, in case that helps... and the same code works fine on a small dataset...

Here are a few lines of the data, to show the format:

{"MessageType": "SALES.CONTRACTS.SALESTATUSCHANGED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-0801743-2330650"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:57.9681547", "Source": 1}, "OldStatus": {"Status": 3, "AutoRemoveInfo": null}, "NewStatus": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 1193}, "Sku": {"Sku": "Con BM20"}, "Quantity": 1, "UnitPrice": {"amount": 11.92, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "", "Class": "", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-13T13:39:57", "PendingItems": null, "Kits": null, "Products": null, "OldSaleDate": "0001-01-01T00:00:00", "AdditionalSaleInfo": null}}
{"MessageType": "SALES.CONTRACTS.SALESHIPPINGINFOCHANGED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-0801743-2330650"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:57.9681547", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 1193}, "Sku": {"Sku": "Con BM20"}, "Quantity": 1, "UnitPrice": {"amount": 11.92, "currency": 840}}], "FulfilledItems": []}, "OldShippingInfo": {"Carrier": "", "Class": "", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "NewShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "0001-01-01T00:00:00", "PendingItems": null, "Kits": null, "Products": null, "OldSaleDate": "0001-01-01T00:00:00", "AdditionalSaleInfo": null}}
{"MessageType": "SALES.CONTRACTS.SALECREATED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-4851828-6514632"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:58.1402505", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 9223372036854775807}, "Sku": {"Sku": "NFL Blanket Seahawks"}, "Quantity": 1, "UnitPrice": {"amount": 22.99, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-13T15:51:12", "Kits": null, "Products": null, "AdditionalSaleInfo": null}}
{"MessageType": "SALES.CONTRACTS.SALECREATED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-102-3824485-2270645"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:58.3436109", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 9223372036854775807}, "Sku": {"Sku": "NFL CD Wallet Chargers"}, "Quantity": 1, "UnitPrice": {"amount": 12.99, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-12T02:49:58", "Kits": null, "Products": null, "AdditionalSaleInfo": null}}

It is JSON: I have already parsed the first 2000 lines and it works perfectly. But when I try the same process on the big file, it reports an error on the very first line of the data.

3 Answers:

Answer 0 (score: 2)

Here is some simple code to see which lines are not valid JSON, and where they sit:

import json

with open("data.raw", "r") as f:
    for i, line in enumerate(f):
        try:
            d = json.loads(line)
        except json.decoder.JSONDecodeError:
            print('Error on line', i + 1, ':\n', repr(line))
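
For what it's worth, a completely blank line reproduces exactly the error in the question: json.loads("\n") fails with "Expecting value: line 2 column 1 (char 1)". If blank lines are the culprit, skipping them is enough. A sketch, assuming every non-empty line is a complete JSON document:

import json

with open("data.raw", "r") as f:
    for line in f:
        if not line.strip():  # blank line: skip it instead of failing
            continue
        d = json.loads(line)
        # ... process d here, one record at a time ...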

Answer 1 (score: 1)

A good way to read a large JSON dataset is with a generator, i.e. Python's yield: a JSON parser that stores the whole file in memory would be too big for your RAM, while iterating step by step keeps memory usage low.

You can use an iterative JSON parser with a Pythonic interface: http://pypi.python.org/pypi/ijson/
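
For illustration, a minimal ijson sketch; the filename and the assumption that the input is one big top-level JSON array are hypothetical (the question's data is newline-delimited, so this layout does not match it directly):

import ijson  # pip install ijson

with open("big_array.json", "rb") as f:    # ijson reads from a binary stream
    for record in ijson.items(f, "item"):  # "item" selects each array element
        print(record)                      # one record in memory at a time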

But here your file has a .raw extension, so it is not a JSON file.

To read it:

import numpy as np

# interpret the file as a flat array of 16-bit integers
content = np.fromfile("data.raw", dtype=np.int16, sep="")

But this approach may crash on large files, since np.fromfile loads everything into memory at once.
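
If the file really were raw binary, a memory-mapped array is a gentler alternative. A sketch, assuming the same int16 layout as above:

import numpy as np

# mode="r" maps the file read-only; pages are loaded from disk lazily
content = np.memmap("data.raw", dtype=np.int16, mode="r")
print(content[:10])  # only the elements you touch are actually read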

If instead the .raw file turns out to be a .csv file, you can build your reader like this:

import csv

def read_big_file(filename):
    # newline="" is the recommended way to open files for the csv module
    with open(filename, "r", newline="") as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            yield row

Or like this for a plain text file:

def read_big_file(filename):
    with open(filename, "r") as _file:
        for line in _file:
            yield line

Use "rb" only when your file really is binary; in Python 3 the csv module expects text mode.

To use it:

for line in read_big_file(filename):
    ...  # process the line here; only one chunk is held in memory at a time
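
For the data in the question, where every line looks like one complete JSON document, the text-file generator combines naturally with json.loads. A sketch:

import json

for line in read_big_file("data.raw"):
    if not line.strip():          # skip blank lines
        continue
    record = json.loads(line)     # only one record in memory at a time
    print(record["MessageType"])  # e.g. SALES.CONTRACTS.SALECREATED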

If you post the first line of your file, I can answer you more precisely.

Answer 2 (score: 0)

Below is some sample JSON data. It contains records for two people, but it could just as well be a million. The code below is one solution: it reads the file line by line, assembles the data for one person at a time, and parses it into a JSON object.

Data:

[
  {
    "Name" : "Joy",
    "Address" : "123 Main St",
    "Schools" : [
      "University of Chicago",
      "Purdue University"
    ],
    "Hobbies" : [
      {
        "Instrument" : "Guitar",
        "Level" : "Expert"
      },
      {
        "percussion" : "Drum",
        "Level" : "Professional"
      }
    ],
    "Status" : "Student",
    "id" : 111,
    "AltID" : "J111"
  },
  {
    "Name" : "Mary",
    "Address" : "452 Jubal St",
    "Schools" : [
      "University of Pensylvania",
      "Washington University"
    ],
    "Hobbies" : [
      {
        "Instrument" : "Violin",
        "Level" : "Expert"
      },
      {
        "percussion" : "Piano",
        "Level" : "Professional"
      }
    ],
    "Status" : "Employed",
    "id" : 112,
    "AltID" : "M112"
  }
]

Code:

import json

curly_idx = []
jstr = ""
first_curly_found = False
with open("C:\\Users\\Rajeshs\\PycharmProjects\\Project1\\data\\test.json", 'r') as fp:
    # Reading the file line by line
    line = fp.readline()
    lnum = 0
    while line:
        for a in line:
            if a == '{':
                curly_idx.append(lnum)  # remember where this block opened
                first_curly_found = True
            elif a == '}':
                curly_idx.pop()  # a closing brace matched an earlier open one

        # when the matching right curly for every left curly has been found,
        # one complete data element has been read
        if len(curly_idx) == 0 and first_curly_found:
            jstr = f'{jstr}{line}'
            jstr = jstr.rstrip()     # drop the trailing newline/whitespace
            jstr = jstr.rstrip(',')  # drop the comma that separates records
            print("------------")
            if len(jstr) > 10:
                print("making json")
                j = json.loads(jstr)
            print(jstr)
            jstr = ""
            line = fp.readline()
            lnum += 1
            continue

        if first_curly_found:
            jstr = f'{jstr}{line}'

        line = fp.readline()
        lnum += 1
        if lnum > 100:  # safety limit for this demo; drop it for real data
            break
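
One caveat: counting braces character by character breaks as soon as a '{' or '}' appears inside a string value. For a file holding one big JSON array, the standard library's json.JSONDecoder.raw_decode gives a more robust way to peel off one element at a time. A sketch that reads the file in chunks, so the whole array never has to sit in memory at once:

import json

def iter_json_array(filename, chunk_size=65536):
    """Yield the elements of a top-level JSON array one at a time."""
    decoder = json.JSONDecoder()
    with open(filename, "r") as fp:
        buf = fp.read(chunk_size).lstrip()
        if not buf.startswith("["):
            raise ValueError("expected a top-level JSON array")
        buf = buf[1:]
        while True:
            buf = buf.lstrip().lstrip(",").lstrip()
            if buf.startswith("]"):
                return  # end of the array reached
            try:
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                chunk = fp.read(chunk_size)
                if not chunk:
                    raise  # no more data: the value really is malformed
                buf += chunk  # value was probably cut off; read more and retry
                continue
            yield obj
            buf = buf[end:]

for person in iter_json_array("test.json"):
    print(person["Name"])  # prints Joy, then Mary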