I am working with a very large dataset and have run into a problem I cannot find an answer to. I am trying to parse data that is in JSON; this is what I did for a slice of the whole dataset:
import json

s = set()
with open("data.raw", "r") as f:
    for line in f:
        d = json.loads(line)
The confusing part is that when I apply this code to the main data (about 200 GB in size), it shows the following error (it is not an out-of-memory error):
    d = json.loads(line)
  File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)
type(f) is TextIOWrapper, in case that helps... and data of this same shape works fine on the small dataset...
The following lines of data show the format:
{"MessageType": "SALES.CONTRACTS.SALESTATUSCHANGED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-0801743-2330650"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:57.9681547", "Source": 1}, "OldStatus": {"Status": 3, "AutoRemoveInfo": null}, "NewStatus": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 1193}, "Sku": {"Sku": "Con BM20"}, "Quantity": 1, "UnitPrice": {"amount": 11.92, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "", "Class": "", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-13T13:39:57", "PendingItems": null, "Kits": null, "Products": null, "OldSaleDate": "0001-01-01T00:00:00", "AdditionalSaleInfo": null}}
{"MessageType": "SALES.CONTRACTS.SALESHIPPINGINFOCHANGED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-0801743-2330650"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:57.9681547", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 1193}, "Sku": {"Sku": "Con BM20"}, "Quantity": 1, "UnitPrice": {"amount": 11.92, "currency": 840}}], "FulfilledItems": []}, "OldShippingInfo": {"Carrier": "", "Class": "", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "NewShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "0001-01-01T00:00:00", "PendingItems": null, "Kits": null, "Products": null, "OldSaleDate": "0001-01-01T00:00:00", "AdditionalSaleInfo": null}}
{"MessageType": "SALES.CONTRACTS.SALECREATED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-4851828-6514632"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:58.1402505", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 9223372036854775807}, "Sku": {"Sku": "NFL Blanket Seahawks"}, "Quantity": 1, "UnitPrice": {"amount": 22.99, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-13T15:51:12", "Kits": null, "Products": null, "AdditionalSaleInfo": null}}
{"MessageType": "SALES.CONTRACTS.SALECREATED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-102-3824485-2270645"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:58.3436109", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 9223372036854775807}, "Sku": {"Sku": "NFL CD Wallet Chargers"}, "Quantity": 1, "UnitPrice": {"amount": 12.99, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-12T02:49:58", "Kits": null, "Products": null, "AdditionalSaleInfo": null}}
It is JSON: I have already parsed the first 2000 lines and it works perfectly. But when I try the same procedure on the large file, it reports the error starting at the first line of the data.
Answer 0 (score: 2)
Here is some simple code to find out which lines are not valid JSON and where they are:
import json

with open("data.raw", "r") as f:
    for i, line in enumerate(f):
        try:
            d = json.loads(line)
        except json.decoder.JSONDecodeError:
            print('Error on line', i + 1, ':\n', repr(line))
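A common cause of "Expecting value" failures on otherwise valid newline-delimited JSON is blank lines between records; that is only a guess here, but if the report above turns up empty lines, a simple guard skips them:

import json

with open("data.raw", "r") as f:
    for line in f:
        if not line.strip():
            continue  # skip blank or whitespace-only lines
        d = json.loads(line)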
Answer 1 (score: 1)
A good solution for reading a large JSON dataset is to use a Python generator with yield: if your JSON parser holds the whole file in memory it will be too large for your RAM, whereas iterating step by step keeps memory usage low.
You can use an iterative JSON parser with a Pythonic interface: http://pypi.python.org/pypi/ijson/.
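As a rough illustration (my addition, not part of the original answer), and assuming the file were one large JSON document such as a top-level array rather than one object per line, an ijson loop could stream the records without loading everything into RAM:

import ijson

with open("data.raw", "rb") as f:
    # "item" selects each element of a top-level JSON array, one at a time
    for record in ijson.items(f, "item"):
        print(record["MessageType"])  # assumes records shaped like the samples above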
But here your file has a .raw extension, which is not a JSON file.
To read it:
import numpy as np

# interprets the whole file as a flat array of 16-bit integers
content = np.fromfile("data.raw", dtype=np.int16, sep="")
But this solution may crash on large files.
If the .raw file is in fact a .csv file, you can create your reader like this:
import csv

def read_big_file(filename):
    with open(filename, "r", newline="") as csvfile:  # csv.reader in Python 3 needs a text-mode file
        reader = csv.reader(csvfile)
        for row in reader:
            yield row
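If the file really were CSV (an assumption; the samples in the question are JSON), it could be consumed like this:

for row in read_big_file("data.raw"):
    print(row[0])  # first field of each record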
Or like this for a text file:
def read_big_file(filename):
    with open(filename, "r") as _file:
        for line in _file:
            yield line
Use "rb" only when your file is a binary one.
To run it:
for line in read_big_file(filename):
    # <treatment>
    # <free memory after processing a chunk of a given size>
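For the newline-delimited JSON in the question, a minimal sketch of that treatment (my illustration, reusing the text-file version of read_big_file above; the "MessageType" field is taken from the sample records) might be:

import json

s = set()
for line in read_big_file("data.raw"):
    d = json.loads(line)
    s.add(d["MessageType"])  # example: collect the distinct message types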
If you post the first lines of the file, I can give you a more precise answer.
Answer 2 (score: 0)
Below is some sample JSON data. It contains records for two people, but there could just as well be a million. The code below is one solution: it reads the file line by line, retrieves one person's data at a time, and returns it as a JSON object.
Data:
[
  {
    "Name" : "Joy",
    "Address" : "123 Main St",
    "Schools" : [
      "University of Chicago",
      "Purdue University"
    ],
    "Hobbies" : [
      {
        "Instrument" : "Guitar",
        "Level" : "Expert"
      },
      {
        "percussion" : "Drum",
        "Level" : "Professional"
      }
    ],
    "Status" : "Student",
    "id" : 111,
    "AltID" : "J111"
  },
  {
    "Name" : "Mary",
    "Address" : "452 Jubal St",
    "Schools" : [
      "University of Pensylvania",
      "Washington University"
    ],
    "Hobbies" : [
      {
        "Instrument" : "Violin",
        "Level" : "Expert"
      },
      {
        "percussion" : "Piano",
        "Level" : "Professional"
      }
    ],
    "Status" : "Employed",
    "id" : 112,
    "AltID" : "M112"
  }
]
Code:
import json

curly_idx = []
jstr = ""
first_curly_found = False
with open("C:\\Users\\Rajeshs\\PycharmProjects\\Project1\\data\\test.json", 'r') as fp:
    # Reading the file line by line
    line = fp.readline()
    lnum = 0
    while line:
        for a in line:
            if a == '{':
                curly_idx.append(lnum)
                first_curly_found = True
            elif a == '}':
                curly_idx.pop()

        # when the matching right curly for every left curly has been found,
        # one complete data element has been read
        if len(curly_idx) == 0 and first_curly_found:
            jstr = f'{jstr}{line}'
            jstr = jstr.rstrip()
            jstr = jstr.rstrip(',')  # drop the trailing comma between array elements
            print("------------")
            if len(jstr) > 10:
                print("making json")
                j = json.loads(jstr)
                print(jstr)
            jstr = ""
            line = fp.readline()
            lnum += 1
            continue

        if first_curly_found:
            jstr = f'{jstr}{line}'

        line = fp.readline()
        lnum += 1
        if lnum > 100:
            break  # safety stop for this demo
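A note on this approach (my observation, not part of the original answer): counting braces character by character only works as long as '{' and '}' never occur inside string values. For arbitrary JSON, an incremental parser such as ijson, or json.JSONDecoder().raw_decode on a buffered string, is a more robust way to pull complete objects out of a stream.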