Question

import json

with open("reverseURL.json") as file:
    file2 = json.load(file)

eagle = file2["eagle"]

sky = file2["sky"]

eagleAndSky = set(eagle).intersection(sky)

print(eagleAndSky.pop())

print(eagleAndSky.pop())

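I am trying to run this code on a 4.8 GB JSON file, but every time I run it, it freezes my computer, and I don't know what to do. The JSON file uses the tags found in photos as keys, and the value for each key is a list of the URLs of the images that carry that tag. The program works when I run it on the JSON files created from the test and validation sets, because those are small. When I run it on the JSON file for the training set, it freezes my computer because the file is huge, about 4.8 GB.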

Answer 1

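The simplest answer is to get more memory. Get enough to hold the parsed JSON and your two sets, and your algorithm will be fast again.

If buying more RAM isn't possible, you need an algorithm that is less memory-hungry. As a first step, consider using a streaming JSON parser such as ijson. This lets you keep only the pieces of the file you care about in memory. Assuming you have a lot of duplicates in eagle and sky, this step alone may reduce your memory usage enough to make the program fast again. Here is some code to illustrate; you'll have to run pip install ijson before you can run it: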

from ijson import items

eagle = set()
sky = set()
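# ijson works best with a file opened in binary mode
with open("reverseURL.json", "rb") as file:
    # The "eagle.item" prefix yields the list elements one at a time
    for o in items(file, "eagle.item"):
        eagle.add(o)
    # Read the file again
    file.seek(0)
    for o in items(file, "sky.item"):
        sky.add(o)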

eagleAndSky = eagle.intersection(sky)
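
A possible refinement of the snippet above, offered only as a sketch: the second set does not have to be built at all. If the first list is collected into a set, the second can be streamed and each item tested for membership, so only one set plus the result is held in memory. The helper name `streaming_intersection` and the URLs are made up for illustration, and plain Python iterables stand in for the ijson iterators (with the real file, the file would still need to be rewound between the two passes).

```python
def streaming_intersection(first, second):
    """Return the items that occur in both iterables.

    Only `first` is materialised as a set; `second` is consumed lazily,
    one item at a time, so it never has to fit in memory as a whole.
    """
    seen = set(first)
    common = set()
    for item in second:
        if item in seen:
            common.add(item)
    return common


# Stand-in data; these URLs are invented for the example.
eagle = [
    "http://example.com/1.jpg",
    "http://example.com/2.jpg",
    "http://example.com/2.jpg",  # duplicate, collapsed by the set
    "http://example.com/3.jpg",
]
# A generator, to mimic a source that can only be read once.
sky = (url for url in [
    "http://example.com/2.jpg",
    "http://example.com/3.jpg",
    "http://example.com/4.jpg",
])

eagleAndSky = streaming_intersection(eagle, sky)
print(sorted(eagleAndSky))
```

If one of the two lists is known to be much shorter, passing that one as `first` keeps the in-memory set as small as possible.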

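If parsing the JSON as a stream with ijson doesn't reduce memory usage enough, you'll have to store your temporary state on disk. Python's sqlite3 module is a good fit for this kind of work. You can create a temporary file database with one table for eagle and one for sky, insert all the data into each table, add a unique index to remove duplicates (and to speed up the query in the next step), then join the tables to get your intersection. Here is an example: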

import os
import sqlite3
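from tempfile import mkstemp
from ijson import items

# mkstemp creates the file safely (mktemp is deprecated); sqlite3 reopens it by path
fd, db_path = mkstemp(suffix=".sqlite3")
os.close(fd)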
conn = sqlite3.connect(db_path)
c = conn.cursor()
c.execute("create table eagle (foo text unique)")
c.execute("create table sky (foo text unique)")
conn.commit()

# ijson works best with a file opened in binary mode
with open("reverseURL.json", "rb") as file:
    for o in items(file, "eagle.item"):
        try:
            # Parameters must be a sequence, so wrap the single value in a tuple
            c.execute("insert into eagle (foo) values(?)", (o,))
        except sqlite3.IntegrityError:
            pass  # this is expected on duplicates
    file.seek(0)
    for o in items(file, "sky.item"):
        try:
            c.execute("insert into sky (foo) values(?)", (o,))
        except sqlite3.IntegrityError:
            pass  # this is expected on duplicates

conn.commit()

resp = c.execute("select sky.foo from eagle join sky on eagle.foo = sky.foo")
for foo, in resp:
    print(foo)

conn.close()
os.unlink(db_path)
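
The per-row try/except above can also be pushed into SQL. What follows is a minimal sketch, assuming SQLite's `INSERT OR IGNORE` conflict clause together with `executemany`. The function name `intersect_on_disk` and the URLs are hypothetical, and an in-memory database is used only to keep the example self-contained. For the real 4.8 GB file you would connect to a file path instead, since `:memory:` defeats the purpose of spilling to disk.

```python
import sqlite3


def intersect_on_disk(conn, eagle_items, sky_items):
    """Load both iterables into unique-keyed tables and return the common rows."""
    c = conn.cursor()
    c.execute("create table eagle (foo text unique)")
    c.execute("create table sky (foo text unique)")
    # INSERT OR IGNORE skips rows that would violate the unique constraint,
    # so duplicates need no IntegrityError handling in Python.
    # executemany consumes the generators lazily, one parameter tuple at a time.
    c.executemany("insert or ignore into eagle (foo) values (?)",
                  ((url,) for url in eagle_items))
    c.executemany("insert or ignore into sky (foo) values (?)",
                  ((url,) for url in sky_items))
    conn.commit()
    rows = c.execute(
        "select sky.foo from eagle join sky on eagle.foo = sky.foo "
        "order by sky.foo"
    )
    return [foo for (foo,) in rows]


# Stand-in data; these URLs are invented for the example.
eagle_urls = ["http://example.com/1.jpg", "http://example.com/2.jpg",
              "http://example.com/2.jpg", "http://example.com/3.jpg"]
sky_urls = ["http://example.com/3.jpg", "http://example.com/2.jpg",
            "http://example.com/4.jpg"]

conn = sqlite3.connect(":memory:")
common = intersect_on_disk(conn, eagle_urls, sky_urls)
print(common)
conn.close()
```

Batching the inserts through `executemany` should also cut down on Python-level overhead compared with one `execute` call per URL, though how much that matters for this data set is untested here.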

Reading a 4.8 GB JSON file in Python