I'm trying to write a large JSON file (at least 500 MB) into a database. I've written a script that works and is memory-friendly, but it is very slow. Any suggestions on how to make it more efficient?
My JSON file (remote sensing measurements extracted from Google Earth Engine) has the following format:
{"type":"FeatureCollection","features":[{"geometry":{"coordinates":[-55.347046,-12.179673],"geodesic":true,"type":"Point"},"id":"LT52240692005129COA00_2","properties":{"B1":null,"B2":null,"B3":null,"B4":null,"B5":null,"B7":null,"description":"","id":0.0,"name":""},"type":"Feature"},{"geometry":{"coordinates":[-52.726481,-13.374343],"geodesic":true,"type":"Point"},"id":"LT52250692005184COA00_10","properties":{"B1":217,"B2":497,"B3":424,"B4":2633,"B5":1722,"B7":747,"description":"","id":8.0,"name":""},"type":"Feature"}]}
Below is the script that reads the JSON, parses it, and writes it to the database.
import pandas as pd
import json
import sqlite3
# Variables
JSON_file = '../data/LT5oregon.geojson'
db_src = '../data/SR_ee_samples.sqlite'
table_name = 'oregon'
chunk_size = 5000
# Read JSON file
with open(JSON_file) as data_file:
    data = json.load(data_file)
# Create database connection
con = sqlite3.connect(db_src)
# Create empty dataframe
df = pd.DataFrame()
# Initialize count for row index
count = 0
# Main loop
for feature in data['features']:
    json_feature = feature['properties']
    if json_feature['B1'] is not None:
        # Build metadata
        meta = feature['id'].split('_')
        meta_dict = {'scene_id': meta[0], 'feature_id': int(meta[1])}
        # Append meta data to feature data
        json_feature.update(meta_dict)
        # Append row to df
        df = df.append(pd.DataFrame(json_feature, index=[count]))
        count += 1
        if len(df) >= chunk_size:  # When df reaches a certain number of rows, write it to the db and empty it
            df.to_sql(name=table_name, con=con, if_exists='append')
            df = pd.DataFrame()
# Write remaining rows to db
df.to_sql(name=table_name, con=con, if_exists='append')
Thanks in advance for any suggestions.
Answer 0 (score: 0)
I think you would benefit from a profiler (for example line_profiler, or cProfile from the standard library) to find out which part of the code is taking the time.
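For instance, a minimal sketch using cProfile; it assumes the parsing loop from the script above has been wrapped in a hypothetical function called load_features(), which is not in the original code:

import cProfile

# Hypothetical: the main loop from the question wrapped in load_features().
# Run it under the profiler and sort the report by cumulative time.
cProfile.run('load_features()', sort='cumulative')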
My bet is on the DataFrame append call: I suspect it has to copy the whole DataFrame on every call to keep the underlying arrays contiguous (as NumPy does). Maybe build a list of records first and then create the DataFrame from it, as sketched below?
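A minimal, untested sketch of that idea, reusing the names data, pd, con, table_name and chunk_size from the question: accumulate plain dicts in a list and build a DataFrame only once per chunk.

rows = []  # collect plain dicts instead of appending to a DataFrame

for feature in data['features']:
    json_feature = feature['properties']
    if json_feature['B1'] is not None:
        meta = feature['id'].split('_')
        json_feature.update({'scene_id': meta[0], 'feature_id': int(meta[1])})
        rows.append(json_feature)
        # Write a whole chunk to the database in one go
        if len(rows) >= chunk_size:
            pd.DataFrame(rows).to_sql(name=table_name, con=con, if_exists='append')
            rows = []

# Write any remaining rows
if rows:
    pd.DataFrame(rows).to_sql(name=table_name, con=con, if_exists='append')

Note that this drops the running count used as the index in the original; if that index matters, keep the counter, otherwise pass index=False to to_sql to avoid storing the default index.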