I'm trying to write a large JSON file (at least 500 MB) into a database. I've written a script that works and is memory-friendly, but it is very slow. Any suggestions on how to make it more efficient?
My JSON file (remote sensing measurements extracted from Google Earth Engine) has the following format:
{"type":"FeatureCollection","features":[{"geometry":{"coordinates":[-55.347046,-12.179673],"geodesic":true,"type":"Point"},"id":"LT52240692005129COA00_2","properties":{"B1":null,"B2":null,"B3":null,"B4":null,"B5":null,"B7":null,"description":"","id":0.0,"name":""},"type":"Feature"},{"geometry":{"coordinates":[-52.726481,-13.374343],"geodesic":true,"type":"Point"},"id":"LT52250692005184COA00_10","properties":{"B1":217,"B2":497,"B3":424,"B4":2633,"B5":1722,"B7":747,"description":"","id":8.0,"name":""},"type":"Feature"}]}
Below is the script that reads the JSON, parses it, and writes it to the database.
import pandas as pd
import json
import sqlite3
# Variables
JSON_file = '../data/LT5oregon.geojson'
db_src = '../data/SR_ee_samples.sqlite'
table_name = 'oregon'
chunk_size = 5000
# Read JSON file
with open(JSON_file) as data_file:
    data = json.load(data_file)
# Create database connection
con = sqlite3.connect(db_src)
# Create empty dataframe
df = pd.DataFrame()
# Initialize count for row index
count = 0
# Main loop
for feature in data['features']:
    json_feature = feature['properties']
    if json_feature['B1'] is not None:
        # Build metadata
        meta = feature['id'].split('_')
        meta_dict = {'scene_id': meta[0], 'feature_id': int(meta[1])}
        # Append meta data to feature data
        json_feature.update(meta_dict)
        # Append row to df
        df = df.append(pd.DataFrame(json_feature, index=[count]))
        count += 1
        if len(df) >= chunk_size:  # When df reaches a certain number of rows, write it to the db and empty it
            df.to_sql(name=table_name, con=con, if_exists='append')
            df = pd.DataFrame()
# Write remaining rows to db
df.to_sql(name=table_name, con=con, if_exists='append')
Thanks in advance for any suggestions.
Answer 0 (score: 0)
I think you would benefit from a profiler (for example line_profiler, or cProfile from the standard library) to find out which part of the code is taking the time.
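For instance, a minimal sketch using cProfile; it assumes the parsing loop from the script above has been wrapped in a hypothetical function called load_features(), which is not in the original code:

import cProfile

# Hypothetical: the main loop from the question wrapped in load_features().
# Run it under the profiler and sort the report by cumulative time.
cProfile.run('load_features()', sort='cumulative')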
My bet is on the DataFrame append call: I suspect it has to copy the whole DataFrame on every call to keep the underlying arrays contiguous (as NumPy does). Maybe build a list of records first and then create the DataFrame from it, as sketched below?
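A minimal, untested sketch of that idea, reusing the names data, pd, con, table_name and chunk_size from the question: accumulate plain dicts in a list and build a DataFrame only once per chunk.

rows = []  # collect plain dicts instead of appending to a DataFrame

for feature in data['features']:
    json_feature = feature['properties']
    if json_feature['B1'] is not None:
        meta = feature['id'].split('_')
        json_feature.update({'scene_id': meta[0], 'feature_id': int(meta[1])})
        rows.append(json_feature)
        # Write a whole chunk to the database in one go
        if len(rows) >= chunk_size:
            pd.DataFrame(rows).to_sql(name=table_name, con=con, if_exists='append')
            rows = []

# Write any remaining rows
if rows:
    pd.DataFrame(rows).to_sql(name=table_name, con=con, if_exists='append')

Note that this drops the running count used as the index in the original; if that index matters, keep the counter, otherwise pass index=False to to_sql to avoid storing the default index.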