Question

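I am very new to Python (< 2 weeks) and have been asked to read 200k+ JSON files that I was given (as is) into a single database, using Python. These JSON files have flat, one-level attributes. The number of attributes varies from file to file, anywhere from 50 to 1000, but the 50 are a subset of the 1000.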

Here is a snippet of one JSON file:

{
"study_type" : "Observational",
"intervention.intervention_type" : "Device",
"primary_outcome.time_frame" : "24 months",
"primary_completion_date.type" : "Actual",
"design_info.primary_purpose" : "Diagnostic",
"design_info.secondary_purpose" : "Intervention",
"start_date" : "January 2014",
"end_date" : "March 2014",
"overall_status" : "Completed",
"location_countries.country" : "United States",
"location.facility.name" : "Generic Institution",
}

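Our goal is to build a master database from these JSON files, clean up the individual columns, run descriptive statistics on those columns, and create a final, cleaned-up database.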

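I come from a SAS background, so my thought was to use pandas and create a (very) large dataframe. I have been combing through Stack Overflow for the past week and have picked up a few things, but I feel there must be a way to make this far more efficient.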

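Below is the code I have written so far. It runs, but very slowly: I estimate it will take days if not weeks, even after eliminating the unneeded input attributes/columns that start with "result".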

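In addition, the awkward way I convert the dictionary into the final table leaves column index numbers above the column names, and I have not been able to figure out how to remove them.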

import json, os
import pandas as pd    
from copy import deepcopy

path_to_json = '/home/ubuntu/json_flat/'

#Gets list of files in directory with *.json suffix
list_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]

#Initialize series
df_list = []

#For every json file found
for js in list_files:

    with open(os.path.join(path_to_json, js)) as data_file:
        data = json.loads(data_file.read())                         #Loads Json file into dictionary
        data_file.close()                                           #Close data file / remove from memory

        data_copy = deepcopy(data)                                  #Copies dictionary file
        for k in data_copy.keys():                                  #Iterate over copied dictionary file
            if k.startswith('result'):                              #If field starts with "X" then delete from dictionary
                del data[k]
        df = pd.Series(data)                                        #Convert Dictionary to Series
        df_list.append(df)                                          #Append to empty series  
        database = pd.concat(df_list, axis=1).reset_index()         #Concatenate series into database

output_db = database.transpose()                                    #Transpose rows/columns
output_db.to_csv('/home/ubuntu/output/output_db.csv', mode = 'w', index=False)

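Any thoughts or advice would be greatly appreciated. I am completely open to using an entirely different technique or approach (in Python) if it is more efficient and still lets us achieve the goals above.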

Thanks!

Answer 1

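I have tried to reproduce your approach in a more concise way, with fewer copies and appends. It works with the example data you supplied, but I don't know whether your data set has further intricacies. Give it a try; I hope the comments help.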

import json
import os
import pandas


path_to_json = "XXX"

list_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]

#set up an empty dictionary
resultdict = {}

for fili in list_files:
    #the with avoids the extra step of closing the file
    with open(os.path.join(path_to_json, fili), "r") as inputjson:
        #the dictionary key is set to filename here, but you could also use e.g. a counter
        resultdict[fili] = json.load(inputjson)
        """
        you can exclude stuff here or later via dictionary comprehensions: 
        http://stackoverflow.com/questions/1747817/create-a-dictionary-with-list-comprehension-in-python
        e.g. as in your example code
        resultdict[fili] = {k:v for k,v in json.load(inputjson).items() if not k.startswith("result")}
        """

#put the whole thing into the DataFrame     
dataframe = pandas.DataFrame(resultdict)

#write out, transpose for desired format
with open("output.csv", "w") as csvout:
    dataframe.T.to_csv(csvout)
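
A possible variation on the last step, sketched under the assumption that every file holds one flat dictionary: `pandas.DataFrame.from_dict` with `orient="index"` builds one row per file directly, so the transpose is not needed. The file names and values below are made up for illustration and stand in for the real `resultdict`.

```python
import pandas

# Hypothetical stand-in for resultdict: file name -> flat dict from json.load
resultdict = {
    "a.json": {"study_type": "Observational", "overall_status": "Completed"},
    "b.json": {"study_type": "Interventional", "start_date": "January 2014"},
}

# orient="index" turns each outer key into a row;
# attributes missing from a file end up as NaN in that row
dataframe = pandas.DataFrame.from_dict(resultdict, orient="index")
print(dataframe.shape)
```

I have not measured whether this is faster on 200k files, so treat it as something to try rather than a guaranteed improvement.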

Answer 2

Your most critical performance bug is probably this:

database = pd.concat(df_list, axis=1).reset_index()

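You do this inside the loop, each time adding one more item to df_list and then concatenating everything again. But the "database" variable is not used until the very end, so you can do this step just once, outside the loop.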

With Pandas, "concat" in a loop is a huge anti-pattern. Build your list in the loop, then concat once.
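
A minimal sketch of that pattern, using made-up dictionaries in place of the parsed JSON files (the names `records` and `series_list` are only illustrative):

```python
import pandas as pd

# Hypothetical stand-ins for the dictionaries parsed from three JSON files
records = [
    {"study_type": "Observational", "overall_status": "Completed"},
    {"study_type": "Interventional", "start_date": "January 2014"},
    {"study_type": "Observational"},
]

series_list = []
for record in records:
    # Only append inside the loop; no concatenation happens here
    series_list.append(pd.Series(record))

# A single concat after the loop gives one column per file;
# transposing turns that into one row per file
database = pd.concat(series_list, axis=1).T
print(database.shape)
```

Leaving out `reset_index()` before the transpose should also keep the attribute names as the column labels, which is likely what the numbered header in the question comes from.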

The second thing is that you should use Pandas to read the JSON files as well: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html

Keep it simple. Write a function that takes a path, calls pd.read_json(), removes the rows you don't need (series.str.startswith()), and so on.
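
Such a function might look roughly like the sketch below. It assumes each file is a single flat JSON object, which is why `typ="series"` is passed, and it tests the prefix on the index because that is where the attribute names end up. The function name `load_study` and the demo file are hypothetical.

```python
import json
import os
import tempfile

import pandas as pd


def load_study(path, unwanted_prefix="result"):
    """Read one flat JSON file into a Series, dropping keys with the unwanted prefix."""
    # typ="series" parses a flat JSON object into a Series indexed by attribute name;
    # convert_dates=False keeps values such as "January 2014" as plain strings
    series = pd.read_json(path, typ="series", convert_dates=False)
    # The attribute names live in the index, so the string test is applied there
    return series[~series.index.str.startswith(unwanted_prefix)]


# Small demonstration with a made-up file written to a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.json")
    with open(demo_path, "w") as handle:
        json.dump({"study_type": "Observational", "result_first.value": "dropped"}, handle)
    study = load_study(demo_path)

print(list(study.index))
```

Calling it once per file inside a loop, appending each Series to a list, fits the "concat once" advice above.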

Once you have that working well, the next step is to check whether you are CPU-bound (CPU usage at 100%) or I/O-bound (CPU usage well below 100%).
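
One rough way to check this from inside Python, offered only as a sketch: compare the CPU time the process used with the wall-clock time that passed. A ratio near 1 suggests CPU-bound work, a ratio near 0 suggests the process mostly waited (for example on disk). The helper name `cpu_fraction` is made up, and watching a system monitor such as top works just as well.

```python
import time


def cpu_fraction(func, *args, **kwargs):
    """Return CPU time divided by wall-clock time for one call of func."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func(*args, **kwargs)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used if wall_used > 0 else 0.0


# Illustration: a busy loop is CPU-bound, a sleep stands in for waiting on disk
busy = cpu_fraction(lambda: sum(i * i for i in range(2_000_000)))
idle = cpu_fraction(time.sleep, 0.2)
print(f"busy loop: {busy:.2f}, sleep: {idle:.2f}")
```

In practice you would pass your own loading function and a sample of the file paths instead of the two toy calls.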

Python: Reading 200k JSON files into a Pandas DataFrame