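I am working with files that contain one JSON blob per line. Each line looks like this:
{"a":3,"b":10,"unnecessaryList":[{"value":12,"colName":"c"},{"value":792,"colName":"d"},{"value":645,"colName":"e"}],"index":"-1417561653"}
The producer of the JSON chose a needlessly nested structure where a flat one would have been entirely sufficient. I want to read the data into a pandas DataFrame with the more obvious flat structure, with the columns "a", "b", "c", "d", "e" and "index". The best I have come up with so far is to process the file twice, in different ways.
Is there a way to avoid reading the data twice like this? Does pandas offer something that would let me avoid the concat / normalize / pivot / merge mess I came up with?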
Answer 0 (score: 1)
You can use:
import pandas as pd

# Read the file as JSON Lines: one record per line
df = pd.read_json('sample.json', lines=True)
# Build a MultiIndex from the three flat columns and keep unnecessaryList as a Series
s = df.set_index(['a','b','index'])['unnecessaryList']
print(s)
a  b   index
3  10  -1417561653    [{'value': 12, 'colName': 'c'}, {'value': 792,...
       -1417561655    [{'value': 13, 'colName': 'c'}, {'value': 794,...
       -1417561658    [{'value': 14, 'colName': 'c'}, {'value': 795,...
Name: unnecessaryList, dtype: object
# Build one Series per record (colName -> value), concatenate them as columns, then transpose
L = [pd.DataFrame(x).set_index('colName')['value'] for x in s]
df = (pd.concat(L, axis=1, keys=s.index)
        .T
        .reset_index(level=[0,1])
        .rename(columns={'level_0':'a','level_1':'b'})
        # axis must be passed by keyword in recent pandas versions
        .rename_axis(None, axis=1))
print(df)
             a   b   c    d    e
-1417561653  3  10  12  792  645
-1417561655  3  10  13  794  645
-1417561658  3  10  14  795  645
Input data in the JSON file:
{"a":3,"b":10,"unnecessaryList":[{"value":12,"colName":"c"},{"value":792,"colName":"d"},{"value":645,"colName":"e"}],"index":"-1417561653"}
{"a":3,"b":10,"unnecessaryList":[{"value":13,"colName":"c"},{"value":794,"colName":"d"},{"value":645,"colName":"e"}],"index":"-1417561655"}
{"a":3,"b":10,"unnecessaryList":[{"value":14,"colName":"c"},{"value":795,"colName":"d"},{"value":645,"colName":"e"}],"index":"-1417561658"}