I am trying to load several JSON files from a directory on my Google Drive into a single pandas DataFrame.
I have tried quite a few solutions, but none of them seem to produce a positive result.
This is what I have tried so far:
path_to_json = '/path/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
jsons_data = pd.DataFrame(columns=['participants','messages','active','threadtype','thread path'])
for index, js in enumerate(json_files):
    with open(os.path.join(path_to_json, js)) as json_file:
        json_text = json.load(json_file)
        participants = json_text['participants']
        messages = json_text['messages']
        active = json_text['is_still_participant']
        threadtype = json_text['thread_type']
        threadpath = json_text['thread_path']
        jsons_data.loc[index] = [participants, messages, active, threadtype, threadpath]
jsons_data
Here is the full traceback of the error message I am receiving:
---------------------------------------------------------------------------
JSONDecodeError Traceback (most recent call last)
<ipython-input-30-8385abf6a3a7> in <module>()
1 for index, js in enumerate(json_files):
2 with open(os.path.join(path_to_json, js)) as json_file:
----> 3 json_text = json.load(json_file)
4 participants = json_text['participants']
5 messages = json_text['messages']
/usr/lib/python3.6/json/__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
297 cls=cls, object_hook=object_hook,
298 parse_float=parse_float, parse_int=parse_int,
--> 299 parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
300
301
/usr/lib/python3.6/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
352 parse_int is None and parse_float is None and
353 parse_constant is None and object_pairs_hook is None and not kw):
--> 354 return _default_decoder.decode(s)
355 if cls is None:
356 cls = JSONDecoder
/usr/lib/python3.6/json/decoder.py in decode(self, s, _w)
337
338 """
--> 339 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
340 end = _w(s, end).end()
341 if end != len(s):
/usr/lib/python3.6/json/decoder.py in raw_decode(self, s, idx)
355 obj, end = self.scan_once(s, idx)
356 except StopIteration as err:
--> 357 raise JSONDecodeError("Expecting value", s, err.value) from None
358 return obj, end
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
I have added examples of the JSON files I am trying to read.
Sample JSONs:
{
participants: [
{
name: "Test 1"
},
{
name: "Person"
}
],
messages: [
{
sender_name: "Person",
timestamp_ms: 1485467319139,
content: "Hie",
type: "Generic"
}
],
title: "Test 1",
is_still_participant: true,
thread_type: "Regular",
thread_path: "inbox/xyz"
}
#second example
{
participants: [
{
name: "Clearance"
},
{
name: "Person"
}
],
messages: [
{
sender_name: "Emmanuel Sibanda",
timestamp_ms: 1212242073308,
content: "Dear",
share: {
link: "http://www.example.com/"
},
type: "Share"
}
],
title: "Clearance",
is_still_participant: true,
thread_type: "Regular",
thread_path: "inbox/Clearance"
}
Answer 0 (score: 1)
I checked your JSON files and found the same problem in document1.json, document2.json and document3.json: the property names are not enclosed in double quotes. For example, document1.json should be corrected to:
{
"participants": [
{
"name": "Clothing"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Person",
"timestamp_ms": 1210107456233,
"content": "Good day",
"type": "Generic"
}
],
"title": "Clothing",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/Clothing"
}
Edit: you can add double quotes to the keys in your JSON files with the following line:
re.sub("([^\s^\"]+):(.+)", '"\\1":\\2', s)
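To sanity-check that substitution, here is a minimal, self-contained sketch; the snippet string below is made up to mirror the shape of the files in the question (one key per line), and the patterns are written as raw strings:

```python
import json
import re

# A made-up snippet with unquoted keys, shaped like the question's files
raw = '''{
participants: [
{
name: "Person"
}
],
is_still_participant: true
}'''

# The substitution above, with raw strings for the patterns
fixed = re.sub(r'([^\s^"]+):(.+)', r'"\1":\2', raw)
data = json.loads(fixed)
print(data['participants'][0]['name'])  # -> Person
```

Note the regex quotes one key per line, so it relies on each key starting its own line; a key sharing a line with an opening brace would get the brace quoted too.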
Answer 1 (score: 1)
There were a few challenges in working with the JSON files you provided before they could be converted to DataFrames and merged. First, the JSON keys are not quoted strings, so the files are not valid JSON; second, the arrays in the resulting "valid" JSONs have different lengths and cannot be converted to a DataFrame directly; and third, you did not specify the shape you expect for the DataFrame.
Nonetheless, this is an important question, because malformed JSON is far more common than the "valid" kind, and although there are several SO answers on fixing such JSON strings, every malformed-JSON problem is unique.
I have broken the problem down into the following parts:
1. Convert the malformed JSON in the files to valid JSON
2. Flatten the dictionaries for the DataFrame conversion
3. Create the DataFrames and merge them into one
Note: for this answer, I copied the sample JSON strings you provided into two files, "test.json" and "test1.json", and saved them to a "Test" folder.
Part 1: Convert the malformed JSON in the files to valid JSON:
Neither of the sample JSON strings you provided is valid JSON, because the keys are not quoted strings. So when you load a JSON file and parse its contents, you get an error:
with open('./Test/test.json') as f:
    data = json.load(f)
print(data)
#Error:
JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 1 (char 2)
The only way I found to fix this problem was to:
1. Rename the .json files to .txt files
2. Fix the keys with a regular expression
3. Save the fixed strings as .json files again
The three steps above are accomplished by two functions I wrote. The first renames the files to txt files and returns a list of filenames. The second accepts this list of filenames, fixes the JSON keys using a regex, and saves them in JSON format again.
import json
import os
import re
import pandas as pd

#rename to txt files and return list of filenames
def rename_to_text_files():
    all_new_filenames = []
    for filename in os.listdir('./Test'):
        if filename.endswith("json"):
            new_filename = filename.split('.')[0] + '.txt'
            os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
            all_new_filenames.append(new_filename)
        else:
            all_new_filenames.append(filename)
    return all_new_filenames
#fix JSON string and save as a JSON file again, returns a list of valid JSON filenames
def fix_dict_rename_to_json_files(files):
    json_validated_files = []
    for index, filename in enumerate(files):
        filepath = os.path.join('./Test', filename)
        with open(filepath, 'r+') as f:
            data = f.read()
            dict_converted = re.sub("(\w+):(.+)", r'"\1":\2', data)
            f.seek(0)
            f.write(dict_converted)
            f.truncate()
        #rename
        new_filename = filename[:-4] + '.json'
        os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
        json_validated_files.append(new_filename)
    print("All files converted to valid JSON!")
    return json_validated_files
So now I have two JSON files containing valid JSON. But they are still not ready for the DataFrame conversion. To better illustrate the problem, consider the valid JSON from "test.json":
#test.json
{
"participants": [
{
"name": "Test 1"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Person",
"timestamp_ms": 1485467319139,
"content": "Hie",
"type": "Generic"
}
],
"title": "Test 1",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/xyz"
}
If I read the JSON into a DataFrame, I still get an error, because the arrays under the keys have different lengths. You can check this: the "messages" key holds an array of length 1, while the "participants" key holds an array of length 2:
df = pd.read_json('./Test/test.json')
print(df)
#Error
ValueError: arrays must all be same length
In the next part, we solve this problem by flattening the dictionaries in the JSON.
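As an aside, once the JSON is valid, pandas' built-in json_normalize can do part of this flattening (one row per message, with the top-level scalars repeated). A sketch, assuming pandas >= 1.0 (older versions expose it as pandas.io.json.json_normalize); the dict below inlines the already-fixed contents of test.json. It does not pull in the "participants" list of dicts, which is one reason a custom flatten is still needed here:

```python
import pandas as pd

# The (already valid) contents of test.json, inlined for the sketch
data = {
    "participants": [{"name": "Test 1"}, {"name": "Person"}],
    "messages": [{"sender_name": "Person", "timestamp_ms": 1485467319139,
                  "content": "Hie", "type": "Generic"}],
    "title": "Test 1",
    "is_still_participant": True,
    "thread_type": "Regular",
    "thread_path": "inbox/xyz",
}

# One row per entry of "messages"; the listed top-level scalars come along as metadata
df = pd.json_normalize(data, record_path='messages',
                       meta=['title', 'is_still_participant',
                             'thread_type', 'thread_path'])
print(df.columns.tolist())
```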
Part 2: Flatten the dictionaries for the DataFrame conversion:
Since you did not specify the shape you expect for the DataFrame, I extracted the values in the best way possible and flattened the dictionaries with the following function. It assumes the keys present in the sample JSONs do not change across all the JSON files:
#accepts a dictionary, flattens as required and returns the dictionary with updated key/value pairs
def flatten(d):
    values = []
    d['participants_name'] = d.pop('participants')
    for i in d['participants_name']:
        values.append(i['name'])
    for i in d['messages']:
        d['messages_sender_name'] = i['sender_name']
        d['messages_timestamp_ms'] = str(i['timestamp_ms'])
        d['messages_content'] = i['content']
        d['messages_type'] = i['type']
        if "share" in i:
            d['messages_share_link'] = i["share"]["link"]
    d["is_still_participant"] = str(d["is_still_participant"])
    d.pop('messages')
    d.update(participants_name=values)
    return d
This time, let's consider the second sample JSON string, which also has a "share" key holding a URL. The valid JSON string is as follows:
#test1.json
{
"participants": [
{
"name": "Clearance"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Emmanuel Sibanda",
"timestamp_ms": 1212242073308,
"content": "Dear",
"share": {
"link": "http://www.example.com/"
},
"type": "Share"
}
],
"title": "Clearance",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/Clearance"
}
When we flatten this dictionary with the function above, we can feed it straight into the DataFrame function (discussed later):
with open('./Test/test1.json') as f:
    data = json.load(f)
print(flatten(data))
#Output:
{'title': 'Clearance',
'is_still_participant': 'True',
'thread_type': 'Regular',
'thread_path': 'inbox/Clearance',
'participants_name': ['Clearance', 'Person'],
'messages_sender_name': 'Emmanuel Sibanda',
'messages_timestamp_ms': '1212242073308',
'messages_content': 'Dear',
'messages_type': 'Share',
'messages_share_link': 'http://www.example.com/'}
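A side note on why this flattened dict is DataFrame-ready: pd.DataFrame broadcasts the scalar values against the two-element participants_name list, which is why each file contributes two rows to the final result. A minimal sketch with the dict above inlined:

```python
import pandas as pd

# The flattened dict printed above, inlined
flat = {'title': 'Clearance',
        'is_still_participant': 'True',
        'thread_type': 'Regular',
        'thread_path': 'inbox/Clearance',
        'participants_name': ['Clearance', 'Person'],
        'messages_sender_name': 'Emmanuel Sibanda',
        'messages_timestamp_ms': '1212242073308',
        'messages_content': 'Dear',
        'messages_type': 'Share',
        'messages_share_link': 'http://www.example.com/'}

# Scalars repeat once per element of the list-valued column
df = pd.DataFrame(flat)
print(len(df))  # -> 2
```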
Part 3: Create the DataFrames and merge them into one:
Now that we have a function to flatten the dictionaries, we can call it in a final function, in which we will:
- load each JSON into memory as a dictionary with json.load()
- flatten each dictionary and convert it to a DataFrame
- merge all the DataFrames with pd.concat(), passing the list of DataFrames as an argument
The code to accomplish these tasks:
#accepts a list of valid json filenames, creates dataframes from flattened dicts in the JSON files, merges the dataframes and returns the merged dataframe.
def create_merge_dataframes(list_of_valid_json_files):
    df_list = []
    for index, js in enumerate(list_of_valid_json_files):
        with open(os.path.join('./Test', js)) as json_file:
            data = json.load(json_file)
            flattened_json_data = flatten(data)
            df = pd.DataFrame(flattened_json_data)
            df_list.append(df)
    merged_df = pd.concat(df_list, sort=False, ignore_index=True)
    return merged_df
Let's give the whole code a test run. We start with the functions from Part 1 and end at Part 3, to obtain the merged dataframe.
#rename invalid JSON files to text
files = rename_to_text_files()
#fix JSON strings and save as JSON files again. We pass the "files" variable above as an arg for this function
json_validated_files = fix_dict_rename_to_json_files(files)
#flatten and receive merged dataframes
df = create_merge_dataframes(json_validated_files)
print(df)
The final dataframe:
title is_still_participant thread_type thread_path \
0 Test 1 True Regular inbox/xyz
1 Test 1 True Regular inbox/xyz
2 Clearance True Regular inbox/Clearance
3 Clearance True Regular inbox/Clearance
participants_name messages_sender_name messages_timestamp_ms \
0 Test 1 Person 1485467319139
1 Person Person 1485467319139
2 Clearance Emmanuel Sibanda 1212242073308
3 Person Emmanuel Sibanda 1212242073308
messages_content messages_type messages_share_link
0 Hie Generic NaN
1 Hie Generic NaN
2 Dear Share http://www.example.com/
3 Dear Share http://www.example.com/
You can change the order of the columns as you wish.
Notes:
- If you have thousands of files, you could speed the code up with the threading and asyncio libraries. For a folder of 1000 files, though, this code should work well and should not take very long.
- The code discussed provides a workflow to accomplish what you need, and I hope it helps you and anyone who runs into a similar problem.