Question

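I have around 2000 JSON files that I am trying to run through a Python program. A problem occurs when a JSON file is not in the correct format (Error: ValueError: No JSON object could be decoded). As a result, I cannot read it into my program.

I am currently doing something like the following: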

for files in folder:
    with open(files) as f:
        data = json.load(f)  # It causes an error at this part

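I know there are offline methods for validating and formatting JSON files, but is there a programmatic way to check and format these files? If not, is there a free/cheap alternative for fixing all of these files offline, i.e. I just run a program on the folder containing all the JSON files and it formats them as required?

Solved using @reece's comment: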

import os
import simplejson

invalid_json_files = []
read_json_files = []

def parse():
    for files in os.listdir(os.getcwd()):
        with open(files) as json_file:
            try:
                simplejson.load(json_file)
                read_json_files.append(files)
            except ValueError as e:
                print("JSON object issue: %s" % e)
                invalid_json_files.append(files)
    print(invalid_json_files, len(read_json_files))

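It turned out that I had saved a file that was not in JSON format in my working directory, which was the same place I was reading data from. Thanks for the helpful suggestions.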

Answer 1

The built-in JSON module can be used as a validator:

import json

def parse(text):
    try:
        return json.loads(text)
    except ValueError as e:
        print('invalid json: %s' % e)
        return None # or: raise

You can make it work with files by using:

with open(filename) as f:
    return json.load(f)

in place of json.loads, and you can include the filename in the error message as well.
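A minimal sketch of that file-based variant, with the filename included in the message. The function name `parse_file` is an assumption made for illustration and is not part of the answer above.

```python
import json

def parse_file(filename):
    """Return the decoded JSON from filename, or None if it is invalid."""
    try:
        with open(filename) as f:
            return json.load(f)
    except ValueError as e:
        # json.JSONDecodeError (Python 3.5+) is a subclass of ValueError
        print('invalid json in %s: %s' % (filename, e))
        return None  # or: raise
```

Because UnicodeDecodeError is also a subclass of ValueError, a file that cannot be decoded as text is reported the same way instead of stopping the run.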

On Python 3.3.5, for {test: "foo"}, I get:

invalid json: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

and on 2.7.6:

invalid json: Expecting property name: line 1 column 2 (char 1)

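This is because the correct json is {"test": "foo"}.

When handling invalid files, it is best not to process them any further. You can build a skipped.txt file listing the files that contain errors, so they can be checked and fixed by hand.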
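One way the skipped.txt idea could look is sketched below. The function name `validate_folder`, the tab-separated line layout, and the restriction to `*.json` files are all assumptions made for this example.

```python
import json
import os

def validate_folder(folder, skipped_path='skipped.txt'):
    """Check every *.json file in folder and list the failures in skipped_path."""
    skipped = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith('.json'):
            continue  # assumption: only *.json files are of interest
        path = os.path.join(folder, name)
        try:
            with open(path) as f:
                json.load(f)
        except ValueError as e:
            # remember the file and the reason, then move on to the next one
            skipped.append('%s\t%s' % (path, e))
    with open(skipped_path, 'w') as out:
        for line in skipped:
            out.write(line + '\n')
    return skipped
```

Calling `validate_folder('.')` checks the current directory and leaves one line per invalid file in skipped.txt, which can then be worked through by hand.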

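If possible, you should check the site/program that generated the invalid json files, fix it, and then regenerate the json files. Otherwise, you will keep getting new files that contain invalid JSON.

Failing that, you will need to write a custom json parser that fixes common errors. If you do, put the originals under source control (or archive them) so that you can see and check the differences made by the automated tool (as a sanity check). Ambiguous cases should be fixed by hand.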
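As a rough illustration of that workflow, the sketch below repairs only one common error (a trailing comma before `}` or `]`), keeps a copy of the original, and returns a diff for review. It is not a real JSON parser: the regular expression is deliberately naive, and the names `strip_trailing_commas`, `repair_file`, and the `.orig` suffix are assumptions made for this example.

```python
import difflib
import json
import re
import shutil

def strip_trailing_commas(text):
    """Naive repair: drop a comma that directly precedes } or ]."""
    return re.sub(r',(\s*[}\]])', r'\1', text)

def repair_file(path):
    """Try the naive repair on path.

    Returns the diff lines if the file was rewritten, or None if the file
    was already valid or could not be fixed by this rule.
    """
    with open(path) as f:
        original = f.read()
    try:
        json.loads(original)
        return None  # already valid, nothing to do
    except ValueError:
        pass
    fixed = strip_trailing_commas(original)
    try:
        json.loads(fixed)
    except ValueError:
        return None  # still invalid: leave it for manual repair
    shutil.copy2(path, path + '.orig')  # keep the original for review
    with open(path, 'w') as f:
        f.write(fixed)
    return list(difflib.unified_diff(
        original.splitlines(), fixed.splitlines(),
        fromfile=path + '.orig', tofile=path, lineterm=''))
```

The regular expression does not know about string literals, so a value that happens to contain `,}` could be altered as well. That is exactly the kind of change the diff and the archived original are there to catch.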

Answer 2

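Yes, there are ways to validate that a JSON file is valid. One way is to use a JSON parsing library that will throw an exception if the input you provide is not well-formed.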

import json

try:
    with open(filename) as f:
        json.load(f)
except ValueError:  # json raises ValueError (or a subclass) on bad input
    # oops guess it's not valid
    print("%s is not valid JSON" % filename)

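Of course, if you want to fix the file, you naturally cannot use a JSON loader, since it is not valid JSON in the first place. That is, unless the library you are using automatically fixes things for you, in which case you probably would not have this question at all.

One way would be to load the file manually, tokenize it, and attempt to detect errors and fix them as you go, but I am sure there are cases where the error simply cannot be fixed automatically and it would be better to throw an error and ask the user to fix their files.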
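Short of writing a tokenizer, the standard library already reports where decoding stopped, which is a reasonable starting point for detecting errors. The sketch below assumes Python 3.5 or later, where `json.JSONDecodeError` carries `lineno`, `colno`, and `msg` attributes. The function name `locate_error` is made up for this example.

```python
import json

def locate_error(text):
    """Return (line, column, message) for the first JSON error, or None if valid."""
    try:
        json.loads(text)
    except json.JSONDecodeError as e:
        # lineno and colno are 1-based positions of the point where parsing failed
        return e.lineno, e.colno, e.msg
    return None
```

The decoder stops at the first problem it finds, so a file with several errors has to be checked again after each fix.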

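I have not written a JSON fixer myself, so I cannot provide details on how you might actually go about fixing errors.

However, I am not sure it would be a good idea to fix all errors, since you would then be assuming that your fixes are what the user actually wants. If it is a missing comma or an extra trailing comma, that may be fine, but there may be cases where what the user wants is ambiguous.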

Answer 3

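Here is a complete python3 example for the next novice Python programmer who stumbles upon this answer. I was exporting 16000 records as json files. I had to restart the process several times, so I needed to verify that all of the json files were indeed valid before I started importing them into a new system.

I am not a python programmer, so when I tried the answers above as written, nothing happened. It seemed like a few lines of code were missing. The example below handles files in the current folder or in a specific folder.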

verify.py

import json
import os
import sys
from os.path import isfile,join

# check if a folder name was specified
if len(sys.argv) > 1:
    folder = sys.argv[1]
else:
    folder = os.getcwd()

# array to hold invalid and valid files
invalid_json_files = []
read_json_files = []

def parse():
    # loop through the folder
    for files in os.listdir(folder):
        # check if the combined path and filename is a file
        if isfile(join(folder,files)):
            # open the file
            with open(join(folder,files)) as json_file:
                # try reading the json file using the json interpreter
                try:
                    json.load(json_file)
                    read_json_files.append(files)
                except ValueError as e:
                    # if the file is not valid, print the error 
                    #  and add the file to the list of invalid files
                    print("JSON object issue: %s" % e)
                    invalid_json_files.append(files)
    print(invalid_json_files)
    print(len(read_json_files))
parse()

Example:

python3 verify.py

or

python3 verify.py somefolder

Tested with python 3.7.3

Answer 4

It was not clear to me how to provide the path to the folder, so I wanted to give an answer that includes this option.

import glob
import pandas as pd

path = r'C:\Users\altz7\Desktop\your_folder_name' # use your path
all_files = glob.glob(path + "/*.json")

data_list = []
invalid_json_files = []

for filename in all_files:
    try:
        df = pd.read_json(filename)
        data_list.append(df)
    except ValueError:
        invalid_json_files.append(filename)

print("Files in correct format: {}".format(len(data_list)))
print("Not readable files: {}".format(len(invalid_json_files)))
# df = pd.concat(data_list, axis=0, ignore_index=True)  # will create a pandas dataframe from the readable files, if you like

Validate and format JSON files
