Attempt 1

with open(json_file, encoding='UTF-8') as myfile:
    for line in myfile:
        try:
            line_contents = json.loads(line)
            temp = pd.DataFrame.from_dict(flatten_json(line_contents), orient='index').transpose()
            for col in temp.columns:
                if col not in data.columns:        
                    data[col] = np.NaN 
            data = data.append(temp)
        except:
            continue

But this code fails because, for a reason I don't understand, the for loop can't even read the lines of the file.

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-18-b3526001dc66> in <module>()
      4     data = data.drop(data.index[[0]])
      5 with open(json_file, encoding='UTF-8') as myfile:
----> 6     for line in myfile:
      7         try:
      8             line_contents = json.loads(line)

C:\ProgramData\Anaconda3\lib\codecs.py in decode(self, input, final)
    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9b in position 3615: invalid start byte

Attempt 2

Since the code was failing while reading the lines, I tried using try-except to filter out the error-prone lines in the file.

with open(json_file, encoding='UTF-8') as myfile:
    try:
        for line in myfile:
            line_contents = json.loads(line)
            temp = pd.DataFrame.from_dict(flatten_json(line_contents), orient='index').transpose()
            for col in temp.columns:
                if col not in data.columns:        
                    data[col] = np.NaN 
            data = data.append(temp)
    except:
        pass

But this doesn't work either, because as soon as an error occurs it simply skips the rest of the loop.

Attempt 3

with open(json_file, encoding='UTF-8') as myfile:
    for i in range(10000):
        try:
            line = next(myfile)
            line_contents = json.loads(line)
            temp = pd.DataFrame.from_dict(flatten_json(line_contents), orient='index').transpose()
            for col in temp.columns:
                if col not in data.columns:        
                    data[col] = np.NaN 
            data = data.append(temp)
        except:
            continue

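The problem with this approach is that I don't know how many lines the file has. I tried setting the count to a large number like 15,000,000, but it never terminated.

Question: where can I put the try-except so that it skips the lines with errors, while still structuring the for loop so that it goes through every line in the file?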

Answer 1

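Your Attempt 2 is close. You just need to move the try inside the for, so that it skips only one loop iteration (that one line) rather than the whole loop (the whole file).

But there is no need to rewrite the for loop around manual calls to next() as in your Attempt 3, because what you are trying to handle is not a failure to read lines from the file but bad UTF-8 and bad JSON.

In fact, you generally want to make the try as narrow as possible rather than as wide as possible, so that you don't accidentally swallow errors you did not expect and did not mean to swallow. For the same reason, you almost never want a bare except: statement.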
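To illustrate the point about narrow try blocks, here is a minimal sketch. The function name parse_lines and the sample input are made up for this example and are not from the question. Only the call that is expected to fail sits inside the try, and only the exception it is expected to raise is caught.

```python
import json


def parse_lines(lines):
    """Parse each string as JSON, skipping only lines that are not valid JSON."""
    records = []
    skipped = 0
    for line in lines:
        try:
            # Only the call we expect to fail lives inside the try.
            record = json.loads(line)
        except json.JSONDecodeError:
            # Only the error we expect is caught; anything else propagates.
            skipped += 1
            continue
        # This code is outside the try, so a bug here raises normally
        # instead of being silently swallowed.
        records.append(record)
    return records, skipped


records, skipped = parse_lines(['{"a": 1}', 'not json', '{"b": 2}'])
print(records, skipped)
```

With a bare except: around the whole loop body, a typo or a KeyError in the processing code would be counted as a "bad line" and never noticed.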

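Handling the JSON error is easy, but how do you handle the encoding error? One option is to open the file in binary mode and do the decoding explicitly, so that you can put a narrow try around the decoding: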

with open(json_file, mode='rb') as myfile:
    for line in myfile:
        try:
            line_contents = json.loads(line.decode())
        except (UnicodeDecodeError, json.JSONDecodeError):
            continue
        temp = pd.DataFrame.from_dict(flatten_json(line_contents), orient='index').transpose()
        # pd.concat aligns columns and fills the gaps with NaN, so no manual
        # column bookkeeping is needed (DataFrame.append was removed in pandas 2.0)
        data = pd.concat([data, temp])

But, even more simply, json.loads can accept UTF-8 bytes directly:

        try:
            line_contents = json.loads(line)
        except (UnicodeDecodeError, json.JSONDecodeError):
            continue

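(json.loads accepts bytes only in Python 3.6 and later. If you are on an older version, read the json.loads documentation for your version rather than the 3.6 docs; on Python 3.0 to 3.5 it requires str, so decode explicitly as in the previous snippet.)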
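The binary-mode approach can be tried without a real file. The sketch below assumes Python 3.6 or later and uses io.BytesIO as a stand-in for open(json_file, mode='rb'); the sample bytes and the name read_json_lines are invented for illustration.

```python
import io
import json

# Stand-in for a file opened with mode='rb': four lines, two of them bad.
raw = (
    b'{"name": "ok"}\n'
    b'{"name": "caf\x9b"}\n'   # 0x9b is not valid UTF-8 here
    b'{broken json\n'          # valid UTF-8, but not valid JSON
    b'{"name": "also ok"}\n'
)


def read_json_lines(stream):
    """Return (parsed records, list of (line number, error name) for skipped lines)."""
    good, bad = [], []
    for lineno, line in enumerate(stream, start=1):
        try:
            # json.loads decodes the bytes itself, so both kinds of
            # failure are raised at this one call.
            record = json.loads(line)
        except (UnicodeDecodeError, json.JSONDecodeError) as exc:
            bad.append((lineno, type(exc).__name__))
            continue
        good.append(record)
    return good, bad


good, bad = read_json_lines(io.BytesIO(raw))
print(good)
print(bad)
```

Because iterating over a binary stream never decodes anything, the for line itself can no longer raise UnicodeDecodeError; the error moves to the json.loads call, where the narrow try catches it.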

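"The problem with this approach is that I don't know how many lines the file has. I tried setting it to a large number like 15 million, but it never terminated."

As explained above, you don't need to do this.

But in case you ever do, here is what went wrong and how to deal with it.

When you reach the end of the file, next(myfile) raises StopIteration. But your bare except: catches that too, and the loop goes on to the next iteration, which raises StopIteration again. And so on. So if the file has 1 million lines, you have to go through 14 million except: iterations after reaching the end of the file.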
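A scaled-down sketch of that behaviour, using io.StringIO in place of the file and a guessed bound of 10 instead of 15,000,000 (both are stand-ins chosen for this example):

```python
import io

stream = io.StringIO("first\nsecond\n")  # a "file" with only two lines

lines_read = 0
swallowed = 0
for i in range(10):  # the guessed upper bound
    try:
        line = next(stream)
        lines_read += 1
    except:  # deliberately bare, to reproduce the problem
        # After the second line, every next() raises StopIteration,
        # and every one of them lands here.
        swallowed += 1
        continue

print(lines_read, swallowed)
```

The loop reads two lines and then spends the remaining eight iterations catching StopIteration; nothing tells it that the file has ended.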

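This is exactly why you don't want a bare except:. One option is to change the handler so that StopIteration is no longer swallowed. You can catch it separately and use it to break out of the loop: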

try:
    line = next(file)
except StopIteration:
    break
try:
    line_contents = json.loads(line)
except json.JSONDecodeError:
    continue

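Another alternative is to use file.readline() instead of next(file). The readline method returns an empty string at EOF, and never returns an empty string otherwise (a blank line is still '\n'). So: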

line = file.readline()
if not line:
    break
try:
    line_contents = json.loads(line)
except json.JSONDecodeError:
    continue

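Of course, either way, you no longer need to guess the length. Instead of for i in range(15000000):, just use while True:.

But then you have a while True: wrapped around line = next(file) with except StopIteration: break, which is exactly what for line in file: does in the first place, so... just use that.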
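As a check on that claim, the following sketch runs the three loop styles over the same input and compares the results. The function names and the sample data are illustrative only, and io.StringIO stands in for an open text file.

```python
import io
import json

DATA = '{"a": 1}\nnot json\n{"b": 2}\n'


def with_next(file):
    records = []
    while True:
        try:
            line = next(file)
        except StopIteration:
            break  # end of file
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue
    return records


def with_readline(file):
    records = []
    while True:
        line = file.readline()
        if not line:
            break  # readline() returns '' only at end of file
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue
    return records


def with_for(file):
    records = []
    for line in file:  # the for statement handles StopIteration itself
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue
    return records


results = [f(io.StringIO(DATA)) for f in (with_next, with_readline, with_for)]
print(results)
```

All three return the same records, so the plain for loop is the one to prefer.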

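Finally: are you sure you really want to silently ignore every non-UTF-8 line?

It may be that the data are simply garbage: each JSON text is in a different encoding, most of them UTF-8 and some in something else, and the encoding is not specified anywhere, in-band or out-of-band, so there really is no good answer. (Even then, you might want to try chardet, unicodedammit, or another heuristic guesser when UTF-8 fails.)

But if your data are actually in, say, Latin-1, what you are doing is throwing away anything that isn't plain English. It would be far more useful to work out that the data are Latin-1 and decode them as such.

That should be documented by the source of your data. If it isn't, a library like chardet or unicodedammit may help you guess (they are even more useful as an aid to manual guessing than for fully automatic guessing). If you can't work it out, then instead of silently discarding the errors, log them (for example, log the exception and the repr of the line), and come back to Stack Overflow for help with the information from the log.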
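One way to log rather than silently drop is sketched below, using only the standard library. The logger name, the function load_line, and the choice of Latin-1 as the fallback encoding are all assumptions made for this example; the right fallback depends on where your data actually come from.

```python
import json
import logging

logger = logging.getLogger("jsonl_import")  # hypothetical logger name


def load_line(raw, lineno, fallback="latin-1"):
    """Parse one line of bytes as JSON, logging anything unusual.

    Returns the parsed object, or None if the line is not valid JSON.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Record the exception and the repr of the raw line, then fall
        # back to an assumed encoding instead of discarding the line.
        logger.warning("line %d is not UTF-8 (%s): %r", lineno, exc, raw)
        text = raw.decode(fallback)
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        logger.warning("line %d is not valid JSON (%s): %r", lineno, exc, raw)
        return None


recovered = load_line(b'{"name": "caf\xe9"}', 1)  # Latin-1 bytes, not UTF-8
rejected = load_line(b'{oops', 2)                  # not JSON at all
print(recovered, rejected)
```

Note that Latin-1 can decode any byte sequence, so this fallback never fails; that makes the log the only place where a wrong guess about the encoding would show up.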

Answer 2

You have to address the real problem, which has nothing to do with JSON decoding.

You can see it in the error traceback:

      5 with open(json_file, encoding='UTF-8') as myfile:
----> 6     for line in myfile:

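Your error occurs on the for line, before json.loads is even reached!

The UnicodeDecodeError means that the file contents are not the UTF-8 you specified. You can try specifying a different encoding, or you can pass errors='ignore' when opening the file to ignore these errors: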

with open(json_file, encoding='UTF-8', errors='ignore') as myfile:

This drops any undecodable bytes during decoding, so that data is lost, but no error is raised.
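The effect of the error handler can be seen on a single byte string. This sketch uses bytes.decode, which takes the same errors values as open(); the sample bytes are invented, and errors='replace' is shown alongside only for comparison.

```python
raw = b'{"name": "caf\x9b"}'  # 0x9b cannot start a UTF-8 sequence

# 'ignore' silently drops the offending byte.
ignored = raw.decode("utf-8", errors="ignore")

# 'replace' substitutes U+FFFD, so the damage stays visible in the text.
replaced = raw.decode("utf-8", errors="replace")

print(repr(ignored))
print(repr(replaced))
```

Either way the line still parses as JSON afterwards, but the value is no longer what was in the file.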

Attempts 2 and 3 perform the line decoding inside the try clause, so they mask the real error, which is the text decoding.

Python - Using try-except with a for loop to avoid errors while reading a text file
