Question

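I have some JSON-formatted log files that I am copying to S3 so that I can run Hive queries against them with Elastic Map Reduce. The script I use to copy the log files to S3 is written in Python.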

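Every so often I come across a file that contains an incomplete line, usually at the end of the file. This causes any Hive query that reads the file to fail. I have been fixing the files by hand, deleting the bad line, but I would like to build this step into my Python script to prevent these failures.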

Here is an example of the kind of file I am working with:

{"logLine":{"browserName":"FireFox","userAgent":"Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0"}}
{"logLine":{"browserName":"Pre","userAgent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.24 (KHTML, like Gecko; Google Web Preview) Chrome/11.0.696 Safari/534.24"}}
{"logLine":{"browserName":"Internet Explorer","userAgent":"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1

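In this case I want to remove the last line because it is incomplete. I know it is incomplete because it is missing the end-of-line character, and because it is not valid JSON: the closing quote and curly braces are missing.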

Is there an easy way to identify and remove that line from the file using Python?

Answer 1

Python's standard library includes a json module. Its parser raises an exception if the input is not valid JSON. To check the last line, you could do something like:

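import json

# Read every line, keeping the line endings.
with open('log.txt') as file:
    lines = file.readlines()

try:
    # json.loads raises ValueError when the text is not valid JSON.
    # An empty file has no last line, so there is nothing to check.
    if lines:
        json.loads(lines[-1])
except ValueError:
    # The last line is incomplete: rewrite the file without it.
    with open('log.txt', 'w') as file:
        file.writelines(lines[:-1])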
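Since the question says the bad line is usually the last one and lacks a line ending, another option is to cut the file off after its last newline instead of rewriting it. The following is only a sketch under that assumption; the function name `drop_incomplete_last_line` is made up for illustration, and it still reads the whole file into memory to find the last newline.

```python
import json


def drop_incomplete_last_line(path):
    """Truncate `path` after its last newline if the trailing text is not valid JSON.

    Returns True if the file was shortened, False otherwise.
    """
    with open(path, 'rb+') as f:
        data = f.read()
        # Offset just past the last newline; 0 if the file has no newline at all.
        cut = data.rfind(b'\n') + 1
        tail = data[cut:]
        if not tail.strip():
            # The file is empty or already ends with a newline.
            return False
        try:
            # UnicodeDecodeError is a subclass of ValueError, so a multi-byte
            # character cut in half is treated as an incomplete line too.
            json.loads(tail.decode('utf-8'))
        except ValueError:
            f.truncate(cut)
            return True
        return False
```

A valid last line that merely lacks a trailing newline is left untouched, so this only removes text that fails to parse.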

Answer 2

I would use the example below. Note that it loads the whole file into memory, so if the file is large you may prefer to process it line by line instead.

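import json

with open('log.txt') as file:
    lines = file.readlines()

# Keep only the lines that parse; each one is re-serialized by json.dumps,
# so whitespace and escaping may differ from the original.
towrite = []
for line in lines:
    try:
        towrite.append(json.dumps(json.loads(line)) + '\n')
    except ValueError:
        pass  # skip lines that are not valid JSON

with open('log.txt', 'w') as file:
    file.writelines(towrite)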
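For the large-file case mentioned above, here is a rough sketch of the line-by-line variant. It streams the input into a temporary file in the same directory and then swaps it into place. The function name `filter_valid_json_lines` is hypothetical, UTF-8 is assumed, and unlike the snippet above it writes the original text of each line rather than re-serializing it.

```python
import json
import os
import tempfile


def filter_valid_json_lines(path):
    """Rewrite `path` keeping only lines that parse as JSON; return how many were dropped."""
    dropped = 0
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file next to the original so os.replace stays on one filesystem.
    with tempfile.NamedTemporaryFile('w', dir=directory, delete=False,
                                     encoding='utf-8') as tmp:
        with open(path, encoding='utf-8') as src:
            for line in src:  # iterating the file reads one line at a time
                try:
                    json.loads(line)
                except ValueError:
                    dropped += 1  # blank lines also land here
                    continue
                # Make sure every kept line ends with a newline.
                tmp.write(line if line.endswith('\n') else line + '\n')
    os.replace(tmp.name, path)
    return dropped
```

Writing to a temporary file first means the original is only replaced once the filtered copy is complete.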

Answer 3

You can grab every line and pass it through a filter function.

The function would look something like:

def isLineComplete(line):
    # readlines() keeps the trailing newline, so strip it before checking.
    return line.rstrip().endswith("}")

Outline:

myFile = ...

cleanLines = filter(isLineComplete, myFile.readlines())
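Keep in mind that checking for a closing brace is a cheap heuristic rather than a real validity test: a line can end in `}` and still be broken JSON. A small sketch comparing the two checks, where `looks_complete`, `is_valid_json` and the sample lines are made up for illustration:

```python
import json


def looks_complete(line):
    # Heuristic: ignore trailing whitespace and look for a closing brace.
    return line.rstrip().endswith('}')


def is_valid_json(line):
    # Full check: let the json parser decide.
    try:
        json.loads(line)
    except ValueError:
        return False
    return True


samples = [
    '{"logLine":{"browserName":"FireFox"}}\n',
    '{"logLine":{"browserName":"Internet Explorer"\n',
    '{"logLine":{"browserName":"Pre}}\n',  # ends with } but the string is never closed
]

heuristic = [looks_complete(s) for s in samples]
parsed = [is_valid_json(s) for s in samples]
print(heuristic)  # [True, False, True]
print(parsed)     # [True, False, False]
```

The third sample passes the brace check but fails to parse, which is the kind of line that would still break a Hive query.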

Answer 4

Assuming you can isolate the line, check it like this:

import json

try:
    json.loads('{"logLine":{"browserName":"Internet Explorer","userAgent":"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1')
except ValueError:
    pass  # code to remove line from file

Answer 5

You can use json.loads to try to parse each line and ignore the lines that raise an exception:

import json

lines = """{"logLine":{"browserName":"FireFox"}}
{"logLine":{"browserName":"Pre"}}
{"logLine":{"browserName":"Internet Explorer"
"""
cleaned = []
for line in lines.splitlines():
    try:
        json.loads(line)
    except ValueError:
        continue  # skip lines that are not valid JSON
    cleaned.append(line)
print(cleaned)

Removing incomplete lines from the end of JSON-formatted log files with Python
