Question

I have a txt file that contains JSON structures. The problem is that the file contains not only JSON structures but also raw text such as log lines:

2019-01-18 21:00:05.4521|INFO|Technical|Batch Started|
2019-01-18 21:00:08.8740|INFO|Technical|Got Entities List from 20160101 00:00 : 
{
"name": "1111",
"results": [{
    "filename": "xxxx",
    "numberID": "7412"
}, {
    "filename": "xgjhh",
    "numberID": "E52"
}]
}

2019-01-18 21:00:05.4521|INFO|Technical|Batch Started|
2019-01-18 21:00:08.8740|INFO|Technical|Got Entities List from 20160101 00:00 :
{
"name": "jfkjgjkf",
"results": [{
    "filename": "hhhhh",
    "numberID": "478962"
}, {
    "filename": "jkhgfc",
    "number": "12544"
}]
}

I read the .txt file, but I get an error when I try to parse the JSON structures. IN:

import json
with open("data.txt", "r", encoding="utf-8", errors='ignore') as f:
   json_data = json.load(f)

OUT: json.decoder.JSONDecodeError: Extra data: line 1 column 5 (char 4)

I want to extract the JSON and save it as a CSV file.

Answer 1

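You can do one of the following:

On the command line, delete every line in which, for example, "|INFO|Technical|" appears (assuming this appears in every line of raw text):
sed -i '' -e '/|INFO|Technical|/d' yourfilename (if on a Mac), or
sed -i '/|INFO|Technical|/d' yourfilename (if on Linux).
Note that "|" is a literal character in sed's basic regular expressions and should not be escaped; GNU sed treats "\|" as alternation.
Move these raw lines into their own JSON fields.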
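
Note that once the log lines are removed, the file still holds several JSON objects back to back, so json.load would still report extra data. Below is a minimal Python sketch of both steps. It assumes every log line contains "|INFO|" as in the sample and that no JSON line does; the function name parse_without_log_lines is made up for illustration, and json.JSONDecoder.raw_decode is used to read one object at a time.

```python
import json


def parse_without_log_lines(text, marker="|INFO|"):
    """Drop lines containing `marker`, then decode the JSON objects that remain.

    Assumption: every log line contains `marker` and no JSON line does.
    """
    kept = [line for line in text.splitlines() if marker not in line]
    remaining = "\n".join(kept)

    decoder = json.JSONDecoder()
    objects = []
    pos = 0
    while pos < len(remaining):
        # raw_decode does not skip leading whitespace, so skip it by hand
        if remaining[pos].isspace():
            pos += 1
            continue
        # raw_decode returns the parsed object and the index where it ended
        obj, pos = decoder.raw_decode(remaining, pos)
        objects.append(obj)
    return objects


sample = """2019-01-18 21:00:05.4521|INFO|Technical|Batch Started|
{
"name": "1111",
"results": [{"filename": "xxxx", "numberID": "7412"}]
}

2019-01-18 21:00:08.8740|INFO|Technical|Batch Started|
{"name": "jfkjgjkf", "results": []}
"""

objects = parse_without_log_lines(sample)
print(objects)
```

If some log lines use a level other than INFO, the marker would have to be adapted, or the lines matched by their timestamp instead.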

Answer 2

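A more general solution for parsing a file in which JSON objects are mixed with other content, without making any assumptions about the non-JSON content, is to split the file content into fragments at the curly braces, start with the first fragment that is an opening curly brace, and then join the remaining fragments one by one until the joined string can be parsed as JSON: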

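import json
import re

with open("data.txt", "r", encoding="utf-8", errors="ignore") as f:
    # Split on braces, keeping each brace as a fragment of its own
    fragments = iter(re.split(r'([{}])', f.read()))

while True:
    try:
        # Skip fragments until the next opening brace
        while True:
            candidate = next(fragments)
            if candidate == '{':
                break
        # Append fragments until the candidate parses as JSON
        while True:
            candidate += next(fragments)
            try:
                print(json.loads(candidate))
                break
            except json.decoder.JSONDecodeError:
                pass
    except StopIteration:
        break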

This outputs:

{'name': '1111', 'results': [{'filename': 'xxxx', 'numberID': '7412'}, {'filename': 'xgjhh', 'numberID': 'E52'}]}
{'name': 'jfkjgjkf', 'results': [{'filename': 'hhhhh', 'numberID': '478962'}, {'filename': 'jkhgfc', 'number': '12544'}]}

Answer 3

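Use the "text structure" as the delimiter between JSON objects.

Iterate over the lines in the file, saving them to a buffer until you encounter a line that is a text line, at which point you parse the lines you have saved as a JSON object.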

import re
import json

def is_text(line):
    # returns True if line starts with a date and time in "YYYY-MM-DD HH:MM:SS" format
    line = line.lstrip('|') # you said some lines start with a leading |, remove it
    return re.match(r"^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})", line)

json_objects = []

with open("data.txt") as f:
    json_lines = []

    for line in f:
        if not is_text(line):
            json_lines.append(line)
        else:
            # if there's multiple text lines in a row json_lines will be empty
            if json_lines:
                json_objects.append(json.loads("".join(json_lines)))
                json_lines = []

    # we still need to parse the remaining object in json_lines
    # if the file doesn't end in a text line
    if json_lines:
        json_objects.append(json.loads("".join(json_lines)))

print(json_objects)

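The duplicated logic in the last two lines is a bit ugly, but you need to handle the case where the last line of the file is not a text line. So once the for loop is done, you need to parse the last object left in json_lines, if there is one.

I'm assuming there is never more than one JSON object between text lines, and my regex for the date will break in about 8000 years.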

Answer 4

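This solution strips out the non-JSON structures and wraps the JSON structures in a containing JSON structure, which should do the job for you. I'm posting it as is for convenience and will then edit the answer with a clearer explanation. I'll edit this first part once that is done: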

import json

with open("data.txt", "r", encoding="utf-8", errors='ignore') as f:
    # Drop log lines, strip the rest, and turn blank lines into split markers
    cleaned = ''.join([item.strip() if item.strip() != '' else '-split_here-' for item in f.readlines() if '|INFO|' not in item]).split('-split_here-')

# Join the non-empty chunks with ', ', wrap them in a container, and parse the result
json_data = json.loads('{"entries":[' + ''.join([entry + ', ' for entry in cleaned if entry])[:-2] + ']}')

Output (the wrapped JSON text):

{"entries":[{"name": "1111","results": [{"filename": "xxxx","numberID": "7412"}, {"filename": "xgjhh","numberID": "E52"}]}, {"name": "jfkjgjkf","results": [{"filename": "hhhhh","numberID": "478962"}, {"filename": "jkhgfc","number": "12544"}]}]}

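What's happening here?

In the cleaned = ... line, we use a list comprehension to build a list of the lines in the file (f.readlines()) that do not contain the string |INFO|, adding the string -split_here- to the list wherever there is a blank line (where .strip() yields '').

Then we convert that list of lines into a single string (''.join()).

Finally, we convert that string back into a list (.split('-split_here-')) with one string per JSON structure, the structures being delimited by the blank lines in data.txt.

In the json_data = ... line, we use a list comprehension to append ', ' to each JSON structure.

Then we convert that list back into a single string and strip off the final ', ' (.join()[:-2]; [:-2] slices the last two characters off the string).

Then we wrap the string in '{"entries":[' and ']}' to make the whole thing a valid JSON structure, and pass it to json.loads() to load the data into a Python object.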
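
The same steps can also be spelled out one at a time, which may be easier to follow than the one-liner. This is only a sketch: the function name wrap_entries is hypothetical, it takes the file content as a list of lines, and it relies on the same assumptions as the explanation, namely that log lines contain |INFO| and that JSON structures are separated by blank lines.

```python
import json


def wrap_entries(lines, log_marker="|INFO|", split_marker="-split_here-"):
    # Step 1: drop log lines, strip the rest, turn blank lines into split markers
    pieces = []
    for line in lines:
        if log_marker in line:
            continue
        stripped = line.strip()
        pieces.append(stripped if stripped != "" else split_marker)

    # Step 2: join into one string, then split into one chunk per JSON structure
    chunks = "".join(pieces).split(split_marker)

    # Step 3: join the non-empty chunks with ", " and wrap them in a container
    wrapped = '{"entries":[' + ", ".join(c for c in chunks if c) + "]}"

    # Step 4: parse the wrapped text into a Python object
    return json.loads(wrapped)


lines = [
    "2019-01-18 21:00:05.4521|INFO|Technical|Batch Started|\n",
    "{\n",
    '"name": "1111",\n',
    '"results": []\n',
    "}\n",
    "\n",
    "2019-01-18 21:00:08.8740|INFO|Technical|Batch Started|\n",
    "{\n",
    '"name": "jfkjgjkf",\n',
    '"results": []\n',
    "}\n",
]

data = wrap_entries(lines)
print(data["entries"])
```

Joining with ", ".join(...) avoids the trailing separator, so no [:-2] slice is needed in this variant.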

Answer 5

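You can count the curly braces in the file to find where each JSON structure starts and ends, and store the parsed structures in a list, found_jsons.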

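import json

# Read the whole file; "data.txt" is the file name used in the question
with open("data.txt", "r", encoding="utf-8", errors="ignore") as f:
    content = f.read()

open_chars = 0
saved_content = []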

found_jsons = []

for i in content.splitlines():
    open_chars += i.count('{')

    if open_chars:
        saved_content.append(i)

    open_chars -= i.count('}')


    if open_chars == 0 and saved_content:
        found_jsons.append(json.loads('\n'.join(saved_content)))
        saved_content = []


for i in found_jsons:
    print(json.dumps(i, indent=4))

Output

{
    "results": [
        {
            "numberID": "7412",
            "filename": "xxxx"
        },
        {
            "numberID": "E52",
            "filename": "xgjhh"
        }
    ],
    "name": "1111"
}
{
    "results": [
        {
            "numberID": "478962",
            "filename": "hhhhh"
        },
        {
            "number": "12544",
            "filename": "jkhgfc"
        }
    ],
    "name": "jfkjgjkf"
}
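
The question also asks for the result to be saved as a CSV file, which the snippets in the answers stop short of. Here is a minimal sketch using the standard csv module. It assumes one row per entry in "results", with the parent "name" repeated on each row; the function name write_results_csv and the column layout are assumptions, not something the question specifies. Since the sample data uses both "numberID" and "number" as keys, the header is built from every key that occurs.

```python
import csv
import io


def write_results_csv(json_objects, out):
    """Write one CSV row per item in each object's "results" list."""
    rows = []
    fieldnames = ["name"]
    for obj in json_objects:
        for result in obj.get("results", []):
            row = {"name": obj.get("name")}
            row.update(result)
            rows.append(row)
            # Collect column names in first-seen order
            for key in result:
                if key not in fieldnames:
                    fieldnames.append(key)

    # restval fills in columns that a given row does not have
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)


# Stand-in for the list of parsed objects produced by any of the answers
found_jsons = [
    {"name": "1111", "results": [{"filename": "xxxx", "numberID": "7412"}]},
    {"name": "jfkjgjkf", "results": [{"filename": "jkhgfc", "number": "12544"}]},
]

buffer = io.StringIO()
write_results_csv(found_jsons, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with open("out.csv", "w", newline="", encoding="utf-8") in place of the StringIO buffer; "out.csv" is just a placeholder name.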

Parsing JSON structures in a txt file that contains both JSON and text structures
