Question

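I have several very large, not-quite-CSV log files.

Given the following conditions:

Value fields contain unescaped newlines and commas; almost anything can appear in a value field, including '='
Each valid line has an unknown number of valid value fields
A valid value looks like key=value, so a valid line looks like key1=value1, key2=value2, key3=value3, and so on.
Every valid line should start with eventId=<some number>,

What is the best way to read the file, split it into the correct lines, and then parse each line into the correct key-value pairs?

I tried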

file_name = 'file.txt'
read_file = open(file_name, 'r').read().split(',\neventId')

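This parses the first entry correctly, but every other entry then starts with =# instead of eventId=#. Is there a way to keep the delimiter and split on the valid newlines?

Also, speed is very important.

Sample data: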

eventId=123, key=value, key2=value2:
this, will, be, a problem,
maybe?=,
anotherkey=anothervalue,
eventId=1234, key1=value1, key2=value2, key3=value3,
eventId=12345, key1=
msg= {this is not a valid key value pair}, key=value, key21=value=,

Yes, the file really is that messy (sometimes). Each event here has 3 key-value pairs, but in reality every event has an unknown number of them.

Answer 1

If every valid line should start with eventId=, you can group the lines on that and find the valid pairs with a regular expression:

from itertools import groupby
import re

with open("test.txt") as f:
    r = re.compile(r"\w+=\w+")
    # Group consecutive lines by whether they start a new event.
    grps = groupby(f, key=lambda x: x.startswith("eventId="))
    # Use the first line of each event group and skip its eventId pair.
    d = dict(l.split("=") for k, v in grps if k
             for l in r.findall(next(v))[1:])
    print(d)
    # {'key3': 'value3', 'key2': 'value2', 'key1': 'value1', 'goodkey': 'goodvalue'}

If you want to keep the eventIds:

from itertools import groupby
import re

with open("test.txt") as f:
    r = re.compile(r"\w+=\w+")
    grps = groupby(f, key=lambda x: x.startswith("eventId="))
    # One list of key=value strings per event, eventId included.
    d = list(r.findall(next(v)) for k, v in grps if k)
    print(d)
    # [['eventId=123', 'goodkey=goodvalue', 'key2=somestuff'], ['eventId=1234', 'key1=value1', 'key2=value2', 'key3=value3']]

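It is not clear from your description what the output should be. If you want all valid key=value pairs, and the assumption that every valid line starts with eventId= is not accurate: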

from itertools import groupby
import re

def parse(fle):
    with open(fle) as f:
        r = re.compile(r"\w+=\w+")
        grps = groupby(f, key=lambda x: x.startswith("eventId="))
        for k, v in grps:
            if k:
                # Join the eventId line(s) with the continuation lines that follow;
                # the default covers a file that ends on an eventId line.
                sub = "".join(list(v) + list(next(grps, (None, []))[1]))
                yield from r.findall(sub)

print(list(parse("test.txt")))

Output:

['eventId=123', 'key=value', 'key2=value2', 'anotherkey=anothervalue',   
'eventId=1234', 'key1=value1', 'key2=value2', 'key3=value3', 
'eventId=12345', 'key=value', 'key21=value']

Answer 2

This problem is pretty crazy, but here is a solution that seems to work. Always use an existing library to output formatted data, kids.

import re

# Sample input, laid out like the question's sample data.
in_string = """eventId=123, goodkey=goodvalue, key2=somestuff:
this, will, be, a problem,
maybe?=,
anotherkey=anothervalue, gotit=see, the problem===s,
eventId=1234, key1=value1, key2=value2, key3=value3,
eventId=12345, key1=
msg= {this is not a valid key value pair}, validkey=validvalue,"""

# Find where each event starts, then slice the text between consecutive starts.
line_matches = list(re.finditer(r'(,\n)?eventId=\d', in_string))
lines = []
for i in range(len(line_matches)):
    match_start = line_matches[i].start()
    next_match_start = line_matches[i+1].start() if i < len(line_matches)-1 else len(in_string)-1
    line = in_string[match_start:next_match_start].lstrip(',\n')
    lines.append(line)

lineDicts = []
for line in lines:
    d = {}
    # Pad the line so the first key is also preceded by ', '.
    pad_line = ', '+line
    matches = list(re.finditer(r', [\w\d]+=', pad_line))
    for i in range(len(matches)):
        match = matches[i]
        key = match.group().lstrip(', ').rstrip('=')
        next_match_start = matches[i+1].start() if i < len(matches)-1 else len(pad_line)
        value = pad_line[match.end():next_match_start]
        d[key] = value
    lineDicts.append(d)

print(lineDicts)

Answer 3

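If your values really can contain anything, then there is no unambiguous way to parse the file. Any key=value pair could be part of the preceding value. Even an eventId=# pair on a new line could be part of a value from the previous line.

Now, if you assume that values never contain a valid key= substring, then you can probably do a "good enough" parse of the data despite the ambiguity. If you know the possible keys (or at least what constraints they have, such as being alphanumeric), it becomes much easier to guess what is a new key and what is just part of the previous value.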
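To illustrate the known-keys case, here is a minimal sketch. The KNOWN_KEYS list and the parse_record helper are hypothetical names made up for this example, and the sketch assumes a key only counts at the start of a record or directly after a comma and optional whitespace:

```python
import re

# Hypothetical whitelist; replace it with the keys your logs really use.
KNOWN_KEYS = ["eventId", "key", "key1", "key2", "key3", "key21", "anotherkey"]

# A key is only recognised at the start of the record or after ",<whitespace>".
KEY_RE = re.compile(r"(?:^|,\s*)(%s)=" % "|".join(map(re.escape, KNOWN_KEYS)))

def parse_record(record):
    """Split one log record into (key, value) pairs using the known keys."""
    matches = list(KEY_RE.finditer(record))
    pairs = []
    for i, m in enumerate(matches):
        # A value runs until the next recognised key, or to the end of the record.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(record)
        value = record[m.end():end].rstrip().rstrip(",")
        pairs.append((m.group(1), value))
    return pairs
```

Anything that looks like key= but is not in the whitelist (msg= or maybe?= in the sample) simply stays inside the surrounding value.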

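In any case, if we assume that every alphanumeric string followed by an equals sign is a key, we can parse with regular expressions. Unfortunately, there is no easy way to do this line by line, nor a way to capture all the key-value pairs in a single scan. However, it is not too hard to scan once to get the log lines (which may contain embedded newlines) and then extract the key=value, pairs from each one separately.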

import re

with open("my_log_file") as infile:
    text = infile.read()

# (?s) makes '.' match newlines, so a value may span several physical lines.
line_pattern = r'(?s)eventId=\d+,.*?(?:$|(?=\neventId=\d+))'
kv_pattern = r'(?s)(\w+)=(.*?),\s*(?:$|(?=\w+=))'
results = [re.findall(kv_pattern, line) for line in re.findall(line_pattern, text)]

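I'm assuming the file is small enough to fit in memory as a string. If you can't process the whole file at once, the problem becomes much more annoying to solve.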
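If the file does not fit in memory, one possible approach is to read it line by line and start a new record whenever a physical line begins with eventId=<digits>,. This is only a sketch under the same assumption as above, namely that such a line never occurs inside a value; iter_records is a hypothetical helper name, and the StringIO sample stands in for a real file object:

```python
import io
import re

RECORD_START = re.compile(r"eventId=\d+,")

def iter_records(lines):
    """Yield one (possibly multi-line) record at a time from an iterable of lines."""
    buf = []
    for line in lines:
        # A line starting with eventId=<digits>, closes the record collected so far.
        if RECORD_START.match(line) and buf:
            yield "".join(buf)
            buf = []
        buf.append(line)
    if buf:
        # Flush the last record; text before the first eventId line ends up in the first record.
        yield "".join(buf)

# Small in-memory stand-in for an open log file.
sample = io.StringIO("eventId=1, a=b,\ncontinued,\neventId=2, c=d,\n")
records = list(iter_records(sample))
```

Because an open file is itself an iterable of lines, passing the file object to iter_records keeps only one record in memory at a time, and each record can then be run through the key-value pattern.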

If we run this regex matching on the sample text, we get:

[[('eventId', '123'), ('key', 'value'), ('key2', 'value2:\nthis, will, be, a problem,\nmaybe?='), ('anotherkey', 'anothervalue')],
 [('eventId', '1234'), ('key1', 'value1'), ('key2', 'value2'), ('key3', 'value3')],
 [('eventId', '12345'), ('key1', '\nmsg= {this is not a valid key value pair}'), ('key', 'value'), ('key21', 'value=')]]

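Because of the question mark, maybe? is not treated as a key. msg and the final value are not treated as keys because no comma separates them from the preceding value.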
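If you would rather have one dictionary per event than a list of tuples, a small sketch of the conversion follows. The pairs literal is copied from the output shown above, and stripping whitespace from the values is an assumption about what you want:

```python
# Pairs for one record, copied from the output shown above.
pairs = [('eventId', '12345'),
         ('key1', '\nmsg= {this is not a valid key value pair}'),
         ('key', 'value'),
         ('key21', 'value=')]

# Strip surrounding whitespace from each value; a repeated key keeps its last value.
record = {key: value.strip() for key, value in pairs}
```

Note that a dictionary silently drops earlier values when a key repeats within one event, so keep the list of tuples if that can happen in your data.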

Answer 4

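Oh! This is an interesting problem. You need to handle each line, and each part of a line, separately rather than iterating over the file multiple times.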

data_dict = {}
with open('file.txt', 'r') as f:
    for line in f:
        line_list = line.split(',')
        # Only lines that start a new event are parsed.
        if 'eventId' in line_list[0]:
            for item in line_list:
                # Skip fragments without '=', such as the trailing newline.
                if '=' not in item:
                    continue
                pair = item.split('=', 1)
                data_dict.update({pair[0]: pair[1]})

That should do it. Enjoy!

If there are spaces in the 'pseudo-CSV', change the last line to:

data_dict.update({pair[0].strip(): pair[1].strip()})

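to strip the whitespace from the key and value strings.

P.S. If this answers your question, please click the check mark on the left to mark it as the accepted answer. Thanks!

P.P.S. A set of lines from the actual data is very helpful when writing something like this, to avoid error cases.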
