Question: How to search for keywords in multiple files in Python, including compressed .gz files and uncompressed files. I have multiple archived logs in one folder; the newest file is "messages", and the older logs are automatically compressed into .gz files.
-rw------- 1 root root 21262610 Nov  4 11:20 messages
-rw------- 1 root root  3047453 Nov  2 15:49 messages-20191102-1572680982.gz
-rw------- 1 root root  3018032 Nov  3 04:43 messages-20191103-1572727394.gz
-rw------- 1 root root  3026617 Nov  3 17:32 messages-20191103-1572773536.gz
-rw------- 1 root root  3044692 Nov  4 06:17 messages-20191104-1572819469.gz
I wrote the function below, but I don't think this approach is very smart, because the message logs are actually quite large and are split across multiple .gz files, and I keep many keywords in a keywords file. So is there a better solution that chains all of the files into a single I/O stream and then extracts the keywords from that stream?
import os
import gzip

def open_all_message_files(path):
    files_list = []
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.startswith("messages"):
                files_list.append(os.path.join(root, file))
    for x in files_list:
        if x.endswith('gz'):
            # gzip.open in mode "r" yields bytes, so the keywords are byte strings here
            with gzip.open(x, "r") as f:
                for line in f:
                    if b'keywords_1' in line:
                        print(line)
                    if b'keywords_2' in line:
                        print(line)
        else:
            with open(x, "r") as f:
                for line in f:
                    if 'keywords_1' in line:
                        print(line)
                    if 'keywords_2' in line:
                        print(line)
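For reference, this is roughly what loading the keywords file and checking each line against every keyword could look like; a minimal sketch, assuming the file holds one keyword per line (keywords.txt is only a placeholder name):

# Sketch only: "keywords.txt" is a placeholder for the keywords file,
# assumed to contain one keyword per line.
with open("keywords.txt") as kf:
    keywords = [line.strip() for line in kf if line.strip()]

def line_matches(line, keywords):
    # gzip.open(..., "r") yields bytes, so decode before comparing.
    if isinstance(line, bytes):
        line = line.decode("utf-8", errors="replace")
    return any(kw in line for kw in keywords)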
Answer 0 (score: 0)
This is my first answer on Stack Overflow, so please bear with me. I ran into a very similar problem, needing to analyze multiple logs, some of which were far too large to fit in memory at once. The way to solve this is to build a data-processing pipeline, similar to a Unix/Linux pipeline. The idea is to break each task into its own function and use generators, which gives a much more memory-efficient approach.
import os
import gzip
import re
import fnmatch

def find_files(pattern, path):
    """
    Here you can find all the filenames that match a specific pattern
    using shell wildcard pattern that way you avoid hardcoding
    the file pattern i.e 'messages'
    """
    for root, dirs, files in os.walk(path):
        for name in fnmatch.filter(files, pattern):
            yield os.path.join(root, name)

def file_opener(filenames):
    """
    Open a sequence of filenames one at a time
    and make sure to close the file once we are done
    scanning its content.
    """
    for filename in filenames:
        if filename.endswith('.gz'):
            f = gzip.open(filename, 'rt')
        else:
            f = open(filename, 'rt')
        yield f
        f.close()

def chain_generators(iterators):
    """
    Chain a sequence of iterators together
    """
    for it in iterators:
        # Look up yield from if you're unsure what it does
        yield from it

def grep(pattern, lines):
    """
    Look for a pattern in a line
    """
    pat = re.compile(pattern)
    for line in lines:
        if pat.search(line):
            yield line

# A simple way to use these functions together
logs = find_files('messages*', 'One/two/three')
files = file_opener(logs)
lines = chain_generators(files)
each_line = grep('keywords_1', lines)
for match in each_line:
    print(match)
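Since you keep many keywords in a keywords file, you can reuse grep() for all of them at once by joining the keywords into a single alternation pattern. A minimal sketch, assuming the keywords file contains one keyword per line (keywords.txt is just a placeholder name):

# Sketch: build one regex that matches any keyword from a keywords file.
# "keywords.txt" is a placeholder name; one keyword per line is assumed.
with open('keywords.txt') as kf:
    keywords = [line.strip() for line in kf if line.strip()]

# re.escape() keeps keywords containing regex metacharacters literal.
combined = '|'.join(re.escape(k) for k in keywords)

logs = find_files('messages*', 'One/two/three')
lines = chain_generators(file_opener(logs))
for match in grep(combined, lines):
    print(match)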
Let me know if you have any questions about my answer.