Processing a fat log file elegantly with Python

Date: 2019-07-02 10:45:28

Tags: python multithreading

I have a fat log file. I need to process the log file and extract the IP address (i.e. the first column).

I tried two approaches:

  1. Read line by line - simple.
  2. Threaded processing - chunk the log file, 100,000 lines per thread.

Sample log file:

> 64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables
> HTTP/1.1" 401 12846
> 64.242.88.11 - - [07/Mar/2004:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200
> 4523
> 64.242.88.12 - - [07/Mar/2004:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291
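The IP address is the first whitespace-delimited field of each line, so the extraction itself is a one-line split (the same expression both approaches below use):

```python
# one sample line from the log above
line = '64.242.88.12 - - [07/Mar/2004:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291'

# the first space-separated field is the client IP
ip = line.split(' ')[0].strip()
print(ip)  # 64.242.88.12
```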

Approach 1:

import time

start_time = time.time()
with open('access.log', 'r') as f:
    for line in f:
        print(line.split(' ')[0].strip())
end_time = time.time()
print("Execution Time: " + str(end_time - start_time))

Approach 2:

import threading
import time

chunk_size = 100000
output = set() # result set
thread_required = 0 # condition check required on EOF
max_threads = 5
current_chunk_size = 0  # start position in the log file

file_name = 'access.log'

def read_lines_chunk(file_name, start_pos, chunk_size):
    global thread_required
    with open(file_name, 'r') as f:
        lines = f.readlines()[start_pos:(start_pos + chunk_size)]
        if not lines:
            thread_required = 1
            print("no more threads needed")
            return
        for line in lines:
            ip = line.split(' ')[0].strip()
            output.add(ip)

start_time = time.time()
while ( thread_required != 1):
    for i in range(max_threads):
        t = threading.Thread(target=read_lines_chunk, args=(file_name,current_chunk_size,chunk_size-1))
        current_chunk_size = current_chunk_size + chunk_size
        t.start()
    t.join()
end_time = time.time()
print("Execution Time: " + str( end_time - start_time))

When I execute this code, it takes a very long time. After debugging, I found that the line `lines = f.readlines()[start_pos:(start_pos + chunk_size)]` is actually the culprit. How can I fix this?
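For reference, one way around the per-thread `readlines()` (which re-reads the entire file in every thread) is to read the file once in the main thread and hand each chunk of lines to a worker. This is only a sketch, not a posted answer; `extract_ips` and `unique_ips` are names I made up, and since `str.split` is CPU-bound, the GIL means threads may not actually be faster than Approach 1:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def extract_ips(lines):
    # parse one chunk; return a set so the main thread can merge results safely
    return {line.split(' ')[0].strip() for line in lines}

def unique_ips(file_name, chunk_size=100_000, max_workers=5):
    output = set()
    with open(file_name, 'r') as f, ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = []
        while True:
            # islice consumes the next chunk_size lines; each line is read exactly once
            chunk = list(islice(f, chunk_size))
            if not chunk:
                break
            futures.append(pool.submit(extract_ips, chunk))
        for fut in futures:
            output |= fut.result()
    return output
```

Unlike the `while` loop above, this terminates naturally at EOF and needs no `thread_required` flag or global state.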

0 Answers:

There are no answers yet.