I have a huge log file that I need to process to extract the IP addresses (i.e., the first column).
I have tried two approaches:
- Reading line by line - simple.
- Threading - chunk the log file, 100,000 lines per thread.
Sample of the log file:
> 64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846
> 64.242.88.11 - - [07/Mar/2004:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523
> 64.242.88.12 - - [07/Mar/2004:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291
Approach 1:

    import time

    start_time = time.time()
    with open('access.log', 'r') as f:
        for line in f:
            print(line.split(' ')[0].strip())
    end_time = time.time()
    print("Execution Time: " + str(end_time - start_time))
Approach 2:

    import threading
    import time

    chunk_size = 100000
    output = set()          # result set
    thread_required = 0     # condition check required on EOF
    max_threads = 5
    current_chunk_size = 0  # start position in the log file
    file_name = 'access.log'

    def read_lines_chunk(file_name, start_pos, chunk_size):
        global thread_required
        with open(file_name, 'r') as f:
            lines = f.readlines()[start_pos:(start_pos + chunk_size)]
            if not lines:
                thread_required = 1
                print("no more threads needed")
                return
            for line in lines:
                ip = line.split(' ')[0].strip()
                output.add(ip)

    start_time = time.time()
    while thread_required != 1:
        for i in range(max_threads):
            t = threading.Thread(target=read_lines_chunk,
                                 args=(file_name, current_chunk_size, chunk_size - 1))
            current_chunk_size = current_chunk_size + chunk_size
            t.start()
            t.join()
    end_time = time.time()
    print("Execution Time: " + str(end_time - start_time))
When I run this code it takes a very long time. After some debugging I found that the line lines = f.readlines()[start_pos:(start_pos + chunk_size)] is the actual culprit: every thread reads the entire file into memory again just to slice out its own chunk. How can I fix this?
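For reference, here is a minimal sketch of one possible direction: splitting the file by byte offsets and having each worker seek() to its own range, so that no worker re-reads the whole file. The helper name ips_in_range, the NUM_WORKERS value, and the ThreadPoolExecutor usage are illustrative, not part of the code above.

    import os
    from concurrent.futures import ThreadPoolExecutor

    FILE_NAME = 'access.log'   # same file as above
    NUM_WORKERS = 5            # hypothetical worker count

    def ips_in_range(file_name, start, end):
        # Collect IPs from lines that begin inside [start, end).
        # Each worker seeks to its own byte offset instead of
        # calling readlines() on the whole file.
        ips = set()
        with open(file_name, 'rb') as f:
            if start > 0:
                f.seek(start - 1)
                f.readline()        # drop the partial line at the boundary (read by the previous worker)
            while f.tell() < end:
                line = f.readline()
                if not line:        # EOF
                    break
                ips.add(line.split(b' ', 1)[0].decode())
        return ips

    size = os.path.getsize(FILE_NAME)
    step = size // NUM_WORKERS + 1
    ranges = [(pos, min(pos + step, size)) for pos in range(0, size, step)]

    output = set()
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        for partial in pool.map(lambda r: ips_in_range(FILE_NAME, *r), ranges):
            output |= partial

Note that because of CPython's GIL the threads still will not parse lines in parallel, so the simple single-pass loop from approach 1 (or a multiprocessing version of the sketch) may well be just as fast; the sketch only removes the repeated full-file read.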