Question

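I am new to Python and am using it for the first time to process pcap files. So far I have written a program that filters out the packets belonging to a particular IP and PROTOCOL and writes them to a new pcap file.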


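This code takes too long, because the list of IPs to match can contain 1000 addresses and the pcap files in the directory can be gigabytes in size. That is why multithreading needs to be introduced. For this I changed the code as below;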

from scapy.all import *
import os
import re
import glob

def process_pcap(path, hosts, ports):
    pktdump = PcapWriter("temp11.pcap", append=True, sync=True)
    count=0
    for pcap in glob.glob(os.path.join(path, '*.pcapng')):
        print "Reading file", pcap
        packets=rdpcap(pcap)
        for pkt in packets:
            if (TCP in pkt and (pkt[TCP].sport in ports or pkt[TCP].dport in ports)):
                if (pkt[IP].src in hosts or pkt[IP].dst in hosts):
                    count=count+1
                    print "Writing packets " , count
                    #wrpcap("temp.pcap", pkt)
                    pktdump.write(pkt)


path="\workspace\pcaps"
file_ip = open('ip_list.txt', 'r') #Text file with many ip address
o = file_ip.read()
hosts = re.findall( r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", o )
ports=[443] # Protocols to be added in filter
process_pcap(path, hosts, ports)

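But I think I am not doing it in the best way, because the run time has not decreased at all.

Any suggestions, please!

EDIT:

I have changed the code according to the response. My bad: it runs, but the threads do not terminate. None of the multithreading examples in Python require threads to be terminated explicitly. Please pinpoint the problem in this code;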

from scapy.all import *
import os
import re
import glob
import threading


def process_packet(pkt, pktdump, packets, ports):
    count = 0
    if (TCP in pkt and (pkt[TCP].sport in ports or pkt[TCP].dport in ports)):
        if (pkt[IP].src in hosts or pkt[IP].dst in hosts):
            count=count+1
            print "Writing packets " , count
            #wrpcap("temp.pcap", pkt)
            pktdump.write(pkt)


def process_pcap(path, hosts, ports):
    pktdump = PcapWriter("temp11.pcap", append=True, sync=True)
    ts=list()
    for pcap in glob.glob(os.path.join(path, '*.pcapng')):
        print "Reading file", pcap
        packets=rdpcap(pcap)
        for pkt in packets:
            # one thread is started per packet
            t=threading.Thread(target=process_packet,args=(pkt,pktdump, packets,ports,))
            ts.append(t)
            t.start()
    for t in ts:
        t.join()


path="\workspace\pcaps"
file_ip = open('ip_list.txt', 'r') #Text file with many ip address
o = file_ip.read()
hosts = re.findall( r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", o )
ports=[443] # Protocols to be added in filter
process_pcap(path, hosts, ports)

Here is the program written in response to the above; it never prints "Exiting Main Thread"

from scapy.all import *
import os
import re
import glob
import threading
import Queue
import multiprocessing

#global variables declaration

path="\pcaps"
pcapCounter = len(glob.glob1(path,"*.pcapng")) #size of the queue
q = Queue.Queue(pcapCounter) # queue to hold all pcaps in directory
pcap_lock = threading.Lock()
ports=[443] # Protocols to be added in filter


def safe_print(content):
    print "{0}\n".format(content),

def process_pcap (hosts):
    content = "Thread no ", threading.current_thread().name, " in action"
    safe_print(content)
    if not q.empty():
        with pcap_lock:
            content = "IN LOCK ", threading.current_thread().name
            safe_print(content)
            pcap=q.get()

        content = "OUT LOCK", threading.current_thread().name, " and reading packets from ", pcap
        safe_print(content)   
        packets=rdpcap(pcap)


        pktdump = PcapWriter(threading.current_thread().name+".pcapng", append=True, sync=True)
        pList=[]
        for pkt in packets:
            if (TCP in pkt and (pkt[TCP].sport in ports or pkt[TCP].dport in ports)):
                if (pkt[IP].src in hosts or pkt[IP].dst in hosts):
                    pList.append(pkt)

                    content="Writing Packets to pcap ", threading.current_thread().name
                    safe_print(content)
                    pktdump.write(pList) 


    else:
        content = "DONE!! QUEUE IS EMPTY", threading.current_thread().name
        safe_print(content)


for pcap in glob.glob(os.path.join(path, '*.pcapng')):
    q.put(pcap)

file_ip = open('ip_list.txt', 'r') #Text file with many ip addresses
o = file_ip.read()
hosts = re.findall( r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", o )
threads = []
cpu = multiprocessing.cpu_count() 
for i in range(cpu):
    t = threading.Thread(target=process_pcap, args=(hosts,), name = i)
    t.start()
    threads.append(t)

for t in threads:
    t.join()


print "Exiting Main Thread"

EDIT 2: I locked the queue before the length check, and everything worked fine.

Thanks.

Answer 1

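You are creating one thread per packet. That is the fundamental problem.

Also, you are doing an I/O step for every packet processed instead of writing a batch of packets.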
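
The per-packet I/O cost can be avoided by buffering matches and flushing them in groups. Below is a minimal Python 3 sketch of that idea. `ListSink` and `write_in_batches` are hypothetical names, and `ListSink` only stands in for a real writer; it assumes the real writer's `write` accepts a list of packets, as the `pktdump.write(pList)` call in the question suggests.

```python
class ListSink:
    """Hypothetical stand-in for a pcap writer; records each write call."""

    def __init__(self):
        self.calls = 0      # number of write() calls, i.e. I/O steps
        self.items = []     # everything written so far

    def write(self, batch):
        self.calls += 1
        self.items.extend(batch)


def write_in_batches(items, sink, batch_size=1000):
    """Send items to sink.write() in lists of at most batch_size."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            sink.write(batch)
            batch = []          # start a fresh list for the next batch
    if batch:                   # flush the final, partial batch
        sink.write(batch)


sink = ListSink()
write_in_batches(range(2500), sink, batch_size=1000)
print(sink.calls)  # 3 write calls instead of 2500
```

The batch size is a tuning knob: larger batches mean fewer writes but more packets held in memory at once.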

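You likely have somewhere between 1 and 10 cores on your PC. For the number of packets you are processing, the overhead of creating 1000+ threads outweighs the value of the parallelism each core provides. Diminishing returns set in very quickly once you have more running threads than available cores.

Here is a better approach that lets you realize the benefits of parallelism.

The main thread creates a global queue and lock to be shared by the subsequent threads. Before creating any threads, the main thread enumerates the list of *.pcapng files and puts each filename into the queue. It also reads the list of IP addresses to be used for filtering packets.

Then spawn N threads, where N is the number of cores on your device (N = os.cpu_count()).

Each thread enters the lock to pop the next file from the queue established by the main thread, then releases the lock. The thread then reads the file into the packets list and strips out the packets that are not needed. It then saves the result to a separate, unique file that represents the filtered results for the original input file. Ideally, the pktdump object supports writing multiple packets at a time, since bulk I/O operations save a lot of time.

After a thread has processed a single file, it re-enters the lock, pops the next file from the queue, releases the lock, and repeats the processing on the next file.

The thread exits when the queue of file names is empty.

The main thread waits for all N threads to complete. Now you have a whole set of K files that you want to combine. Your main thread just needs to re-open the K files created by the threads and concatenate each one back into a single output file.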
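
The steps above can be sketched roughly as follows in Python 3. `run_workers` and `process_file` are hypothetical names: `process_file` stands for whatever reads one pcap, filters it and writes the per-file output, and is assumed to return the name of the file it wrote. One deliberate deviation from the description: instead of an explicit lock, the sketch relies on `queue.Queue.get_nowait()`, which checks and pops in one thread-safe step, so a thread can never block on a queue that another thread has just emptied.

```python
import os
import queue
import threading


def run_workers(filenames, process_file, num_workers=None):
    """Call process_file(name) for every name using a fixed pool of threads.

    Returns the per-file results; their order is not guaranteed.
    """
    work = queue.Queue()
    for name in filenames:
        work.put(name)                      # queue is filled before any thread starts

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:                         # keep taking files until none are left
            try:
                name = work.get_nowait()    # atomic "check and pop"
            except queue.Empty:
                return                      # queue drained: the thread exits
            out = process_file(name)
            with results_lock:
                results.append(out)

    n = num_workers or os.cpu_count() or 1
    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                            # main thread waits for all workers
    return results


# Toy stand-in for the real per-file filter (an assumption for illustration).
outputs = run_workers(["a.pcapng", "b.pcapng"], lambda name: name + ".filtered")
print(sorted(outputs))  # ['a.pcapng.filtered', 'b.pcapng.filtered']
```

The returned list plays the role of the K per-file outputs that the main thread merges at the end.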

Answer 2

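This is how Python works with threads; read about the GIL. If you want true parallel execution, you should use multiprocessing.

Introduction of multithreading does not decrease the execution time of a Python program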
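
Below is a minimal Python 3 sketch of the multiprocessing route, assuming the work can be split into independent tasks (for example, one per pcap file). The names `count_matches` and `count_in_parallel` are hypothetical, and each packet is represented by a plain `(src, dst, sport, dport)` tuple as a stand-in for a parsed scapy packet; a real worker would read and filter its own file instead.

```python
import multiprocessing


def count_matches(task):
    """Worker run in a separate process.

    task is (records, hosts, ports); each record is a hypothetical
    (src, dst, sport, dport) tuple standing in for one packet.
    """
    records, hosts, ports = task
    return sum(
        1
        for src, dst, sport, dport in records
        if (sport in ports or dport in ports) and (src in hosts or dst in hosts)
    )


def count_in_parallel(batches, hosts, ports, processes=None):
    """Apply count_matches to every batch using a pool of worker processes."""
    hosts = frozenset(hosts)    # set lookup instead of scanning a long list
    ports = frozenset(ports)
    tasks = [(batch, hosts, ports) for batch in batches]
    # processes=None lets the pool use one worker per available CPU
    with multiprocessing.Pool(processes) as pool:
        return pool.map(count_matches, tasks)   # results keep the input order
```

The worker has to be a module-level function so it can be pickled and sent to the worker processes. On platforms that start workers by spawning a fresh interpreter (Windows, for instance), the call to `count_in_parallel` needs to sit under an `if __name__ == "__main__":` guard.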
