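Python socket.gethostbyname_ex() fails with multithreading

Asked: 2012-02-08 13:43:38

Tags: python multithreading hostname resolve

I wrote a script that is supposed to resolve multiple hostnames into IP addresses using multithreading.

However, it fails and freezes at some random point. How can this be solved?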

num_threads = 100
conn = pymysql.connect(host='xx.xx.xx.xx', unix_socket='/tmp/mysql.sock', user='user', passwd='pw', db='database')
cur = conn.cursor()
def mexec(befehl):
    cur = conn.cursor()
    cur.execute(befehl)

websites = ['facebook.com','facebook.org' ... ... ... ...] # 10,000 websites in the list
queue = Queue()
def getips(i, q):
    while True:
        #--resolve IP--
        try:
            result = socket.gethostbyname_ex(site)
            print(result)
            mexec("UPDATE sites2block SET ip='"+result+"', updated='yes' ") #puts site in mysqldb
        except (socket.gaierror):
            print("no ip")
            mexec("UPDATE sites2block SET ip='no ip', updated='yes',")
        q.task_done()
#Spawn thread pool
for i in range(num_threads):
    worker = Thread(target=getips, args=(i, queue))
    worker.setDaemon(True)
    worker.start()
#Place work in queue
for site in websites:
    queue.put(site)
#Wait until worker threads are done to exit
queue.join()

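3 Answers:

Answer 0 (score: 3)

You can use a sentinel value to tell the threads that there is no more work, and join the threads themselves instead of relying on queue.task_done() and queue.join():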

#!/usr/bin/env python
import socket
from queue import Queue  # Python 2: from Queue import Queue
from threading import Thread

def getips(queue):
    for site in iter(queue.get, None):
        try: # resolve hostname
            result = socket.gethostbyname_ex(site)
        except IOError as e:
            print("error %s reason: %s" % (site, e))
        else:
            print("done %s %s" % (site, result))

def main():
    websites = "youtube google non-existent.example facebook yahoo live".split()
    websites = [name+'.com' for name in websites]

    # Spawn thread pool
    queue = Queue()
    threads = [Thread(target=getips, args=(queue,)) for _ in range(20)]
    for t in threads:
        t.daemon = True
        t.start()

    # Place work in queue
    for site in websites: queue.put(site)
    # Put sentinel to signal the end
    for _ in threads: queue.put(None)
    # Wait for completion
    for t in threads: t.join()

main()

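The gethostbyname_ex() function is obsolete: it does not support IPv6 name resolution. To support both IPv4 and IPv6 addresses, you can use socket.getaddrinfo() instead.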
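As a rough illustration of that suggestion, here is a minimal sketch of a getaddrinfo()-based lookup. The helper name resolve_all is made up for this example and is not part of the socket module; it simply collects the address strings from whatever socket.getaddrinfo() returns.

```python
import socket


def resolve_all(host, family=socket.AF_UNSPEC):
    """Return the sorted, de-duplicated IPv4/IPv6 addresses for `host`.

    `resolve_all` is a hypothetical helper name used for illustration only.
    Raises socket.gaierror if the name cannot be resolved.
    """
    # SOCK_STREAM avoids getting one entry per socket type for each address.
    infos = socket.getaddrinfo(host, None, family, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})
```

Passing family=socket.AF_INET restricts the result to IPv4, which makes a later switch to dual-stack lookups a one-argument change.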

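Answer 1 (score: 1)

My first thought was that the errors are caused by DNS overload: perhaps your resolver does not allow more than a certain number of queries per unit of time.


Besides that, I spotted a few issues:

  1. You forgot to assign site properly inside the while loop, which would probably be better replaced by a for loop iterating over the queue. In your version, you use the site variable from the module-level namespace, which can lead to some queries being made twice while others are skipped.

    At this point you can check whether the queue still has entries or is still waiting for some. If neither is the case, you can exit your thread.

  2. For security reasons, you had better do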

    def mexec(befehl, args=None):
        cur = conn.cursor()
        cur.execute(befehl, args)
    

    and later do

    # result is (hostname, aliaslist, ipaddrlist); pass the IPs as a single parameter
    mexec("UPDATE sites2block SET ip=%s, updated='yes'", (",".join(result[2]),)) #puts site in mysqldb
    

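  3. To stay compatible with future protocols, you should use socket.getaddrinfo() instead of socket.gethostbyname_ex(site). It gives you all the IPs you want (at first you can restrict it to IPv4, but switching to IPv6 is then easier), and you could put all of them into the database.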


    For your queue, a code sample could be

    from queue import Empty  # Python 2: from Queue import Empty

    def queue_iterator(q):
        """Iterate over the contents of a queue. Waits for new elements as long as the queue is still filling."""
        while True:
            try:
                item = q.get(block=q.is_filling, timeout=.1)
                yield item
                q.task_done() # indicate that task is done.
            except Empty:
                # If q is still filling, continue.
                # If q is empty and not filling any longer, return.
                if not q.is_filling: return
    
    def getips(i, q):
        for site in queue_iterator(q):
            #--resolve IP--
            try:
                result = socket.gethostbyname_ex(site)
                print(result)
                # result is (hostname, aliaslist, ipaddrlist); pass the IPs as a single parameter
                mexec("UPDATE sites2block SET ip=%s, updated='yes'", (",".join(result[2]),)) #puts site in mysqldb
            except (socket.gaierror):
                print("no ip")
                mexec("UPDATE sites2block SET ip='no ip', updated='yes'")
    # Indicate it is filling.
    queue.is_filling = True
    #Spawn thread pool
    for i in range(num_threads):
        worker = Thread(target=getips, args=(i, queue))
        worker.setDaemon(True)
        worker.start()
    #Place work in queue
    for site in websites:
        queue.put(site)
    queue.is_filling = False # we are done filling, if q becomes empty, we are done.
    #Wait until worker threads are done to exit
    queue.join()
    

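    That should do it.


    Another problem is that you insert into MySQL in parallel. You can only run one MySQL query at a time on a shared connection. So you can either protect the access with a threading.Lock() or RLock(), or put the answers into another queue that is processed by a separate thread, which could even bundle them.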
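The first and last points above (each worker fetching its own site from the queue, and a single thread owning all database writes) could be combined roughly as follows. This is only a sketch under assumptions: resolve and store are placeholder callables supplied by the caller (for instance socket.gethostbyname and a function that runs the UPDATE), and the names resolver_worker, writer and resolve_and_store are invented for this example.

```python
import threading
from queue import Queue

STOP = None  # sentinel that tells a thread to exit


def resolver_worker(work, results, resolve):
    """Take sites from `work`, resolve them, and hand the results to the writer."""
    for site in iter(work.get, STOP):  # every iteration fetches this worker's own site
        try:
            ip = resolve(site)
        except OSError:  # socket.gaierror is a subclass of OSError on Python 3
            ip = "no ip"
        results.put((site, ip))


def writer(results, store):
    """Single consumer, so `store` is never called from two threads at once."""
    for row in iter(results.get, STOP):
        store(row)


def resolve_and_store(sites, resolve, store, num_threads=4):
    """Resolve `sites` in parallel and pass each (site, ip) pair to `store`."""
    work, results = Queue(), Queue()
    writer_thread = threading.Thread(target=writer, args=(results, store))
    writer_thread.start()
    workers = [
        threading.Thread(target=resolver_worker, args=(work, results, resolve))
        for _ in range(num_threads)
    ]
    for t in workers:
        t.start()
    for site in sites:
        work.put(site)
    for _ in workers:
        work.put(STOP)  # one sentinel per worker
    for t in workers:
        t.join()
    results.put(STOP)  # all results are queued; let the writer finish
    writer_thread.join()
```

Because only the writer thread ever calls store, no lock around the database connection is needed in this arrangement; the writer could also collect several rows and write them in one batch.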

Answer 2 (score: 0)

You might find it simpler to use concurrent.futures than to use threading, multiprocessing and Queue directly:

#!/usr/bin/env python3
import socket
# pip install futures on Python 2.x
from concurrent.futures import ThreadPoolExecutor as Executor

hosts = "youtube.com google.com facebook.com yahoo.com live.com".split()*100
with Executor(max_workers=20) as pool:
    for results in pool.map(socket.gethostbyname_ex, hosts, timeout=60):
        print(results)

Note: you can easily switch from using threads to using processes:

from concurrent.futures import ProcessPoolExecutor as Executor

You need this if gethostbyname_ex() is not thread-safe on your OS, which might be the case on OSX, for example.

If you want to handle exceptions that may be raised by gethostbyname_ex():

import concurrent.futures

with Executor(max_workers=20) as pool:
    future2host = dict((pool.submit(socket.gethostbyname_ex, h), h)
                       for h in hosts)
    for f in concurrent.futures.as_completed(future2host, timeout=60):
        e = f.exception()
        print(f.result() if e is None else "{0}: {1}".format(future2host[f], e))

This is similar to the example from the docs.