Error when scraping a website using threads

Time: 2018-01-13 13:30:58

Tags: python multithreading csv web-scraping python-multithreading

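I am trying to build a bot that scrapes the sales history of purchased domain names. So far I have been able to extract the domains from a CSV file and store them in a list (PS: there are 10k domains). The problem appears when I try to scrape the website. I tried this with two domains and it worked flawlessly. Does anyone know what this error is and how to fix it? Thank you very much in advance.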

My code:

datafile = open('/Users/.../Documents/Domains.csv', 'r')
myreader = csv.reader(datafile, delimiter=";",)
domains   = []
for row in myreader:
    domains.append(row[1])
del domains[0]
print("The Domains have been stored into a list")

nmb_sells_record = 0

def result_catcher(domains,queue):
    template_url = "https://namebio.com/{}".format(domain)
    get = requests.get(template_url)
    results = get.text
    last_sold =  results[results.index("last sold for ")+15:results.index(" on 2")].replace(",","")
    last_sold = int(last_sold)
    if not "No historical sales found." in results:
        sold_history = results[results.index("<span class=\"label label-success\">"):results.index(" USD</span> on <span class=\"label")]
    queue.put(results)

#domains = ["chosen.com","koalas.com"]
queues = {}
nmb=0
for x in range(len(domains)):
    new_queue = "queue{}".format(nmb)
    queues[new_queue] = queue.Queue()
    nmb += 1
count = 0
for domain in domains:
    for queue in queues: 
        t = threading.Thread(target=result_catcher, args=(domain,queues[queue]))
        t.start()
print("The Requests were all sent, now they are beeing analysed")   
for queue in queues:
    response_domain = queues[queue].get()
    nmb_sells_record = response_domain.count("for $") + response_domain.count("USD")


print("The Bot has recorded {} domain sells".format(nmb_sells_record))

Output of my code:

Exception in thread Thread-345:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connection.py", line 141, in _new_conn
    (self.host, self.port), self.timeout, **extra_kw)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/socket.py", line 743, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 8] nodename nor servname provided, or not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 346, in _make_request
    self._validate_conn(conn)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 850, in _validate_conn
    conn.connect()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connection.py", line 284, in connect
    conn = self._new_conn()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connection.py", line 150, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x115a55a20>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known

1 answer:

Answer 0: (score: 1)

From the Python docs

  

exception socket.gaierror: A subclass of OSError, this exception is raised for address-related errors by getaddrinfo() and getnameinfo().

     

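The accompanying value is a pair (error, string) representing an error returned by a library call. string represents the description of error, as returned by the gai_strerror() C function. The numeric error value will match one of the EAI_* constants defined in this module.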

gai => get address info
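To make the quoted description concrete, here is a minimal sketch of catching socket.gaierror and reading the (error, string) pair. The helper name resolve and its resolver parameter are hypothetical, introduced only so the lookup can be swapped out; they are not part of the socket module.

```python
import socket


def resolve(host, port=443, resolver=socket.getaddrinfo):
    """Return the sorted IP addresses for host, or None if the lookup fails.

    `resolve` and its `resolver` parameter are illustrative names, not a
    standard API; by default the real socket.getaddrinfo is used.
    """
    try:
        infos = resolver(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # The accompanying value is the (error, string) pair from the docs;
        # the number corresponds to one of the socket.EAI_* constants.
        print("could not resolve {}: [{}] {}".format(host, exc.errno, exc.strerror))
        return None
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address.
    return sorted({info[4][0] for info in infos})
```

Calling resolve("namebio.com") before starting any requests would show whether the name can be resolved at all from the machine running the bot.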

From the urllib3 wiki page

  

New exception: NewConnectionError, raised when we fail to establish a new connection, usually an ECONNREFUSED socket error.

Some possible causes of an ECONNREFUSED error are listed here, along with some command-line commands for probing addresses and ports.
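If you would rather probe from Python than from the command line, something along these lines might help tell a DNS failure apart from a refused connection. This is only a sketch; the function name probe and the status strings it returns are my own choices, not anything defined by socket or urllib3.

```python
import socket


def probe(host, port, timeout=3.0):
    """Try a plain TCP connection and return a short status string.

    `probe` and its return values are illustrative, not a standard API.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.gaierror:
        # Name resolution failed, as in the traceback in the question.
        return "dns-failure"
    except ConnectionRefusedError:
        # The host was reached but nothing accepted on that port (ECONNREFUSED).
        return "refused"
    except OSError:
        # Timeouts, unreachable networks and other socket-level errors.
        return "unreachable"
```

For example, probe("namebio.com", 443) returning "dns-failure" would point at name resolution rather than at the web server itself.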

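By the way, instead of reading all the rows into a list and then deleting the first item, which makes Python shift every other item over by one position, you can skip the header (?) more efficiently, like this: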

myreader = csv.reader(datafile, delimiter=";",)
next(myreader)  # <==== HERE: skip the header row

domains   = []

for row in myreader:
    domains.append(row[1])

next() will raise a StopIteration exception if there is no next line. If you want to prevent that, you can call next(myreader, None), which returns None if there is no next line.
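Putting the header skip together with the safe form of next(), a self-contained sketch could look like this. The function name read_domains and the sample data are made up for illustration; the column index 1 and the ";" delimiter are taken from the question's code.

```python
import csv
import io


def read_domains(csv_file, column=1, delimiter=";"):
    """Return the values of one column, skipping the header row.

    `read_domains` is an illustrative helper, not part of the csv module.
    """
    reader = csv.reader(csv_file, delimiter=delimiter)
    # Skip the header; the None default avoids StopIteration on an empty file.
    next(reader, None)
    # Ignore rows that are too short to contain the requested column.
    return [row[column] for row in reader if len(row) > column]


# Made-up sample data standing in for Domains.csv.
sample = io.StringIO("id;domain\n1;chosen.com\n2;koalas.com\n")
print(read_domains(sample))
```

With a real file you would pass the object returned by open(path, newline="") instead of the StringIO.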

Threading example:

import requests
import threading

resources = [
    "dfactory.com",
    "dog.com",
    "cat.com",
]

def result_catcher(resource):
    template_url = "https://namebio.com/{}".format(resource)
    get = requests.get(template_url)


threads = []

for resource in resources:
    t = threading.Thread(target=result_catcher, args=(resource,) )
    t.start()
    threads.append(t)

for thread in threads:
    thread.join()

print("All threads done executing.")

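By the way, there will be an optimal number of threads to start, and it is less than N. Create a thread pool, and when a thread finishes its task, it goes back and reads another resource path from the work queue. You will have to run some tests to determine the optimal number of threads. Creating 10,000 threads is not optimal. If you have four cores, then as few as 10 threads might be optimal.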
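One way to get such a pool without managing the work queue by hand is concurrent.futures.ThreadPoolExecutor from the standard library, which keeps a fixed number of threads and feeds them tasks as they become free. The sketch below rests on a few assumptions: the names fetch_page and scrape_all are hypothetical, the 10-second timeout is arbitrary, and max_workers=10 is only a starting point to tune by testing.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_page(domain):
    """Download the namebio.com page for one domain and return its HTML."""
    # Third-party package; imported here so the sketch loads without it.
    import requests

    response = requests.get("https://namebio.com/{}".format(domain), timeout=10)
    return response.text


def scrape_all(domains, worker=fetch_page, max_workers=10):
    """Run worker(domain) for every domain on a fixed-size thread pool.

    Returns (results, errors): two dicts keyed by domain.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per domain; at most max_workers of them run at a time.
        futures = {pool.submit(worker, domain): domain for domain in domains}
        for future in as_completed(futures):
            domain = futures[future]
            try:
                results[domain] = future.result()
            except Exception as exc:
                # A failed lookup or connection no longer kills the whole run.
                errors[domain] = exc
    return results, errors
```

Calling results, errors = scrape_all(domains) then yields one entry per domain in either results or errors, so failures such as the gaierror above can be inspected or retried afterwards.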