Making a large number of requests to one site in Python

Time: 2016-11-02 13:26:20

Tags: python sockets beautifulsoup urllib2

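I'm currently learning how requests work and what their limits are, and I've run into a problem. I'm using Python 3.4 (in PyCharm) with the urllib.request and BeautifulSoup libraries. I have a website that I'm testing against, and I have this function: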

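import re
import urllib.request
from bs4 import BeautifulSoup

def GetAll(url):
    # Send the request with a custom User-Agent header
    req = urllib.request.Request(url, headers={'User-Agent': "Magic Browser"})
    html_page = urllib.request.urlopen(req)
    soup = BeautifulSoup(html_page, "html.parser")
    # Find the <h1> whose "sub" attribute matches "subheader"
    header = soup.find('h1', attrs={'sub': re.compile("subheader")}).string
    return header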

It's just a simple function. I pass it a bunch of pages (URLs) in a loop:

index = 0
while index < len(pages):
    header = GetAll(pages[index])
    print(header)
    index += 1

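Some forums have 10~20 pages and others have 100~300 pages, with 50 topics per page. But when I come back at the end of the day to check whether the list has finished printing, sometimes it has printed 100 titles, sometimes 1,000, and sometimes as many as 5,000. By a quick calculation there are about 30K topics and I need it to print all of them, so it has to make 30K requests. Instead, I end up with this: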

  Traceback (most recent call last):
  File "C:\Users\Bar\AppData\Local\Programs\Python\Python34\lib\urllib\request.py", line 1183, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "C:\Users\Bar\AppData\Local\Programs\Python\Python34\lib\http\client.py", line 1137, in request
    self._send_request(method, url, body, headers)
  File "C:\Users\Bar\AppData\Local\Programs\Python\Python34\lib\http\client.py", line 1182, in _send_request
    self.endheaders(body)
  File "C:\Users\Bar\AppData\Local\Programs\Python\Python34\lib\http\client.py", line 1133, in endheaders
    self._send_output(message_body)
  File "C:\Users\Bar\AppData\Local\Programs\Python\Python34\lib\http\client.py", line 963, in _send_output
    self.send(msg)
  File "C:\Users\Bar\AppData\Local\Programs\Python\Python34\lib\http\client.py", line 898, in send
    self.connect()
  File "C:\Users\Bar\AppData\Local\Programs\Python\Python34\lib\http\client.py", line 871, in connect
    self.timeout, self.source_address)
  File "C:\Users\Bar\AppData\Local\Programs\Python\Python34\lib\socket.py", line 498, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "C:\Users\Bar\AppData\Local\Programs\Python\Python34\lib\socket.py", line 537, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 11004] getaddrinfo failed
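
The traceback shows the failure happening inside getaddrinfo, that is, while resolving the hostname, before any HTTP request is sent. One common way to survive a transient failure like this during a long run is to retry with a growing pause between attempts. The following is only a minimal sketch under that assumption, not a confirmed fix for this case: `fetch_with_retry`, its parameters, and the 10-second timeout are hypothetical names and values chosen for illustration.

```python
import socket
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, fetch=None, retries=3, delay=1.0, sleep=time.sleep):
    """Hypothetical helper: call fetch(url), retrying on network errors."""
    if fetch is None:
        # Default fetcher: the same request style as GetAll, plus a timeout
        def fetch(u):
            req = urllib.request.Request(u, headers={'User-Agent': "Magic Browser"})
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.read()

    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except (urllib.error.URLError, socket.gaierror, socket.timeout) as exc:
            last_error = exc
            # Wait longer after each failure, but not after the final one
            if attempt < retries - 1:
                sleep(delay * (2 ** attempt))
    # Every attempt failed: re-raise the last error for the caller to handle
    raise last_error
```

The bytes it returns could be passed to BeautifulSoup in place of `html_page`. If the site is rate limiting or the local resolver is overloaded, a short pause between successful requests may matter as much as the retries themselves.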

0 answers:

No answers