Question

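Apologies if this is a duplicate; I have been searching for about an hour and can't seem to find an answer. I have a text file full of URLs, and I want to check whether each one exists. I need some help understanding the error message, and whether there is a way to fix it or a different approach I could use.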

Here is my code

import requests

filepath = 'url.txt'  
with open(filepath) as fp:  
   url = fp.readline()
   count = 1
   while count != 677: #Runs through each line of my txt file
      print(url)
      request = requests.get(url) #Here is where im getting the error
      if request.status_code == 200:
          print('Web site exists')
      else:
        print('Web site does not exist')
      url = url.strip()
      count += 1

Here is the output

http://www.pastaia.co

Traceback (most recent call last):
File "python", line 9, in <module>
requests.exceptions.ConnectionError: 
HTTPConnectionPool(host='www.pastaia.co%0a', port=80): Max retries exceeded 
with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection 
object at 0x7fca82769e10>: Failed to establish a new connection: [Errno -2] 
Name or service not known',))

Answer 1

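I'll throw out some ideas to get you started; entire careers are built around spider development :) By the way, http://www.pastaia.co seems to be down. And that's a big part of the trick: how to handle the unexpected when crawling the web. Ready? Here we go...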

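import requests

filepath = 'url.txt'
with open(filepath) as fp:
    for url in fp:
        url = url.strip()  # drop the trailing newline kept when reading lines
        if not url:
            continue  # skip blank lines
        print(url)
        try:
            request = requests.get(url)  # this is the call that raised the error
            if request.status_code == 200:
                print('Web site exists')
            else:
                print('Web site does not exist')
        except requests.exceptions.RequestException:
            # DNS failure, refused connection, malformed URL, and so on
            print('Web site does not exist')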

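Make it a for loop; you just want to loop over the whole file, right?
Wrap the request in try and except, so that if it blows up for whatever reason (and there can be many: bad DNS, a non-200 response, maybe it's a .pdf page; the web is the wild west), the code won't crash, and you can move on to the next site in the list and just log the error however you like.
You can add other conditions there too; maybe the page needs to be a certain length? A response code of 200 doesn't always mean the page is valid, only that the site returned a success, but it's a good place to start.
Consider adding a user-agent to your request; you may want to mimic a browser, or have your program identify itself as super bot 9000.
If you want to go further into crawling and parsing the text, look at beautifulsoup: https://www.crummy.com/software/BeautifulSoup/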
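To make those points concrete, here is a minimal sketch of a per-URL check that strips the line, sends a User-Agent, sets a timeout, and treats any requests exception as "does not exist". The function name `check_url`, the `get` and `min_length` parameters, the 10 second timeout, and the User-Agent string are all assumptions made up for illustration; they are not from the original code.

```python
import requests

# Hypothetical User-Agent string, used here only for illustration.
HEADERS = {"User-Agent": "super-bot-9000/0.1"}


def check_url(url, get=requests.get, min_length=0):
    """Return (exists, detail) for a single URL.

    `get` is injectable so the function can be exercised without network
    access; `min_length` is an optional, assumed extra condition on the body.
    """
    url = url.strip()  # remove the newline left over from reading the file
    try:
        response = get(url, headers=HEADERS, timeout=10)
    except requests.exceptions.RequestException as exc:
        # Covers DNS failures, refused connections, timeouts, malformed URLs.
        return False, type(exc).__name__
    if response.status_code != 200:
        return False, "status {}".format(response.status_code)
    if len(response.text) < min_length:
        return False, "body too short"
    return True, "ok"
```

Inside the `for url in fp:` loop you would call `check_url(url)` and print or log the returned pair, so one bad site never stops the run.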

Answer 2

The site doesn't appear to be serving web traffic: http://www.pastaia.co

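The requests module's get() function is most likely trying to connect to the URL several times. Eventually it reaches its own internal retry limit, at which point it raises a ConnectionError exception.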

I would wrap that line in a try/except block to catch the error (which indicates that the website doesn't exist):

try:
    request = requests.get(url)
    if request.status_code == 200:
        print('Web site exists')
    else:
        print("Website returned response code: {code}".format(code=request.status_code))
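except requests.exceptions.ConnectionError:
    # requests raises its own ConnectionError, which is not the built-in one
    print('Web site does not exist')
    continue  # move on to the next URL in the enclosing loop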
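One more detail worth noting from the traceback in the question: the host is reported as www.pastaia.co%0a, and %0a is the percent-encoded form of a newline. That suggests the line read from the file still had its trailing newline attached when it was passed to get(). The sketch below shows the effect using only the standard library; the helper name `read_urls` and the second URL (http://example.com) are assumptions for illustration, and an in-memory `io.StringIO` stands in for url.txt.

```python
import io
from urllib.parse import quote


def read_urls(lines):
    """Yield each non-empty line with surrounding whitespace removed."""
    for line in lines:
        url = line.strip()
        if url:
            yield url


# In-memory stand-in for url.txt (contents are illustrative).
sample = io.StringIO("http://www.pastaia.co\n\nhttp://example.com\n")

first = sample.readline()
print(repr(first))                  # the '\n' is still part of the string
print(quote("www.pastaia.co\n"))    # the newline is percent-encoded as %0A

sample.seek(0)
print(list(read_urls(sample)))      # cleaned URLs, blank line skipped
```

Stripping each line before the request removes this particular failure; the site may of course still be unreachable for other reasons.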

Check whether a website exists using Python 3
