I am trying to get my code to scrape HTML table information from a list of websites saved in a ShipURL.txt file. The code reads the web page addresses from ShipURL, then goes to each link, downloads the table data and saves it to a CSV. But my problem is that the program cannot finish, because the error "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond" occurs partway through and the program stops. As far as I understand, I need to increase the request timeout, use a proxy, or add a try statement. I have scanned through some answers about this same problem, but as a newbie I find them hard to understand. Any help would be greatly appreciated.
ShipURL.txt https://dl.dropboxusercontent.com/u/110612863/ShipURL.txt
# -*- coding: utf-8 -*-
import csv
import re
from urllib import urlopen

from bs4 import BeautifulSoup

fm = open('ShipURL.txt', 'r')
Shiplinks = fm.readlines()

for line in Shiplinks:
    website = re.findall(r'(https?://\S+)', line)
    website = "".join(str(x) for x in website)
    if website != "":
        with open('ShipData.csv', 'wb') as f:               # Creates an empty csv file to which we assign values
            writer = csv.writer(f)
            shipUrl = website
            shipPage = urlopen(shipUrl)
            soup = BeautifulSoup(shipPage, "html.parser")   # Reads the web page HTML
            table = soup.find_all("table", {"class": "table1"})  # Finds tables with class table1
            List = []
            columnRow = ""
            valueRow = ""
            Values = []
            for mytable in table:                           # Loops over tables with class table1
                table_body = mytable.find('tbody')          # Finds the tbody section in the table
                try:                                        # If tbody exists
                    rows = table_body.find_all('tr')        # Finds all rows
                    for tr in rows:                         # Loops over rows
                        cols = tr.find_all('td')            # Finds the columns
                        i = 1                               # Variable to track the first column of a row
                        for td in cols:                     # Loops over the columns
                            ## print td.text                # Displays the output
                            co = td.text                    # Saves the column to a variable
                            ## writer.writerow([co])        # Writes the variable to a CSV file row
                            if i == 1:                      # Checks the control variable; if it equals 1..
                                if td.text[-1] == ":":
                                    # Strips the colon and appends a comma
                                    columnRow += td.text.strip(":") + ","  # Thought: it might be simpler to build one string right away
                                List.append(td.text)        # ..takes the column value and appends it to a list called 'List' and..
                                i += 1                      # ..increments i by one
                            else:
                                # Strips the line breaks and appends a comma to the string
                                valueRow += td.text.strip("\n") + ","
                                Values.append(td.text)      # Takes the second column's value and appends it to a list called Values
                            # print List                    # Checking stuff
                            # print Values                  # Checking stuff
                except:
                    print "no tbody"
            # Print the headers and the values, separated by a blank line
            print columnRow.strip(",")
            print "\n"
            print valueRow.strip(",")
            # Encoding was causing trouble again
            # Writes the column headers as the first row and the values as the second
            writer.writerow([columnRow.encode('utf-8')])
            writer.writerow([valueRow.encode('utf-8')])
Answer 0 (score: 1)
I would wrap your urlopen call in a try/except, like this:
try:
    shipPage = urlopen(shipUrl)
except Exception as e:
    print e
At the very least this will help you figure out where the error is happening. Without the extra files it is hard to troubleshoot otherwise.
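For example, a minimal sketch of that idea (the fetch_page helper and the 10-second timeout are illustrative, not part of the original answer), catching the specific connection errors so the loop can log the failure, skip that URL and carry on:

import socket
import urllib2

def fetch_page(shipUrl):
    # Try to fetch one page; return None if the connection fails
    try:
        return urllib2.urlopen(shipUrl, timeout=10)   # explicit timeout instead of the OS default
    except (urllib2.URLError, socket.timeout) as e:
        print "failed to fetch %s: %s" % (shipUrl, e)
        return None

# In the question's main loop, skip URLs that could not be fetched:
# shipPage = fetch_page(shipUrl)
# if shipPage is None:
#     continue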
Answer 1 (score: 0)
Websites protect themselves against DDoS attacks by blocking repeated consecutive requests from a single IP.
You should put a sleep between every request, or after every 10, 20, or 50 requests.
Or you may need to access the site anonymously, through the Tor network or some other means.
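A minimal sketch of that throttling idea, applied to the question's loop over Shiplinks (the batch size of 10 and the 5-second pause are arbitrary example values):

import time

PAUSE_EVERY = 10     # pause after this many requests (arbitrary choice)
PAUSE_SECONDS = 5    # how long to sleep (arbitrary choice)

for count, line in enumerate(Shiplinks, 1):
    # ... fetch and parse the page as in the question ...
    if count % PAUSE_EVERY == 0:
        time.sleep(PAUSE_SECONDS)   # give the server a break so it does not block us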
Answer 2 (score: 0)
Found some great info at this link: How to retry after exception in python? This was basically my connection problem, so I decided to keep retrying until the request succeeds. It is working at the moment. Solved the problem with this code:
import urllib2

while True:
    try:
        shipPage = urllib2.urlopen(shipUrl, timeout=5)   # 5-second timeout per attempt
    except Exception as e:
        continue   # on any failure, try again
    break          # success, leave the loop
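One caveat worth noting: this loop retries forever if a site never responds. A bounded variant with a simple backoff (the retry limit and delays are illustrative choices, not from the original answer) gives up gracefully instead:

import time
import urllib2

MAX_RETRIES = 5                         # give up after this many attempts (illustrative)

shipPage = None
for attempt in range(MAX_RETRIES):
    try:
        shipPage = urllib2.urlopen(shipUrl, timeout=5)
        break                           # success, stop retrying
    except Exception:
        time.sleep(2 ** attempt)        # back off: 1, 2, 4, 8, 16 seconds
if shipPage is None:
    print "giving up on %s" % shipUrl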
But thanks to everyone here, you helped me understand the problem much better!