I am trying to get my code to scrape HTML table information from a list of websites saved in a ShipURL.txt file. The code reads the web page addresses from ShipURL, then goes to each link, downloads the table data and saves it to a CSV. But my problem is that the program cannot finish, because the error "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond" occurs partway through and the program stops. As far as I understand, I need to increase the request timeout, use a proxy, or add a try statement. I have scanned through some answers about this same problem, but as a newbie I find them hard to understand. Any help would be greatly appreciated.
ShipURL.txt https://dl.dropboxusercontent.com/u/110612863/ShipURL.txt
# -*- coding: utf-8 -*-
import csv
import re
from urllib import urlopen

from bs4 import BeautifulSoup

fm = open('ShipURL.txt', 'r')
Shiplinks = fm.readlines()

for line in Shiplinks:
    website = re.findall(r'(https?://\S+)', line)
    website = "".join(str(x) for x in website)
    if website != "":
        with open('ShipData.csv', 'wb') as f:               # Creates an empty csv file to which we assign values
            writer = csv.writer(f)
            shipUrl = website
            shipPage = urlopen(shipUrl)
            soup = BeautifulSoup(shipPage, "html.parser")   # Reads the web page HTML
            table = soup.find_all("table", {"class": "table1"})  # Finds tables with class table1
            List = []
            columnRow = ""
            valueRow = ""
            Values = []
            for mytable in table:                           # Loops over tables with class table1
                table_body = mytable.find('tbody')          # Finds the tbody section in the table
                try:                                        # If tbody exists
                    rows = table_body.find_all('tr')        # Finds all rows
                    for tr in rows:                         # Loops over rows
                        cols = tr.find_all('td')            # Finds the columns
                        i = 1                               # Variable to track the first column of a row
                        for td in cols:                     # Loops over the columns
                            ## print td.text                # Displays the output
                            co = td.text                    # Saves the column to a variable
                            ## writer.writerow([co])        # Writes the variable to a CSV file row
                            if i == 1:                      # Checks the control variable; if it equals 1..
                                if td.text[-1] == ":":
                                    # Strips the colon and appends a comma
                                    columnRow += td.text.strip(":") + ","  # Thought: it might be simpler to build one string right away
                                List.append(td.text)        # ..takes the column value and appends it to a list called 'List' and..
                                i += 1                      # ..increments i by one
                            else:
                                # Strips the line breaks and appends a comma to the string
                                valueRow += td.text.strip("\n") + ","
                                Values.append(td.text)      # Takes the second column's value and appends it to a list called Values
                            # print List                    # Checking stuff
                            # print Values                  # Checking stuff
                except:
                    print "no tbody"
            # Print the headers and the values, separated by a blank line
            print columnRow.strip(",")
            print "\n"
            print valueRow.strip(",")
            # Encoding was causing trouble again
            # Writes the column headers as the first row and the values as the second
            writer.writerow([columnRow.encode('utf-8')])
            writer.writerow([valueRow.encode('utf-8')])
Answer 0 (score: 1)
I would wrap your urlopen call in a try/except, like this:
try:
    shipPage = urlopen(shipUrl)
except Exception as e:
    print e
At the very least this will help you figure out where the error is happening. Without the extra files it is hard to troubleshoot otherwise.
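For example, a minimal sketch of that idea (the fetch_page helper and the 10-second timeout are illustrative, not part of the original answer), catching the specific connection errors so the loop can log the failure, skip that URL and carry on:

import socket
import urllib2

def fetch_page(shipUrl):
    # Try to fetch one page; return None if the connection fails
    try:
        return urllib2.urlopen(shipUrl, timeout=10)   # explicit timeout instead of the OS default
    except (urllib2.URLError, socket.timeout) as e:
        print "failed to fetch %s: %s" % (shipUrl, e)
        return None

# In the question's main loop, skip URLs that could not be fetched:
# shipPage = fetch_page(shipUrl)
# if shipPage is None:
#     continue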
Answer 1 (score: 0)
Websites protect themselves against DDoS attacks by blocking repeated consecutive requests from a single IP.
You should put a sleep between every request, or after every 10, 20, or 50 requests.
Or you may need to access the site anonymously, through the Tor network or some other means.
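A minimal sketch of that throttling idea, applied to the question's loop over Shiplinks (the batch size of 10 and the 5-second pause are arbitrary example values):

import time

PAUSE_EVERY = 10     # pause after this many requests (arbitrary choice)
PAUSE_SECONDS = 5    # how long to sleep (arbitrary choice)

for count, line in enumerate(Shiplinks, 1):
    # ... fetch and parse the page as in the question ...
    if count % PAUSE_EVERY == 0:
        time.sleep(PAUSE_SECONDS)   # give the server a break so it does not block us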
Answer 2 (score: 0)
Found some great info at this link: How to retry after exception in python? This was basically my connection problem, so I decided to keep retrying until the request succeeds. It is working at the moment. Solved the problem with this code:
import urllib2

while True:
    try:
        shipPage = urllib2.urlopen(shipUrl, timeout=5)   # 5-second timeout per attempt
    except Exception as e:
        continue   # on any failure, try again
    break          # success, leave the loop
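One caveat worth noting: this loop retries forever if a site never responds. A bounded variant with a simple backoff (the retry limit and delays are illustrative choices, not from the original answer) gives up gracefully instead:

import time
import urllib2

MAX_RETRIES = 5                         # give up after this many attempts (illustrative)

shipPage = None
for attempt in range(MAX_RETRIES):
    try:
        shipPage = urllib2.urlopen(shipUrl, timeout=5)
        break                           # success, stop retrying
    except Exception:
        time.sleep(2 ** attempt)        # back off: 1, 2, 4, 8, 16 seconds
if shipPage is None:
    print "giving up on %s" % shipUrl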
But thanks to everyone here, you helped me understand the problem much better!