Python web scraping [Error 10060]

Posted: 2015-12-16 15:04:01

Tags: python web-scraping runtime-error

I'm trying to get my code to scrape HTML table information from the web, working through a list of websites saved in the file ShipURL.txt. The code reads the web addresses from ShipURL, goes to each link, downloads the table data and saves it to a CSV. My problem is that the program can't finish, because midway through, the error "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond" occurs and the program stops. As far as I understand, I need to increase the request timeout, use a proxy, or wrap the call in a try statement. I've read through some answers to the same problem, but as a beginner I find them hard to follow. Any help would be greatly appreciated.

ShipURL.txt https://dl.dropboxusercontent.com/u/110612863/ShipURL.txt

# -*- coding: utf-8 -*-
import csv
import re
from urllib import urlopen
from bs4 import BeautifulSoup

fm = open('ShipURL.txt', 'r')                               # File with one ship page URL per line
Shiplinks = fm.readlines()
for line in Shiplinks:
    website = re.findall(r'(https?://\S+)', line)
    website = "".join(str(x) for x in website)
    if website == "":                                       # Skip lines that contain no URL
        continue

    # Note: opening with 'wb' would truncate the file for every URL, keeping
    # only the last ship's data; 'ab' appends each ship's rows instead.
    with open('ShipData.csv', 'ab') as f:
        writer = csv.writer(f)
        shipUrl = website
        shipPage = urlopen(shipUrl)

        soup = BeautifulSoup(shipPage, "html.parser")           #Read the web page HTML
        table = soup.find_all("table", { "class" : "table1" })  #Finds table with class table1
        List = []
        columnRow = ""
        valueRow = ""
        Values = []
        for mytable in table:                                   #Loops tables with class table1
            table_body = mytable.find('tbody')                  #Finds tbody section in table
            try:                                                #If tbody exists
                rows = table_body.find_all('tr')                #Finds all rows
                for tr in rows:                                 #Loops rows
                    cols = tr.find_all('td')                    #Finds the columns
                    i = 1                                       #Variable to control the lines
                    for td in cols:                             #Loops the columns
##                      print td.text                       # Displays the output
                        co = td.text                        # Saves the cell text to a variable
##                      writer.writerow([co])               # Writes the variable to a CSV file row
                        if i == 1:                              #Checks the control variable, if it equals to 1

                            if td.text[-1] == ":":
                                # Strip the colon and append a comma after the header
                                columnRow += td.text.strip(":") + ","   # might be simpler to build a single string right away
                                List.append(td.text)                #.. takes the column value and assigns it to a list called 'List' and..
                                i+=1                                #..Increments i by one

                        else:
                            # Strip the line breaks and append a comma to the string
                            valueRow += td.text.strip("\n") + ","
                            Values.append(td.text)              #Takes the second columns value and assigns it to a list called Values
                        #print List                             #Checking stuff
                        #print Values                           #Checking stuff


            except AttributeError:                          # Raised when the table has no tbody
                print "no tbody"
        # Print the headers and the values, with a blank line between them, too :)
        print columnRow.strip(",")
        print "\n"
        print valueRow.strip(",")
        # encoding was acting up again
        # Write the column headers as the first row and the values as the second
        writer.writerow([columnRow.encode('utf-8')])
        writer.writerow([valueRow.encode('utf-8')])

3 Answers:

Answer 0 (score: 1)

I would wrap your urlopen call in a try/except, like this:

try:
  shipPage = urlopen(shipUrl)
except IOError as e:    # urllib.urlopen raises IOError on network failures
  print e

At the very least that will help you pinpoint where the error is happening. Without the extra files it's hard to troubleshoot otherwise.

Python errors documentation
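
Applied to the loop in the question, a minimal sketch (reusing Shiplinks, the regex extraction, and urlopen from the original code) could log the failure and skip to the next ship instead of crashing:

import re
from urllib import urlopen

for line in Shiplinks:
    website = "".join(re.findall(r'(https?://\S+)', line))
    if website == "":
        continue
    try:
        shipPage = urlopen(website)
    except IOError as e:
        print "failed to fetch %s: %s" % (website, e)
        continue                    # skip this ship and move on to the next URL
    # ... parse shipPage with BeautifulSoup as before ...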

Answer 1 (score: 0)

Websites protect themselves from DDoS attacks by blocking repeated consecutive requests from a single IP.

You should add a sleep between requests, either after every request or after every 10, 20 or 50 of them.

Or you may need to access the site anonymously, through the Tor network or by some other means.
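
A minimal sketch of that idea, built on the question's code (the 2-second and 30-second delays are illustrative values, not taken from the answer):

import time

for n, line in enumerate(Shiplinks, start=1):       # Shiplinks as read in the question's code
    # ... extract the URL, fetch and parse the page as before ...
    time.sleep(2)                                   # short pause after every request
    if n % 20 == 0:
        time.sleep(30)                              # longer pause after every 20 requests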

Answer 2 (score: 0)

Found some great information at this link: How to retry after exception in python? This was basically my connection problem, so I decided to retry until it succeeds. It's working at the moment. Solved the problem with this code:

import urllib2

while True:
    try:
        shipPage = urllib2.urlopen(shipUrl, timeout=5)  # abandon a single attempt after 5 seconds
    except Exception:
        continue                                        # on failure, retry the same URL
    break                                               # success: leave the retry loop
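
One caveat: while True retries forever, so a permanently dead link will hang the script. A hedged variant with a retry cap (MAX_RETRIES is an illustrative name, not part of the original code):

import urllib2

MAX_RETRIES = 5                                     # arbitrary cap; tune as needed

shipPage = None
for attempt in range(MAX_RETRIES):
    try:
        shipPage = urllib2.urlopen(shipUrl, timeout=5)
        break                                       # fetched successfully, stop retrying
    except Exception as e:
        print "attempt %d failed: %s" % (attempt + 1, e)
if shipPage is None:
    print "giving up on %s" % shipUrl               # skip this URL after repeated failures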

But thank you all here, you helped me understand the problem much better!