I have some Python code that scrapes data from the UNESCO website. It works fine, but if there is an error fetching any page, the function that scrapes the data is called again to retry that page. Unfortunately, the page then gets scraped twice, and I can't figure out why.
The full code is available here, but the function causing the problem is the following:
country_code_list = [["AFG"], ["ALA"], ["DZA"], ["ALB"]]
countries = {"AFG": "Afghanistan", "ALA": "Aland Islands",
             "ALB": "Albania", "DZA": "Algeria"}
base_url = "http://www.unesco.org/xtrans/bsresult.aspx?lg=0&c="

def get_page(self, url, country, all_books, thread_no, sleep_time=0):
    time.sleep(sleep_time)
    try:
        target_page = urllib2.urlopen(url)
        if sleep_time != 0:
            print("Thread {0} successfully fetched {1}"
                  .format(self.thread_no, url))
    except Exception as error:
        print("Thread {0} Error getting {1} while processing {2}: "
              .format(thread_no, url, country), error)
        self.get_page(url, country, all_books, thread_no, (sleep_time + 1))
    page = BeautifulSoup(target_page, parse_only=only_restable)
    books = page.find_all('td', class_="res2")
    for book in books:
        all_books.append(Book(book, country))
    page.decompose()
    for title in all_books:
        title.export(country)
The only other code that interacts with this function is the code that walks through the pages, which is here, but I don't think that's where the problem is:
def build_list(self, code_list, countries, thread):
    '''Build the list of all the books, and return a list of Book objects
    in case you want to do something with them in something else, ever.'''
    for country in code_list:
        print('Thread {0} now processing {1} \n'
              .format(self.thread_no, countries[country]))
        results_total = self.get_total_results(country, base_url)
        with open(count_file, "a") as count_table:
            print(country + ": " + str(results_total), file=count_table)
        for page_num in range(0, results_total, 10):
            all_books = []
            url = base_url + country + "&fr=" + str(page_num)
            try:
                self.get_page(url, country, all_books, self.thread_no)
            except Exception as error:
                print("Thread {0} Error getting {1} while processing {2}: "
                      .format(self.thread_no, url, country), error)
                self.get_page(url, country, all_books, self.thread_no, 1)
    print("Thread {0} completed.".format(self.thread_no))
Answer (score: 1):

After your exception-handling code, add a return statement:
except Exception as error:
    print("Thread {0} Error getting {1} while processing {2}: "
          .format(thread_no, url, country), error)
    self.get_page(url, country, all_books, thread_no, (sleep_time + 1))
    return
Otherwise, once the recursive retry call returns, execution falls through to the parsing code below the try/except, so the failed page gets processed a second time: once inside the retry call, and once more in the original call.
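For reference, this is roughly what the whole function looks like with the fix in place. It is a sketch based only on the code in the question; names such as only_restable, Book, and the urllib2/BeautifulSoup setup come from the asker's full script and are assumed to be defined elsewhere:

def get_page(self, url, country, all_books, thread_no, sleep_time=0):
    time.sleep(sleep_time)
    try:
        target_page = urllib2.urlopen(url)
        if sleep_time != 0:
            print("Thread {0} successfully fetched {1}"
                  .format(self.thread_no, url))
    except Exception as error:
        print("Thread {0} Error getting {1} while processing {2}: "
              .format(thread_no, url, country), error)
        self.get_page(url, country, all_books, thread_no, (sleep_time + 1))
        # The recursive call above has already parsed and exported the page,
        # so this frame must stop here instead of falling through.
        return
    page = BeautifulSoup(target_page, parse_only=only_restable)
    books = page.find_all('td', class_="res2")
    for book in books:
        all_books.append(Book(book, country))
    page.decompose()
    for title in all_books:
        title.export(country)

If you also want to avoid unbounded recursion when a page keeps failing, an iterative retry loop is a common alternative. A minimal sketch under the same assumptions, with a hypothetical max_retries cap:

def get_page(self, url, country, all_books, thread_no, max_retries=5):
    for attempt in range(max_retries + 1):
        time.sleep(attempt)  # back off one second longer after each failure
        try:
            target_page = urllib2.urlopen(url)
            break  # success: fall through to the parsing code below
        except Exception as error:
            print("Thread {0} Error getting {1} while processing {2}: "
                  .format(thread_no, url, country), error)
    else:
        return  # every attempt failed; give up on this page
    page = BeautifulSoup(target_page, parse_only=only_restable)
    books = page.find_all('td', class_="res2")
    for book in books:
        all_books.append(Book(book, country))
    page.decompose()
    for title in all_books:
        title.export(country)

With the loop version there is exactly one stack frame per page, so there is no way for a retry to fall back into an outer call and process the page twice.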