Getting past a 404 with mechanize

Date: 2012-11-24 13:12:58

Tags: python csv beautifulsoup mechanize

I'm writing a Python script that reads a file of URLs, but I know that not all of them will work. I'm trying to figure out how to handle that and have the script move on to the next line of the file instead of raising the error I've posted below. I know I need some kind of if statement, but I can't figure it out.

from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import csv

me = open('C:\Python27\myfile.csv')
reader = csv.reader(me)
mech = Browser()

for url in me:
    response =  mech.open(url)
    html = response.read()
    soup = BeautifulSoup(html)
    table = soup.find("table", border=3)

for row in table.findAll('tr')[2:]:
    col = row.findAll('td')
    BusinessName = col[0].string
    Phone = col[1].string
    Address = col[2].string
    City = col[3].string
    State = col[4].string
    Zip = col[5].string
    Restaurantinfo = (BusinessName, Phone, Address, City, State)
    print "|".join(Restaurantinfo)

When I run that block of code, this error is raised:


httperror_seek_wrapper: HTTP Error 404: Not Found

Basically, what I'm asking is how to get Python to ignore the error and try the next URL.

1 Answer:

Answer 0 (score: 1)

If your file contains nothing but URLs, it would be simpler to write one URL per line and use code like this:

from mechanize import Browser
from BeautifulSoup import BeautifulSoup


me = open('C:\Python27\myfile.csv')
mech = Browser()

for url in me.readlines():
    ...
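
One practical detail: lines returned by readlines() keep their trailing newline, so it is safer to strip it before handing the URL to mech.open(), for example:

for url in me.readlines():
    response = mech.open(url.strip())  # strip() drops the trailing newline
    ...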

If you want to keep your code as it is, you have to use:

for url in reader:
    ...
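
To actually skip the URLs that come back as 404 instead of crashing, you can wrap the open call in a try/except. The sketch below is one way to do it, assuming the 404 surfaces as urllib2.HTTPError (mechanize's httperror_seek_wrapper is built on top of it) and that the URL sits in the first column of each CSV row:

import csv
import urllib2
from mechanize import Browser
from BeautifulSoup import BeautifulSoup

me = open('C:\Python27\myfile.csv')
reader = csv.reader(me)
mech = Browser()

for row in reader:
    url = row[0]                      # csv.reader yields lists, so take the first column
    try:
        response = mech.open(url)
    except urllib2.HTTPError:
        continue                      # 404 (or any other HTTP error): move on to the next URL
    html = response.read()
    soup = BeautifulSoup(html)
    # ... parse the table as in the question ...

Catching only HTTPError keeps genuine programming errors visible; a bare except would silently hide those as well.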