Question

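I'm writing a scraper in Python using Mechanize and Beautiful Soup, and for some reason redirects aren't working. Here's my code (I apologize for naming my variables "stuff" and "thing"; I don't normally do that, trust me):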

stuff = soup.find('div', attrs={'class' : 'paging'}).ul.findAll('a', href=True)
for thing in stuff:
    pageUrl = thing['href']
    print pageUrl

    req = mechanize.Request(pageUrl)

    response = browser.open(req)

    searchPage = response.read()

    soup = BeautifulSoup(searchPage)
    soupString = soup.prettify()
    print soupString

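Anyway, products on Kraft's website that have more than one page of search results display a link to the next page. For example, the source code lists this as the next page for Kraft's line of steak sauces and marinades, which redirects to this.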

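thing['href'] holds the old link, because that is what the scraper pulls from the page; one would think that calling browser.open() on that link would make mechanize follow the redirect to the new link and return that as the response. However, running the code gives this result: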

http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=1&searchtext=a.1. steak sauces and marinades&pageno=2
Traceback (most recent call last):
File "C:\Development\eclipse\mobile development\Crawler\src\Kraft.py", line 58, in <module>
response = browser.open(req)
File "build\bdist.win-amd64\egg\mechanize\_mechanize.py", line 203, in open
File "build\bdist.win-amd64\egg\mechanize\_mechanize.py", line 255, in _mech_open
mechanize._response.httperror_seek_wrapper: HTTP Error 408: Request Time-out

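I get a time-out; I figure it's because, for some reason, mechanize is requesting the old URL and isn't being redirected to the new one (I also tried this with urllib2 and got the same result). What's going on here?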

Thanks for the help, and let me know if you need any more information.

Update: Alright, I enabled logging; now my code reads:

req = mechanize.Request(pageUrl)
print logging.INFO

And when I run it, I get this:

url argument is not a URI (contains illegal characters) u'http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=1&searchtext=a.1. steak sauces and marinades&pageno=2' 20
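Note that the trailing 20 in that output is just the numeric value of the logging.INFO constant, which is what the print statement emits. For comparison, here is a minimal sketch of actually switching on log output with the standard logging module. It assumes mechanize sends its messages to a logger named "mechanize"; treat that name as an assumption to check against your installed version.

```python
import logging
import sys

# Assumption: mechanize logs through the standard library under the
# logger name "mechanize". Verify this against your mechanize version.
logger = logging.getLogger("mechanize")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.INFO)

# logging.INFO itself is only a numeric level constant, so printing it
# shows a number rather than enabling anything.
print(logging.INFO)
```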

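Update 2 (which occurred while I was writing the first update): it turns out the problem was the spaces in my string! All I had to do was this: pageUrl = thing['href'].replace(' ', "+") and it works perfectly.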
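A slightly more general variant of that fix, as a sketch only: rather than replacing spaces alone, percent-encode any character that is illegal in the path or query while leaving the URL's delimiters untouched. The helper name escape_url is hypothetical, and the code is written for Python 3's urllib.parse (in Python 2 the same functions live in the urllib and urlparse modules). Whether the server treats %20 and + the same way in the query string is an assumption.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def escape_url(url):
    """Percent-encode illegal characters in the path and query of a URL.

    Hypothetical helper: delimiters such as '/', '=', '&' and existing
    '%' escapes are kept as they are, so an already-valid URL is unchanged.
    """
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


raw = ("http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx"
       "?catalogtype=1&brandid=1&searchtext=a.1. steak sauces and marinades&pageno=2")
escaped = escape_url(raw)
print(escaped)
```

The escaped string can then be passed to mechanize.Request() in place of the raw href.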

Answer 1

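By default, both the urllib2 and mechanize openers include handlers for redirect responses (you can check by looking at the handlers attribute), so I don't think the problem is that a redirect response isn't being followed correctly.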
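A quick way to do that check, sketched with urllib.request, the Python 3 counterpart of urllib2. The variable names are illustrative, and a mechanize browser object is assumed (not shown here) to expose a similar handlers list.

```python
import urllib.request

# Build an opener with the default handler set and list what it contains.
opener = urllib.request.build_opener()
handler_names = [type(handler).__name__ for handler in opener.handlers]
print(handler_names)

# The redirect handler should be among the defaults.
has_redirect_handler = "HTTPRedirectHandler" in handler_names
```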

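To troubleshoot the problem, you should capture the traffic in your web browser (in Firefox, Live HTTP Headers and HttpFox are useful for this) and compare it with the logs from your script (I'd recommend subclassing urllib2.BaseHandler to create your own handler that logs all the information you need for each request, and adding that handler to your opener object with the add_handler method).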
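A minimal sketch of such a logging handler, written against urllib.request (the Python 3 counterpart of urllib2). The class name LoggingHandler and the logger name "request_log" are hypothetical, and what you log per request is up to you. The live request is left commented out so the snippet runs without network access.

```python
import logging
import urllib.request

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("request_log")  # hypothetical logger name


class LoggingHandler(urllib.request.BaseHandler):
    """Log each outgoing request and each response the opener receives."""

    def http_request(self, request):
        log.info("request: %s %s", request.get_method(), request.get_full_url())
        for name, value in request.header_items():
            log.info("  header: %s: %s", name, value)
        return request  # request processors must return the request

    def http_response(self, request, response):
        log.info("response: %s for %s", response.code, request.get_full_url())
        return response  # response processors must return the response

    # Reuse the same hooks for HTTPS traffic.
    https_request = http_request
    https_response = http_response


opener = urllib.request.build_opener(LoggingHandler())
# For an opener that already exists: opener.add_handler(LoggingHandler())
# opener.open("http://www.example.com/")  # uncomment to log a real request
```

Comparing these log lines with the browser's captured headers should show where the script's request differs, for instance an unescaped URL.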

Python Mechanize won't properly handle redirects
