Download all the links on a webpage using Mechanize in Python

Time: 2014-07-01 23:51:27

Tags: python mechanize

I am trying to follow the thread below to answer my question. It is a nice example of how to use Mechanize to download all the links on a webpage:

Download all the links(related documents) on a webpage using Python

I followed the posted code, namely:

import mechanize
from time import sleep
#Make a Browser (think of this as chrome or firefox etc)
br = mechanize.Browser()

#visit http://stockrt.github.com/p/emulating-a-browser-in-python-with-mechanize/
#for more ways to set up your br browser object e.g. so it look like mozilla
#and if you need to fill out forms with passwords.

# Open your site
br.open('http://pypi.python.org/pypi/xlwt')

f=open("source.html","w")
f.write(br.response().read()) #can be helpful for debugging maybe

filetypes=[".zip",".exe",".tar.gz"] #you will need to do some kind of pattern matching on your files
myfiles=[]
for l in br.links(): #you can also iterate through br.forms() to print forms on the page!
    for t in filetypes:
        if t in str(l): #check if this link has the file extension we want (you may choose to use reg expressions or something)
            myfiles.append(l)


def downloadlink(l):
    f=open(l.text,"w") #perhaps you should ensure that file doesn't already exist.

    br.click_link(l)
    f.write(br.response().read())
    print l.text," has been downloaded"
    #br.back()

for l in myfiles:
    sleep(1) #throttle so you dont hammer the site
    downloadlink(l)

The only thing I changed was:

f=open(l.text,"w") #perhaps you should open in a better way & ensure that file doesn't already exist.

to:

f=open('C:\\l.text',"w") #perhaps you should open in a better way & ensure that file doesn't already exist.

That made the code work for me; otherwise it gave me an error. When I run the code, I get the following output:

Download> xlwt-0.7.5.tar.gz has been downloaded 
xlwt-0.7.5.tar.gz has been downloaded

So it worked. But I don't know where the file was downloaded to. Any ideas? I searched my C drive and could not find it.

If the code is run with:

f=open(l.text,"w")

it raises the following exception:

Traceback (most recent call last):
  File "C:\Python27\mech.py", line 33, in <module>
    downloadlink(l)
  File "C:\Python27\mech.py", line 25, in downloadlink
    f=open(l.text,"w") #perhaps you should ensure that file doesn't already exist.
IOError: [Errno 22] invalid mode ('w') or filename: 'Download> <span style="font-size: 75%">xlwt-0.7.5.tar.gz<span>'

1 answer:

Answer 0 (score: 2)

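The Python code you quote uses the text attribute of the link l (hence the expression l.text) as the filename. So (since each link should have a different value for its text attribute) the code should generate a number of files, one per link.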

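Your change replaces a variable expression (one that has a different value for each link) with a constant. So every download is written to the same file, named l.text, in the C:\ directory. When you look at that file, you should therefore see the content of the last link on the page.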
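The difference between the two versions can be seen without Mechanize at all. The sketch below uses a made-up list of link texts (the names in link_texts are purely illustrative, not taken from the page) to compare the filenames each version would open:

```python
# Hypothetical link texts standing in for l.text on three different links.
link_texts = ["a.zip", "b.exe", "c.tar.gz"]

# Original code: open(l.text, "w") uses a different name for every link.
variable_names = [text for text in link_texts]

# Modified code: 'C:\\l.text' is a string literal, so the name never changes.
constant_names = ['C:\\l.text' for _ in link_texts]

print(variable_names)       # three distinct filenames
print(set(constant_names))  # a single filename, overwritten on each download
```

Because the quoted string is never evaluated as an expression, each call to open() truncates and rewrites the same file.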

(By the way, I know it isn't your fault, but l is a very bad variable name, because it is easily confused with the digit 1.)

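The proper way to run this program is from inside an empty directory for which you have write permission (otherwise the individual files are hard to keep track of). If any of the filenames contain slashes, you will have to take special care either to create the necessary directory structure or to convert them into acceptable Windows filenames.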
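Slashes are not the only problem: the link text in the traceback contains characters such as > and <, which Windows does not allow in filenames either. One possible way to handle this is sketched below. The helper names safe_filename and target_path are hypothetical, and the replacement rules are an assumption about what is acceptable for your files, not something the original code does:

```python
import os
import re


def safe_filename(link_text, fallback="download"):
    """Hypothetical helper: reduce link text to a name Windows accepts."""
    # Remove anything that looks like an HTML tag, e.g. <span ...>.
    name = re.sub(r"<[^>]*>", "", link_text)
    # Replace the characters Windows forbids in file names with "_".
    name = re.sub(r'[<>:"/\\|?*]', "_", name)
    # Windows also rejects names that end in a space or a dot.
    name = name.strip().rstrip(". ")
    return name or fallback


def target_path(directory, link_text):
    """Build the full path for a download inside the chosen directory."""
    return os.path.join(directory, safe_filename(link_text))


bad_text = 'Download> <span style="font-size: 75%">xlwt-0.7.5.tar.gz<span>'
print(safe_filename(bad_text))
```

In downloadlink, the open call could then become something like f=open(target_path("downloads", l.text),"wb"), where "downloads" is an assumed directory that already exists. Opening in binary mode ("wb") is advisable on Windows, since archives are binary data and text mode would alter line-ending bytes.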

You may also want to replace the detection code (the check for wanted file extensions) with something more general.
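As one possible direction, the check could look at the end of the link's URL rather than searching the whole string form of the link. This is only a sketch: the function name wanted and the decision to ignore query strings and fragments are assumptions, not part of the quoted code.

```python
FILETYPES = (".zip", ".exe", ".tar.gz")


def wanted(url, filetypes=FILETYPES):
    """Hypothetical check: does the URL path end in one of the extensions?"""
    # Ignore any fragment (#...) and query string (?...) after the path.
    path = url.split("#", 1)[0].split("?", 1)[0]
    # str.endswith accepts a tuple and matches any of its members.
    return path.lower().endswith(tuple(filetypes))


print(wanted("http://example.com/files/xlwt-0.7.5.tar.gz"))
print(wanted("http://example.com/about.zip.html"))
```

With Mechanize, the nested loops could then be reduced to something like myfiles=[l for l in br.links() if wanted(l.url)], which also avoids appending the same link twice when more than one extension happens to match.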
