I wrote a simple script in Python.
It parses hyperlinks from a web page and then retrieves those links to parse some information.
I have similar scripts running that reuse the WRITEFUNCTION without any problems, but for some reason this one fails and I can't figure out why.
General Curl init:
import pycurl
import StringIO  # Python 2: the script uses the old StringIO module

storage = StringIO.StringIO()
c = pycurl.Curl()
c.setopt(pycurl.USERAGENT, USER_AGENT)
c.setopt(pycurl.COOKIEFILE, "")
c.setopt(pycurl.POST, 0)
c.setopt(pycurl.FOLLOWLOCATION, 1)
#Similar scripts work this way without problems, so why not this one?
c.setopt(c.WRITEFUNCTION, storage.write)
First, a call to retrieve the links:
URL = "http://whatever"
REFERER = URL
c.setopt(pycurl.URL, URL)
c.setopt(pycurl.REFERER, REFERER)
c.perform()
#Write page to file
content = storage.getvalue()
f = open("updates.html", "w")
f.writelines(content)
f.close()
... Here the magic happens and links are extracted ...
Now loop over these links:
for i, member in enumerate(urls):
    URL = urls[i]
    print "url:", URL
    c.setopt(pycurl.URL, URL)
    c.perform()
    #Write page to file
    #Still contains the data from the previous request!
    content = storage.getvalue()
    f = open("update.html", "w")
    f.writelines(content)
    f.close()
    #print content
... Gather some information ...
... Close objects etc ...
Answer 0 (score: 0)
If you want to download the URLs sequentially into different files (no concurrent connections):
for i, url in enumerate(urls):
    c.setopt(pycurl.URL, url)
    with open("output%d.html" % i, "w") as f:
        c.setopt(c.WRITEDATA, f)  # c.setopt(c.WRITEFUNCTION, f.write) also works
        c.perform()
Note:
storage.getvalue() returns everything written to storage since it was created; in your case, you should find the output of multiple URLs in it.
open(filename, "w") overwrites the file (the previous content is gone), i.e., update.html contains the content from the last iteration of the loop.
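For illustration, here is a minimal sketch (not part of the original answer) of how the single-buffer approach from the question could be kept working: give each URL a fresh StringIO buffer (or call truncate(0) and seek(0) on the existing one) before perform(), so getvalue() only holds the current response. The variable names (c, urls, storage) and the output%d.html naming follow the code above and are assumptions.

for i, url in enumerate(urls):
    storage = StringIO.StringIO()             # fresh buffer for this URL only
    c.setopt(c.WRITEFUNCTION, storage.write)  # rebind the write callback to the new buffer
    c.setopt(pycurl.URL, url)
    c.perform()
    with open("output%d.html" % i, "w") as f:
        f.write(storage.getvalue())           # contains only this URL's response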