I'm trying to download a bunch of PDF files from here using requests and beautifulsoup4. This is my code:
import requests
from bs4 import BeautifulSoup as bs

_ANO = '2013/'
_MES = '01/'
_MATERIAS = 'matematica/'
_CONTEXT = 'wp-content/uploads/' + _ANO + _MES
_URL = 'http://www.desconversa.com.br/' + _MATERIAS + _CONTEXT

r = requests.get(_URL)
soup = bs(r.text)

for i, link in enumerate(soup.findAll('a')):
    _FULLURL = _URL + link.get('href')
    for x in range(i):
        output = open('file[%d].pdf' % x, 'wb')
        output.write(_FULLURL.read())
        output.close()
I get AttributeError: 'str' object has no attribute 'read'.

OK, I see why (a URL string is not a file object), but... how do I download the files from the generated URLs?
Answer 0: (score: 8)
This writes every PDF on the page, under its original filename, into a pdfs/ directory.
import requests
from bs4 import BeautifulSoup as bs
import urllib2

_ANO = '2013/'
_MES = '01/'
_MATERIAS = 'matematica/'
_CONTEXT = 'wp-content/uploads/' + _ANO + _MES
_URL = 'http://www.desconversa.com.br/' + _MATERIAS + _CONTEXT

r = requests.get(_URL)
soup = bs(r.text)

# collect the URL and original filename of every PDF link on the page
urls = []
names = []
for i, link in enumerate(soup.findAll('a')):
    _FULLURL = _URL + link.get('href')
    if _FULLURL.endswith('.pdf'):
        urls.append(_FULLURL)
        names.append(soup.select('a')[i].attrs['href'])

# download each PDF into pdfs/ (the directory must already exist)
names_urls = zip(names, urls)
for name, url in names_urls:
    print url
    rq = urllib2.Request(url)
    res = urllib2.urlopen(rq)
    pdf = open("pdfs/" + name, 'wb')
    pdf.write(res.read())
    pdf.close()
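The code above is Python 2 (urllib2, print statement). As a rough Python 3, standard-library-only sketch of the same idea: scrape the anchor tags, keep only .pdf links, and save each one under its original filename. The helper names pdf_links and download_all are illustrative, not from the answer:

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class PdfLinkParser(HTMLParser):
    """Collect href values of <a> tags that point at .pdf files."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value and value.lower().endswith('.pdf'):
                    self.hrefs.append(value)


def pdf_links(base_url, html):
    """Return (filename, absolute URL) pairs for every PDF link in the page."""
    parser = PdfLinkParser()
    parser.feed(html)
    return [(os.path.basename(h), urljoin(base_url, h)) for h in parser.hrefs]


def download_all(base_url, out_dir='pdfs'):
    """Fetch the index page and save each linked PDF into out_dir."""
    os.makedirs(out_dir, exist_ok=True)  # create pdfs/ if it is missing
    page = urlopen(base_url).read().decode('utf-8', 'replace')
    for name, url in pdf_links(base_url, page):
        with urlopen(url) as res, open(os.path.join(out_dir, name), 'wb') as f:
            f.write(res.read())
```

Using urljoin also handles the case where the page's hrefs are already absolute URLs, which the simple `_URL + link.get('href')` concatenation in the question would break on.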
Answer 1: (score: 5)
It might be easier with wget, because then you have the full power of wget if you need it (user agent, follow links, ignore robots.txt, ...):
import os

names_urls = zip(names, urls)
for name, url in names_urls:
    print('Downloading %s' % url)
    os.system('wget %s' % url)
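One caveat with os.system: the URL is interpolated into a shell command, so any URL containing shell metacharacters can break or be misinterpreted. A sketch of the same wget call via subprocess with an argument list (no shell involved); wget_command is a hypothetical helper, and -P is wget's directory-prefix option:

```python
import subprocess


def wget_command(url, out_dir='pdfs'):
    """Build the wget argv; -P sets the directory downloads are saved into."""
    return ['wget', '-P', out_dir, url]


def fetch(urls, out_dir='pdfs'):
    """Download each URL with wget, one subprocess per file."""
    for url in urls:
        # list form (no shell) avoids quoting/injection issues with odd URLs
        subprocess.run(wget_command(url, out_dir), check=True)
```

check=True makes a failed download raise CalledProcessError instead of being silently ignored, which os.system would do unless you inspect its return value.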