How can I make this script take "nmv-fas" from the link name, create a directory with that name, and put all the downloaded files in that directory?
all.html:
<a href="http://www.youversion.com/bible/gen.45.nmv-fas">http://www.youversion.com/bible/gen.45.nmv-fas</a>
<a href="http://www.youversion.com/bible/gen.46.nmv-fas">http://www.youversion.com/bible/gen.46.nmv-fas</a>
<a href="http://www.youversion.com/bible/gen.47.nmv-fas">http://www.youversion.com/bible/gen.47.nmv-fas</a>
<a href="http://www.youversion.com/bible/gen.48.nmv-fas">http://www.youversion.com/bible/gen.48.nmv-fas</a>
<a href="http://www.youversion.com/bible/gen.49.nmv-fas">http://www.youversion.com/bible/gen.49.nmv-fas</a>
<a href="http://www.youversion.com/bible/gen.50.nmv-fas">http://www.youversion.com/bible/gen.50.nmv-fas</a>
<a href="http://www.youversion.com/bible/exod.1.nmv-fas">http://www.youversion.com/bible/exod.1.nmv-fas</a>
<a href="http://www.youversion.com/bible/exod.2.nmv-fas">http://www.youversion.com/bible/exod.2.nmv-fas</a>
<a href="http://www.youversion.com/bible/exod.3.nmv-fas">http://www.youversion.com/bible/exod.3.nmv-fas</a>
The downloaded files should be saved in a folder named nmv-fas.
Python:
import lxml.html as html
import urllib
import urlparse
from BeautifulSoup import BeautifulSoup
import re
root = html.parse(open('all.html'))
for link in root.findall('//a'):
    url = link.get('href')
    # Use the last path component (e.g. "gen.45.nmv-fas") as the local filename.
    name = urlparse.urlparse(url).path.split('/')[-1]
    f = urllib.urlopen(url)
    s = f.read()
    f.close()
    soup = BeautifulSoup(s)
    articleTag = soup.html.body.article
    converted = str(articleTag)
    open(name, 'w').write(converted)
Answer 0 (score: 1)
You can use the lxml module to parse the links out of the file, then use urllib to download each one. Reading the links might look like this:
import lxml.html as html
root = html.parse(open('links.html'))
for link in root.findall('//a'):
    url = link.get('href')
You can use urllib.urlopen to download each link to a file:
import urllib
import urlparse
# extract the final path component and use it as
# the local filename.
name = urlparse.urlparse(url).path.split('/')[-1]
fd = urllib.urlopen(url)
open(name, 'w').write(fd.read())
Put these pieces together and you should have something close to what you want.
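For the directory part of the question, one possible approach is sketched below. It assumes the directory name is whatever follows the last dot in the final path component (so "gen.45.nmv-fas" gives "nmv-fas"). The helper names split_link and save_page are hypothetical, and the sketch is written for Python 3 (urllib.parse, os.makedirs with exist_ok) rather than the Python 2 modules used above.

```python
import os
from urllib.parse import urlparse


def split_link(url):
    """Return (directory, filename) for a link like .../bible/gen.45.nmv-fas."""
    # Last path component, e.g. "gen.45.nmv-fas".
    filename = urlparse(url).path.rstrip('/').split('/')[-1]
    # Assumption: the text after the final dot names the directory, e.g. "nmv-fas".
    directory = filename.rsplit('.', 1)[-1]
    return directory, filename


def save_page(url, content, base_dir='.'):
    """Write content (a str) to <base_dir>/<directory>/<filename>; return the path."""
    directory, filename = split_link(url)
    target_dir = os.path.join(base_dir, directory)
    # Create the directory on first use; do nothing if it already exists.
    os.makedirs(target_dir, exist_ok=True)
    path = os.path.join(target_dir, filename)
    with open(path, 'w', encoding='utf-8') as out:
        out.write(content)
    return path
```

In the question's loop, the final open(name, 'w').write(converted) line would then become save_page(url, converted). On Python 2, os.makedirs has no exist_ok argument, so the call would need an os.path.isdir check in front of it instead.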