Extracting images from HTML files using the Python standard library

Time: 2017-06-09 20:57:46

Tags: python html python-3.x

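I'm trying to write a script that parses HTML files, finds all the images, and saves those images to another folder. How can I do this using only the libraries that ship with Python 3 when it is installed on a computer? I currently have the script below, and I'd like to extend it.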

import datetime
import os

import requests

# zendesk, credentials and language are assumed to be defined earlier in the script
date = datetime.date.today()
backup_path = os.path.join(str(date), language)
if not os.path.exists(backup_path):
    os.makedirs(backup_path)

log = []

endpoint = zendesk + '/api/v2/help_center/en-us/articles.json'
while endpoint:
    response = requests.get(endpoint, auth=credentials)
    if response.status_code != 200:
        print('Failed to retrieve articles with error {}'.format(response.status_code))
        exit()
    data = response.json()

    for article in data['articles']:
        if article['body'] is None:
            continue
        title = '<h1>' + article['title'] + '</h1>'
        filename = '{id}.html'.format(id=article['id'])
        with open(os.path.join(backup_path, filename), mode='w', encoding='utf-8') as f:
            f.write(title + '\n' + article['body'])

        print('{id} copied!'.format(id=article['id']))

        log.append((filename, article['title'], article['author_id']))

    # move on to the next page of results; the loop ends when there is none
    endpoint = data['next_page']

This is a script I found on the Zendesk forums; it basically backs up our articles on Zendesk.

1 answer:

Answer 0 (score: 2)

Try using Beautiful Soup (the third-party bs4 package) to retrieve all the img nodes, and urllib to fetch each one to get the images.

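import urllib.request

from bs4 import BeautifulSoup

# note: response.text is the raw HTML of a fetched page; in the question's
# script the article HTML is in article['body'] instead
soup = BeautifulSoup(response.text, "html.parser")

# get the src of every <img> tag that has one
img_source = [x["src"] for x in soup.find_all("img", src=True)]

# get the images (in Python 3, urlretrieve lives in urllib.request)
images = [urllib.request.urlretrieve(x) for x in img_source]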
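Since the question asks for the standard library only, the same src extraction can probably be done with html.parser instead of Beautiful Soup. This is a minimal sketch, not a drop-in replacement: the names ImgSrcParser and extract_img_sources are made up for illustration, and it assumes the HTML is available as a string (for example article['body'] in the question's script).

```python
from html.parser import HTMLParser


class ImgSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_img_sources(html):
    """Return the src values of all <img> tags in an HTML string (hypothetical helper)."""
    parser = ImgSrcParser()
    parser.feed(html)
    parser.close()
    return parser.sources
```

Inside the question's loop this could be called as extract_img_sources(article['body']) to get the list of image URLs for each article.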

You may need to add some error handling and adapt it slightly to fit your pages, but the idea stays the same.
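As a rough idea of what that error handling might look like, here is a sketch using only the standard library. The function names save_images and target_path, and the base_url and dest_dir parameters, are assumptions for illustration. It resolves relative src values against the page URL and skips downloads that fail; it does not handle authentication, and files with the same name overwrite each other.

```python
import os
import urllib.error
import urllib.parse
import urllib.request


def target_path(url, dest_dir):
    """Build a local file path from the last segment of the URL path."""
    name = os.path.basename(urllib.parse.urlsplit(url).path) or "image"
    return os.path.join(dest_dir, name)


def save_images(srcs, base_url, dest_dir):
    """Download every src into dest_dir and return the URLs that failed."""
    os.makedirs(dest_dir, exist_ok=True)
    failed = []
    for src in srcs:
        # resolve relative values such as "/img/a.png" against the page URL
        url = urllib.parse.urljoin(base_url, src)
        try:
            urllib.request.urlretrieve(url, target_path(url, dest_dir))
        except (urllib.error.URLError, ValueError, OSError) as exc:
            # report the failure and keep going with the remaining images
            print('Failed to download {}: {}'.format(url, exc))
            failed.append(url)
    return failed
```

Returning the failed URLs rather than raising lets a backup run finish and report what is missing at the end, which seems to fit the logging style of the original script.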