FileNotFoundError when scraping images

Asked: 2017-11-05 17:00:40

Tags: python-3.x web-scraping beautifulsoup

I have written this script to download images from a subreddit.

However, when I run it, I get a FileNotFoundError.

# A script to download pictures from reddit.com/r/HistoryPorn
from urllib.request import urlopen
from urllib.request import urlretrieve
from bs4 import BeautifulSoup
import re
import os
import sys #TODO: sys.argv

print('Downloading images...')

# Create a directory for photographs
path_to_hist = '/home/tautvydas/Documents/histphoto'
os.chdir(path_to_hist)
if not os.path.exists('/home/tautvydas/Documents/histphoto'):
    os.mkdir(path_to_hist)

website = 'https://www.reddit.com/r/HistoryPorn'

# Go to the internet and connect to the subreddit, start a loop
for i in range(3):
    subreddit = urlopen(website)
    bs_subreddit = BeautifulSoup(subreddit, 'lxml')

    # Create a regex and find all the titles in the page
    remove_reddit_tag = re.compile('(\s*\(i.redd.it\)(\s*))')
    title_bs_subreddit = bs_subreddit.findAll('p', {'class': 'title'})

    # Get text off the page
    pic_name = []
    for item in title_bs_subreddit[1:]:
        item = item.get_text()
        item = remove_reddit_tag.sub('', item)
        pic_name.append(item)

    # Get picture links
    pic_bs_subreddit = bs_subreddit.findAll('div', {'data-url' : re.compile('.*')})
    pic_img = []
    for pic in pic_bs_subreddit[1:]:
        pic_img.append(pic['data-url'])

    # Zip all info into one
    name_link = zip(pic_name, pic_img)
    for i in name_link:
        urlretrieve(i[1],i[0])


    # Click next
    for link in bs_subreddit.find('span', {'class' : 'next-button'}).children:
        website = link['href']

What could be the problem? The links taken from "data-url" retrieve fine and work when clicked. Could it be that the name contains a hyperlink? Or that the name is too long? All the other images before that one downloaded without any issue.

1 Answer:

Answer 0 (score: 0)

The problem here is related to the names you collect: they contain the picture's source as a URL string, and the slashes in that string are misinterpreted as folder separators when urlretrieve writes the file, hence the FileNotFoundError.
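
As a minimal illustration (the title and URL here are hypothetical, not taken from the post), a destination name that contains slashes makes urlretrieve try to write inside directories that do not exist:

from urllib.request import urlretrieve

# Hypothetical scraped name that still carries the picture's source URL.
bad_name = 'Historic photo, 1945 https://i.redd.it/abc123.jpg'

# The slashes in the embedded URL are read as directory separators, so this
# raises FileNotFoundError: no folder named 'Historic photo, 1945 https:' exists.
urlretrieve('https://i.redd.it/abc123.jpg', bad_name)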

You need to sanitize the text to get rid of these awkward special characters, and possibly make the names a bit shorter. But I would also suggest changing the pattern to make the result reliable: parse only the title contained in the <a> tag, not the whole <p> that also holds the link.
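
A minimal sketch of such a sanitizing step (the helper name safe_filename and the exact character set it strips are assumptions for illustration, not part of the original answer):

import re

def safe_filename(title, max_len=100):
    # Hypothetical helper: replace characters that are unsafe in file names
    # (most importantly '/', which the OS reads as a path separator) and
    # truncate overly long titles.
    cleaned = re.sub(r'[/\\:*?"<>|]', '_', title)
    return cleaned.strip()[:max_len]

print(safe_filename('Photo from 1945 https://i.redd.it/abc.jpg'))
# -> Photo from 1945 https___i.redd.it_abc.jpg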

Also, instead of building a zip from two separate loops, you can create a single master list by searching for the class thing (equivalent to findAll('div', {'data-url': re.compile('.*')})), and then run relative queries on each block of that list to find the title and the URL.

[...]
remove_reddit_tag = re.compile(r'(\s*\(i.redd.it\)(\s*))')

name_link = []
# Each div with class 'thing' wraps one submission; query title and URL inside it.
for block in bs_subreddit.findAll('div', {'class': 'thing'})[1:]:
    # Take only the text of the <a class="title"> tag, not the whole <p>.
    item = block.find('a', {'class': 'title'}).get_text()
    # Strip the '(i.redd.it)' domain tag and cap the name at 100 characters.
    title = remove_reddit_tag.sub('', item)[:100]

    url = block.get('data-url')
    name_link.append((title, url))
    print(url, title)

for title, url in name_link:
    urlretrieve(url, title)