I have written this script to download images from a subreddit.
But I am getting a FileNotFoundError.
# A script to download pictures from reddit.com/r/HistoryPorn
from urllib.request import urlopen
from urllib.request import urlretrieve
from bs4 import BeautifulSoup
import re
import os
import sys #TODO: sys.argv

print('Downloading images...')

# Create a directory for photographs
path_to_hist = '/home/tautvydas/Documents/histphoto'
os.chdir(path_to_hist)
if not os.path.exists('/home/tautvydas/Documents/histphoto'):
    os.mkdir(path_to_hist)

website = 'https://www.reddit.com/r/HistoryPorn'

# Go to the internet and connect to the subreddit, start a loop
for i in range(3):
    subreddit = urlopen(website)
    bs_subreddit = BeautifulSoup(subreddit, 'lxml')

    # Create a regex and find all the titles in the page
    remove_reddit_tag = re.compile('(\s*\(i.redd.it\)(\s*))')
    title_bs_subreddit = bs_subreddit.findAll('p', {'class': 'title'})

    # Get text off the page
    pic_name = []
    for item in title_bs_subreddit[1:]:
        item = item.get_text()
        item = remove_reddit_tag.sub('', item)
        pic_name.append(item)

    # Get picture links
    pic_bs_subreddit = bs_subreddit.findAll('div', {'data-url': re.compile('.*')})
    pic_img = []
    for pic in pic_bs_subreddit[1:]:
        pic_img.append(pic['data-url'])

    # Zip all info into one
    name_link = zip(pic_name, pic_img)
    for i in name_link:
        urlretrieve(i[1], i[0])

    # Click next
    for link in bs_subreddit.find('span', {'class': 'next-button'}).children:
        website = link['href']
What could be the problem? The links in "data-url" retrieve fine and work when clicked. Could the issue be that the name contains a hyperlink? Or that the name is too long? All the images before that one downloaded without any problem.
Answer 0 (score: 0)
The problem here is with the names you collect: some of them include the picture's source as a URL string, and the slashes in that URL are misinterpreted as folder separators when the name is used as a file path.
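The failure mode can be reproduced without touching the network: when the intended file name contains a slash, everything before the slash is treated as a directory (a minimal sketch; the sample title is made up):

```python
# A slash inside the intended file NAME is read as a path separator,
# so the write fails when that "directory" does not exist -- the same
# FileNotFoundError that urlretrieve surfaces in the script above.
bad_name = 'Sailors on deck https://i.redd.it/x.jpg (1944).jpg'
try:
    with open(bad_name, 'wb') as f:
        f.write(b'')
except FileNotFoundError as e:
    print('FileNotFoundError:', e)
```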
You need to clean the text to avoid these troublesome special characters, and perhaps shorten the names a little, but I would also suggest changing the pattern to make sure you parse only the title contained in the <a> tag, rather than the whole <p> that also holds the link.
Furthermore, instead of building the zip from two separate loops, you can create a main list by searching for the class thing (equivalent to findAll('div', {'data-url': re.compile('.*')})), and then run the relative queries on each block of this list to find its title and URL.
[...]
remove_reddit_tag = re.compile('(\s*\(i.redd.it\)(\s*))')

name_link = []
for block in bs_subreddit.findAll('div', {'class': 'thing'})[1:]:
    item = block.find('a', {'class': 'title'}).get_text()
    title = remove_reddit_tag.sub('', item)[:100]
    url = block.get('data-url')
    name_link.append((title, url))
    print(url, title)

for title, url in name_link:
    urlretrieve(url, title)
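Truncating to 100 characters alone does not remove slashes, so a title that still embeds a URL would fail the same way. One way to make the names safe before passing them to urlretrieve is a small sanitizing helper (a sketch; the sanitize_filename name and the sample title are illustrative, not part of the original script):

```python
import re

def sanitize_filename(title, max_len=100):
    # Replace path separators and other characters that commonly break
    # file names, then trim surrounding whitespace and cap the length.
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title)
    return cleaned.strip()[:max_len]

# Hypothetical title embedding a URL, like the ones that crashed
# the original script.
print(sanitize_filename('Soldiers resting https://i.redd.it/a.jpg (1917)'))
```

The resulting string contains no path separators, so urlretrieve writes it as a plain file in the current directory.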