I want to scan a website and download images. For example, given a site URL like a.example.com/2vZBkE.jpg, I need the bot to scan from a.example.com/aaaaaa.jpg through a.example.com/AAAAAA.jpg up to a.example.com/999999.jpg and, whenever an image exists, save the URL or download the image.
I have tried Python and Scrapy, but I am very new to it. This is as far as I have gotten:
import scrapy
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor
from example.items import ExampleItem

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://a.example.com/2vZBkE']

    # rules = [Rule(LinkExtractor(allow=['/.*']), 'parse_example')]
    rules = (
        Rule(LinkExtractor(allow=(r'\/%s\/.*',)), callback='parse_example'),
    )

    def parse_example(self, response):
        # Build an item from the page title and the first image source.
        image = ExampleItem()
        image['title'] = response.xpath(
            "//h5[@id='image-title']/text()").extract()
        rel = response.xpath("//img/@src").extract()
        image['image_urls'] = ['http:' + rel[0]]
        return image
I think I need to change this line:

rules = (
    Rule(LinkExtractor(allow=(r'\/%s\/.*',)), callback='parse_example'),
)

so that %s is somehow restricted to 6 characters and Scrapy tries the possible combinations. Any ideas?
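One way to attack this (a sketch of my own, not from the question): skip link extraction entirely and generate the candidate URLs in the spider's start_requests(), assuming the six characters are drawn from [a-zA-Z0-9]. The spider and callback names here are hypothetical:

from itertools import product
from string import ascii_letters, digits

import scrapy

class BruteForceSpider(scrapy.Spider):
    # Hypothetical name; not from the original question.
    name = 'bruteforce'
    allowed_domains = ['example.com']

    def start_requests(self):
        # 62 ** 6 combinations -- far too many to exhaust in practice;
        # this only illustrates the mechanics.
        for combo in product(ascii_letters + digits, repeat=6):
            url = 'http://a.example.com/{}.jpg'.format(''.join(combo))
            yield scrapy.Request(url, callback=self.parse_image)

    def parse_image(self, response):
        # Scrapy's HttpError middleware drops non-2xx responses by default,
        # so anything that reaches this callback is a hit.
        self.logger.info('Found image: %s', response.url)

Note that 62^6 is roughly 5.7 * 10^10 URLs, so you would want to cap or narrow the search space before running anything like this for real.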
Answer 0 (score: 0)
I don't know Scrapy, but you can use requests and itertools:
from string import ascii_letters, digits
from itertools import product

import requests

# You need to implement this function to download images;
# see http://stackoverflow.com/questions/13137817
def download_image(url):
    print(url)

def check_link(url):
    # A HEAD request is enough to test for existence without
    # fetching the image body.
    r = requests.head(url)
    return r.status_code == 200

# Generate all six-character combinations and check each URL
def generate_links():
    base_url = "http://a.example.com/{}.jpg"
    for combination in product(ascii_letters + digits, repeat=6):
        url = base_url.format("".join(combination))
        if check_link(url):
            download_image(url)
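The download_image stub is left to the reader in the answer; a minimal sketch using requests streaming (my addition, with the local filename derived from the URL's last path segment as an assumption):

import os

import requests

def download_image(url):
    # Name the file after the URL's last path segment (an assumption).
    filename = os.path.basename(url)
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        with open(filename, 'wb') as f:
            # Write the image to disk in 1 KB chunks.
            for chunk in r.iter_content(1024):
                f.write(chunk)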
Answer 1 (score: 0)
Extract links of the form href="a.example.com/123456.jpg" using the following regex:

="(\S+/[\w\d]{6}\.jpg)"