I'm (very) new to Python and to programming in general.
I've been trying to use Scrapy to scrape data from several pages/sections of the same website.
My code works, but it's unreadable and impractical:
import scrapy

class SomeSpider(scrapy.Spider):
    name = 'some'
    allowed_domains = ['https://example.com']
    start_urls = [
        'https://example.com/Python/?k=books&p=1',
        'https://example.com/Python/?k=books&p=2',
        'https://example.com/Python/?k=books&p=3',
        'https://example.com/Python/?k=tutorials&p=1',
        'https://example.com/Python/?k=tutorials&p=2',
        'https://example.com/Python/?k=tutorials&p=3',
    ]
    def parse(self, response):
        response.selector.remove_namespaces()
        info1 = response.css("scrapedinfo1").extract()
        info2 = response.css("scrapedinfo2").extract()
        for item in zip(info1, info2):
            scraped_info = {
                'scrapedinfo1': item[0],
                'scrapedinfo2': item[1],
            }
            yield scraped_info
How can I improve this?
I'd like to search across a given set of categories and a certain range of pages. I need something like

categories = [books, tutorials, a, b, c, d, e, f]
in a range(1,3)

so that Scrapy can do its work across all the categories and pages, while staying easy to edit and adapt to other websites.
Any ideas are welcome.
Here is what I tried:
import itertools

categories = ["books", "tutorials"]
base = "https://example.com/Python/?k={category}&p={index}"

def url_generator():
    for category, index in itertools.product(categories, range(1, 4)):
        yield base.format(category=category, index=index)
But Scrapy returns:

[scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
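
The generator itself looks fine; the likely problem is that nothing ever calls url_generator(), so Scrapy has no start urls and schedules no requests. A minimal sketch of one possible fix, assuming the snippet above, is to materialize the generator into start_urls (the answers below use the more idiomatic start_requests() instead):

import itertools
import scrapy

categories = ["books", "tutorials"]
base = "https://example.com/Python/?k={category}&p={index}"

def url_generator():
    # one url per (category, page) combination: pages 1..3
    for category, index in itertools.product(categories, range(1, 4)):
        yield base.format(category=category, index=index)

class SomeSpider(scrapy.Spider):
    name = 'some'
    allowed_domains = ['example.com']
    # list() forces the generator, so Scrapy sees every url at startup
    start_urls = list(url_generator())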
Answer 0 (score: 1)
Solved thanks to start_requests() and yield scrapy.Request().
Here is the code:
import scrapy
import itertools

class SomeSpider(scrapy.Spider):
    name = 'somespider'
    allowed_domains = ['example.com']

    def start_requests(self):
        categories = ["books", "tutorials"]
        base = "https://example.com/Python/?k={category}&p={index}"
        for category, index in itertools.product(categories, range(1, 4)):
            yield scrapy.Request(base.format(category=category, index=index))

    def parse(self, response):
        response.selector.remove_namespaces()
        info1 = response.css("scrapedinfo1").extract()
        info2 = response.css("scrapedinfo2").extract()
        for item in zip(info1, info2):
            scraped_info = {
                'scrapedinfo1': item[0],
                'scrapedinfo2': item[1],
            }
            yield scraped_info
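
With the spider saved in a Scrapy project it can be run and exported in the usual way, e.g. scrapy crawl somespider -o output.csv; adapting it to another site only means editing categories, base and the two CSS selectors.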
Answer 1 (score: 0)
You can use the method start_requests() to generate the urls at startup, with yield Request(url).

BTW: later on, in parse(), you can also use yield Request(url) to add new urls.
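
For example, a minimal sketch of that idea (not part of the original answer; the li.next a selector is an assumption modeled on quotes.toscrape.com):

def parse(self, response):
    # ... extract items from the current page here ...
    # follow the "next" link, if there is one (the selector is an assumption)
    next_page = response.css('li.next a::attr(href)').extract_first()
    if next_page:
        yield response.follow(next_page, callback=self.parse)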
I use the portal toscrape.com, which was created for testing spiders.
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['quotes.toscrape.com']
    #start_urls = []

    tags = ['love', 'inspirational', 'life', 'humor', 'books', 'reading']
    pages = 3
    url_template = 'http://quotes.toscrape.com/tag/{}/page/{}'

    def start_requests(self):
        for tag in self.tags:
            # pages on the site are numbered from 1
            for page in range(1, self.pages + 1):
                url = self.url_template.format(tag, page)
                yield scrapy.Request(url)

    def parse(self, response):
        # test if the method was executed
        print('url:', response.url)
# --- run it without a project ---

from scrapy.crawler import CrawlerProcess

#c = CrawlerProcess({
#    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
#    'FEED_FORMAT': 'csv',
#    'FEED_URI': 'output.csv',
#})
c = CrawlerProcess()
c.crawl(MySpider)
c.start()
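
Note that c.start() blocks until the crawl finishes. Uncommenting the settings dict passed to CrawlerProcess sets a custom user agent and, via FEED_FORMAT/FEED_URI, writes the scraped items to output.csv.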