I have a simple Scrapy spider that exports all the pages in a domain to a single csv file. Most people recommend writing a separate spider for each site, but since the information I need is so simple, I think it makes sense to figure out how to iterate over a list of domains instead. Eventually there will be thousands of domains I want to pull links from, all with very different structures, so I want the spider to scale.
Here are a few of the lines the spider pulls from the csv:
Here is my most recent attempt:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from NONPROF.items import NonprofItem
from scrapy.http import Request
import pandas as pd

file_path = 'C:/listofdomains.csv'
open_list = pd.read_csv(file_path)
urlorgs = open_list.urls.tolist()

open_list2 = pd.read_csv(file_path)
domainorgs = open_list2.domain.tolist()


class Nonprof(CrawlSpider):
    name = "responselist"
    allowed_domains = domainorgs
    start_urls = urlorgs

    rules = [
        Rule(LinkExtractor(allow=['.*']),
             callback='parse_item',
             follow=True)
    ]

    def parse_item(self, response):
        item = NonprofItem()
        item['responseurl'] = response.url
        yield item
When I run the spider I see no obvious errors, but it doesn't seem to produce any results: [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
So I have a few questions: 1. Can Scrapy handle a request like this? 2. Is there a better way to have the spider iterate over a list of domains and match them to the start_urls?
Answer 0 (score: 0)
I think the main problem is that you are not using the start_requests method!
You can then also set a priority for each URL.
Here is some example code that works:
"""
For allowed_domains:
Let’s say your target url is https://www.example.com/1.html,
then add 'example.com' to the list.
"""
class crawler(CrawlSpider):
name = "crawler_name"
allowed_domains, urls_to_scrape = parse_urls()
rules = [
Rule(LinkExtractor(
allow=['.*']),
callback='parse_item',
follow=True)
]
def start_requests(self):
for i,url in enumerate(self.urls_to_scrape):
yield scrapy.Request(url=url.strip(),callback=self.parse_item, priority=i+1, meta={"pass_anydata_hare":1})
def parse_item(self, response):
response = response.css('logic')
yield {'link':str(response.url),'extracted data':[],"meta_data":'data you passed' }
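A side note on parse_urls(): the example above calls it without defining it. A minimal sketch of such a helper, assuming a csv like the one described in the question (a 'domain' column with bare domains and a 'urls' column with start URLs, read with pandas), could look like the following; the file path and column names are assumptions borrowed from the question, not part of the answer.

import pandas as pd

def parse_urls(file_path='C:/listofdomains.csv'):
    """Read the csv once and return (allowed_domains, urls_to_scrape).

    Assumes a 'domain' column holding bare domains (e.g. 'example.com',
    no scheme) and a 'urls' column holding the start URLs, matching the
    columns used in the question's code.
    """
    df = pd.read_csv(file_path)
    allowed_domains = df['domain'].dropna().astype(str).str.strip().tolist()
    urls_to_scrape = df['urls'].dropna().astype(str).str.strip().tolist()
    return allowed_domains, urls_to_scrape

As the comment at the top of the example notes, the entries in allowed_domains should be bare domains such as 'example.com' rather than full URLs.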
I suggest you read this page for more information:
https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spider.Spider.start_requests
Hope this helps :)