Question

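I'm trying to use Scrapy on a site whose URL structure I don't know.

I would like to:

Extract data only from pages that contain the XPath //div[@class="product-view"].
Write out (as CSV) the URL, the name, and the price, each selected with XPath.

When I run the following script, all I get is a random list of URLs: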

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request

DOMAIN = 'site.com'
URL = 'http://%s' % DOMAIN

class MySpider(BaseSpider):
    name = "dmoz"
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    def parse(self, response):
        for url in response.xpath('//a/@href').extract():
            if not ( url.startswith('http://') or url.startswith('https://') ):
                url= URL + url
            if response.xpath('//div[@class="product-view"]'):
                url = response.extract()
                name = response.xpath('//div[@class="product-name"]/h1/text()').extract()
                price = response.xpath('//span[@class="product_price_details"]/text()').extract()
            yield Request(url, callback=self.parse)
            print url


Answer 1

What you are looking for is scrapy.spiders.CrawlSpider.

However, you almost got it with your own approach. Here is the fixed version.

from scrapy import Request
from scrapy.linkextractors import LinkExtractor

# Replaces parse() inside the spider class from the question.
def parse(self, response):
    # parse this page
    if response.xpath('//div[@class="product-view"]'):
        item = dict()
        item['url'] = response.url
        item['name'] = response.xpath('//div[@class="product-name"]/h1/text()').extract_first()
        item['price'] = response.xpath('//span[@class="product_price_details"]/text()').extract_first()
        yield item  # return an item with your data
    # other pages
    le = LinkExtractor()  # LinkExtractor is smarter than the XPath '//a/@href'
    for link in le.extract_links(response):
        yield Request(link.url)  # default callback is already self.parse

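Now you can simply run scrapy crawl myspider -o results.csv and Scrapy will write your items to a CSV file. The name after crawl has to match the spider's name attribute (in the question's code that is "dmoz"). Keep an eye on the log, and especially on the stats summary at the end: that is how you know whether something went wrong.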

Conditional URL scraping with Scrapy
