Tags: python web-scraping scrapy web-crawler


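I have been trying to follow this example for several days and still cannot get the expected output. I used the Scrapy tutorial, and even downloaded the exact project from the GitHub repo, but the output I get is not what the tutorial describes.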

from scrapy.spiders import Spider
from scrapy.selector import Selector

from dirbot.items import Website

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = [""]
    start_urls = [
    ]

    def parse(self, response):
        """
        The lines below are a spider contract. For more info see:

        @scrapes name
        """
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []

        for site in sites:
            item = Website()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re('-\s[^\n]*\\r')
            # Collect each item; without this the method returns an empty list.
            items.append(item)

        return items

After downloading the project from GitHub, I ran "scrapy crawl dmoz" in the top-level directory. I got the following output:

2016-08-31 00:08:19 [scrapy] INFO: Scrapy 1.1.1 started (bot: scrapybot)
2016-08-31 00:08:19 [scrapy] INFO: Overridden settings: {'DEFAULT_ITEM_CLASS': 'dirbot.items.Website', 'NEWSPIDER_MODULE': 'dirbot.spiders', 'SPIDER_MODULES': ['dirbot.spiders']}
2016-08-31 00:08:19 [scrapy] INFO: Enabled extensions:
2016-08-31 00:08:19 [scrapy] INFO: Enabled downloader middlewares:
2016-08-31 00:08:19 [scrapy] INFO: Enabled spider middlewares:
2016-08-31 00:08:19 [scrapy] INFO: Enabled item pipelines:
2016-08-31 00:08:19 [scrapy] INFO: Spider opened
2016-08-31 00:08:19 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-08-31 00:08:19 [scrapy] DEBUG: Telnet console listening on
2016-08-31 00:08:20 [scrapy] DEBUG: Crawled (200) <GET> (referer: None)
2016-08-31 00:08:20 [scrapy] DEBUG: Crawled (200) <GET> (referer: None)
2016-08-31 00:08:20 [scrapy] INFO: Closing spider (finished)
2016-08-31 00:08:20 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 514,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 16179,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 8, 31, 7, 8, 20, 314625),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2016, 8, 31, 7, 8, 19, 882944)}
2016-08-31 00:08:20 [scrapy] INFO: Spider closed (finished)

The tutorial, however, describes output like this:


[scrapy] DEBUG: Scraped from <200>
 {'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.]\n'],
  'link': [u''],
  'title': [u'Text Processing in Python']}
[scrapy] DEBUG: Scraped from <200>
 {'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
  'link': [u''],
  'title': [u'XML Processing with Python']}

3 answers:

Answer 0 (score: 2)


def parse(self, response):
    sites = response.xpath('//div[@class="title-and-desc"]/a')
    for site in sites:
        item = dict()
        item['name'] = site.xpath("text()").extract_first() 
        item['url'] = site.xpath("@href").extract_first() 
        item['description'] = site.xpath("following-sibling::div/text()").extract_first('').strip()
        yield item

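For future reference, you can always check whether a specific XPath works by using the scrapy shell command. For example, this is what I did to test this one: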

$ scrapy shell ""
# test sites xpath
# ok it doesn't work, check out page in web browser
# find correct xpath and test that:
# 21 result nodes printed
# it works!
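
When a live scrapy shell session is not convenient, the same kind of check can be run against a saved fragment of the page. Below is a minimal sketch using only the standard library's xml.etree.ElementTree, which supports a limited XPath subset (Scrapy selectors support far more). The SAMPLE markup is an assumption reconstructed from the XPaths used in the answers, not the real page, and the URL in it is a placeholder.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment shaped the way the answers' XPaths
# expect; the real page markup may differ.
SAMPLE = """
<html><body>
  <div class="title-and-desc">
    <a href="http://example.com/book"><div class="site-title">Example Book</div></a>
    <div class="site-descr ">An example description.</div>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# The XPath from the question: nothing in this markup matches it.
old_matches = root.findall('.//ul[@class="directory-url"]/li')

# The XPath from the first answer: matches the link element.
new_matches = root.findall('.//div[@class="title-and-desc"]/a')

print(len(old_matches), len(new_matches))
for link in new_matches:
    # Attribute value and the text of the nested title <div>.
    print(link.get("href"), link.findtext("div"))
```

A selector that returns an empty list here would also produce zero scraped items in the spider, which is consistent with the log in the question.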

Answer 1 (score: 1)

Here is a corrected version of the Scrapy code for extracting details from DMOZ:

import scrapy

class MozSpider(scrapy.Spider):
    name = "moz"
    allowed_domains = [""]
    start_urls = ['']

    def parse(self, response):
        sites = response.xpath('//div[@class="title-and-desc"]')
        for site in sites:
            name = site.xpath('a/div[@class="site-title"]/text()').extract_first()
            url = site.xpath('a/@href').extract_first()
            # Pass '' as the default so .strip() is never called on None.
            description = site.xpath('div[@class="site-descr "]/text()').extract_first('').strip()

            yield{'Name':name, 'URL':url, 'Description':description}

To export it as CSV, open the spider folder in a terminal/CMD and type:

scrapy crawl moz -o result.csv
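
Conceptually, `-o result.csv` writes each yielded dict as one row, with the dict keys as column headers. A rough standard-library sketch of that idea follows; it is not Scrapy's actual exporter, and the sample rows and the `items_to_csv` helper are made up for illustration.

```python
import csv
import io

# Hypothetical items, shaped like the dicts the spider above yields.
items = [
    {"Name": "Example Book", "URL": "http://example.com/book", "Description": "A sample."},
    {"Name": "Other Book", "URL": "http://example.com/other", "Description": "Another, with a comma."},
]


def items_to_csv(rows):
    """Serialize a list of dicts to CSV text; the header comes from the first row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = items_to_csv(items)
print(csv_text)
```

Fields that contain commas are quoted automatically, which is one reason to prefer a CSV writer over joining strings by hand.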


import scrapy

class YlpSpider(scrapy.Spider):
    name = "ylp"
    allowed_domains = [""]
    start_urls = ['']

    def parse(self, response):
        companies = response.xpath('//*[@class="info"]')

        for company in companies:
            name = company.xpath('h3/a/span[@itemprop="name"]/text()').extract_first()
            phone = company.xpath('div/div[@class="phones phone primary"]/text()').extract_first()
            website = company.xpath('div/div[@class="links"]/a/@href').extract_first()

            yield{'Name':name,'Phone':phone, 'Website':website}

To export it as CSV, open the spider folder in a terminal/CMD and type:

scrapy crawl ylp -o result.csv


import scrapy

class YlpSpider(scrapy.Spider):
    name = "yelp"
    allowed_domains = [""]
    start_urls = [',+CO']

    def parse(self, response):
        companies = response.xpath('//*[@class="biz-listing-large"]')

        for company in companies:
            name = company.xpath('.//span[@class="indexed-biz-name"]/a/span/text()').extract_first()
            address1 = company.xpath('.//address/text()').extract_first('').strip()
            address2 = company.xpath('.//address/text()[2]').extract_first('').strip()  # '' is the default returned when nothing matches, so .strip() is never called on None
            address = address1 + " - " + address2
            phone = company.xpath('.//*[@class="biz-phone"]/text()').extract_first('').strip()
            website = "" + company.xpath('.//@href').extract_first()

            yield{'Name':name, 'Address':address, 'Phone':phone, 'Website':website}
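
The reason for passing a default, as the comment above notes, is that a missing node otherwise yields None, and calling `.strip()` on None raises an error. A plain-Python sketch of that behavior follows; `first_or_default` is a hypothetical stand-in for `extract_first`, not Scrapy code, and the sample values are invented.

```python
def first_or_default(matches, default=None):
    """Return the first matched string, or `default` when nothing matched."""
    return matches[0] if matches else default


found = ["  123 Main St  "]   # a selector that matched one text node
missing = []                  # a selector that matched nothing

# With '' as the default, .strip() is safe in both cases.
print(repr(first_or_default(found, "").strip()))
print(repr(first_or_default(missing, "").strip()))

# Without a default, the missing case returns None and .strip() fails.
try:
    first_or_default(missing).strip()
except AttributeError as exc:
    error = str(exc)
    print("without a default:", error)
```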

To export it as CSV, open the spider folder in a terminal/CMD and type:

scrapy crawl yelp -o result.csv
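
The `website` line in the spider above builds a full link by concatenating a base URL with the extracted `href`. A hedged alternative is to resolve the link with `urljoin` from the standard library, which handles both relative and absolute links; inside a spider, `response.urljoin(href)` resolves a link in the same way, relative to the response's base URL. `BASE_URL` and the sample paths below are placeholders, not the addresses from the original answer.

```python
from urllib.parse import urljoin

# Placeholder page URL; the real one is not shown in the answer.
BASE_URL = "https://example.com/search?q=x"

# A site-relative href is resolved against the page's scheme and host.
relative = urljoin(BASE_URL, "/biz/some-place")

# An href that is already absolute is returned unchanged.
absolute = urljoin(BASE_URL, "https://other.example/page")

print(relative)
print(absolute)
```

Plain concatenation would produce a malformed URL in the second case, since the base would be prepended to an already complete address.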

Answer 2 (score: 0)

import scrapy

class BlogSpider(scrapy.Spider): 
    name = 'blogspider'
    start_urls = ['']

    def parse(self, response):
        for title in response.css('.post-header>h2'):
            yield {'title': title.css('a ::text').get()}
        for next_page in response.css(''):
            yield response.follow(next_page, self.parse)