What is wrong here?

Asked: 2016-10-12 15:31:07

Tags: python python-3.x scrapy web-crawler scrapy-spider

This is my code. It looks correct to me, but it does not work. Please help:

HEADER_XPATH = ['//h1[@class="story-body__h1"]//text()']    
AUTHOR_XPATH = ['//span[@class="byline__name"]//text()']   
PUBDATE_XPATH = ['//div/@data-datetime']  
WTAGS_XPATH = ['']   
CATEGORY_XPATH = ['//span[@rev="news|source""]//text()']    
TEXT = ['//div[@property="articleBody"]//p//text()']   
INTERLINKS = ['//div[@class="story-body__link"]//p//a/@href']  
DATE_FORMAT_STRING = '%Y-%m-%d'

class BBCSpider(Spider):
    name = "bbc"
    allowed_domains = ["bbc.com"]
    sitemap_urls = [
        'http://Www.bbc.com/news/sitemap/',
        'http://www.bbc.com/news/technology/',
        'http://www.bbc.com/news/science_and_environment/']

    def parse_page(self, response):
        items = []
        item = ContentItems()
        item['title'] = process_singular_item(self, response, HEADER_XPATH, single=True)
        item['resource'] = urlparse(response.url).hostname
        item['author'] = process_array_item(self, response, AUTHOR_XPATH, single=False)
        item['pubdate'] = process_date_item(self, response, PUBDATE_XPATH, DATE_FORMAT_STRING, single=True)
        item['tags'] = process_array_item(self, response, TAGS_XPATH, single=False)
        item['category'] = process_array_item(self, response, CATEGORY_XPATH, single=False)
        item['article_text'] = process_article_text(self, response, TEXT)
        item['external_links'] = process_external_links(self, response, INTERLINKS, single=False)
        item['link'] = response.url
        items.append(item)
        return items

1 Answer:

Answer 0 (score: 0)

Your spider is badly structured, which is why nothing happens. A scrapy.Spider needs a start_urls class attribute containing the list of URLs the spider will start crawling from; the response for each of those URLs is passed to the parse class method by default, which means parse must be defined.
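For reference, here is a minimal sketch of the structure scrapy.Spider expects (the spider name and URL are illustrative placeholders):

    import scrapy

    class MinimalSpider(scrapy.Spider):
        name = "minimal"
        # crawling starts from every URL listed here
        start_urls = ["http://example.com"]

        def parse(self, response):
            # default callback, called once per response from start_urls
            yield {"url": response.url}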

Your spider has a sitemap_urls class attribute that is never used anywhere (sitemap_urls is only honored by scrapy.spiders.SitemapSpider, not by the base Spider), and a parse_page class method that is never called.
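If crawling a sitemap was the intention, Scrapy ships a dedicated SitemapSpider that does use sitemap_urls. A minimal sketch, assuming http://www.bbc.com/sitemap.xml is a valid sitemap URL (unverified):

    from scrapy.spiders import SitemapSpider

    class BBCSitemapSpider(SitemapSpider):
        name = "bbc_sitemap"
        allowed_domains = ["bbc.com"]
        # SitemapSpider downloads and parses these sitemap files itself
        sitemap_urls = ["http://www.bbc.com/sitemap.xml"]

        def parse(self, response):
            # by default, every URL found in the sitemap is routed here
            yield {"url": response.url}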
So in short, your spider should look something like this:

from scrapy import Request, Spider

# ContentItems is assumed to be the item class defined in your project
class BBCSpider(Spider):
    name = "bbc"
    allowed_domains = ["bbc.com"]
    start_urls = [
        'http://www.bbc.com/news/sitemap/',
        'http://www.bbc.com/news/technology/',
        'http://www.bbc.com/news/science_and_environment/']

    def parse(self, response):
        # This is a page with links to all of the articles.
        # The XPath below is a placeholder; narrow it down to the
        # article links you actually want to follow.
        article_urls = response.xpath('//a/@href').extract()
        for url in article_urls:
            yield Request(response.urljoin(url), callback=self.parse_page)

    def parse_page(self, response):
        # This is an article page
        item = ContentItems()
        # populate the item fields here (e.g. with the XPath
        # constants from the question)
        yield item
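Once the spider is structured this way, it can be run with Scrapy's standard command-line tool (assuming a working Scrapy project with ContentItems defined):

    scrapy crawl bbc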