This is my code. It looks correct to me, but it doesn't work. Please help.
# Imports inferred from the code below; ContentItems and the process_* helpers
# are assumed to live in this project's own modules.
from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse
from scrapy import Spider

HEADER_XPATH = ['//h1[@class="story-body__h1"]//text()']
AUTHOR_XPATH = ['//span[@class="byline__name"]//text()']
PUBDATE_XPATH = ['//div/@data-datetime']
TAGS_XPATH = ['']  # referenced as TAGS_XPATH in parse_page below
CATEGORY_XPATH = ['//span[@rev="news|source"]//text()']
TEXT = ['//div[@property="articleBody"]//p//text()']
INTERLINKS = ['//div[@class="story-body__link"]//p//a/@href']
DATE_FORMAT_STRING = '%Y-%m-%d'
class BBCSpider(Spider):
    name = "bbc"
    allowed_domains = ["bbc.com"]
    sitemap_urls = [
        'http://Www.bbc.com/news/sitemap/',
        'http://www.bbc.com/news/technology/',
        'http://www.bbc.com/news/science_and_environment/']

    def parse_page(self, response):
        items = []
        item = ContentItems()
        item['title'] = process_singular_item(self, response, HEADER_XPATH, single=True)
        item['resource'] = urlparse(response.url).hostname
        item['author'] = process_array_item(self, response, AUTHOR_XPATH, single=False)
        item['pubdate'] = process_date_item(self, response, PUBDATE_XPATH, DATE_FORMAT_STRING, single=True)
        item['tags'] = process_array_item(self, response, TAGS_XPATH, single=False)
        item['category'] = process_array_item(self, response, CATEGORY_XPATH, single=False)
        item['article_text'] = process_article_text(self, response, TEXT)
        item['external_links'] = process_external_links(self, response, INTERLINKS, single=False)
        item['link'] = response.url
        items.append(item)
        return items
Answer 0 (score: 0)
Your spider is badly structured, which is why almost none of it does anything. A scrapy.Spider needs a start_urls class attribute containing the list of URLs the spider uses to start crawling; every one of those URLs is called back to the class method parse, which means parse has to be defined. Your spider has a sitemap_urls class attribute that is not used anywhere, and it also has a parse_page class method that is never called anywhere.
So, in short, your spider should look something like this:
from scrapy import Request, Spider


class BBCSpider(Spider):
    name = "bbc"
    allowed_domains = ["bbc.com"]
    start_urls = [
        'http://Www.bbc.com/news/sitemap/',
        'http://www.bbc.com/news/technology/',
        'http://www.bbc.com/news/science_and_environment/']

    def parse(self, response):
        # This is a page that lists the articles
        article_urls = []  # find the article urls in the page
        for url in article_urls:
            yield Request(url, callback=self.parse_page)

    def parse_page(self, response):
        # This is an article page
        item = ContentItems()
        # populate the item here
        return item