Here is an example: http://www.cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/

Ideally, the output would be a neatly scraped and extracted array of data with the fields below, taken from each individual company detail page. I am a complete beginner, and I would like to know how to extract those links automatically; in the code below I feed them in manually. Can anyone help me?
import scrapy
from spy.items import SpyItem
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.linkextractors import LinkExtractor


class ProjectSpider(CrawlSpider):
    name = "project"
    allowed_domains = ["cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/"]
    start_urls = [...]  # the 100 links, pasted in here manually

    def parse(self, response):
        item = SpyItem()
        item['title'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[1]/strong/text()').extract()
        item['Business'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[4]/text()').extract()
        item['website'] = response.xpath('//p[5]/a/text()').extract()
        item['Ranking'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[2]/text()[1]').extract()
        item['HQ'] = response.css('p:nth-child(12)::text').extract()
        item['Revenue2015'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[7]/text()').extract()
        item['Revenue2014'] = response.css('p:nth-child(10)::text').extract()
        item['YearFounded'] = response.xpath('//p[11]/text()').extract().encode('utf-8')
        item['Employees'] = response.xpath('//article/div[3]/p[12]/text()').extract()
        item['FamilyOwned'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[13]/text()').extract()
        yield item
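For completeness, SpyItem is defined in spy/items.py; a minimal definition matching the fields used above would look like this (a sketch inferred from the assignments in parse, not the actual file):

import scrapy


class SpyItem(scrapy.Item):
    title = scrapy.Field()
    Business = scrapy.Field()
    website = scrapy.Field()
    Ranking = scrapy.Field()
    HQ = scrapy.Field()
    Revenue2015 = scrapy.Field()
    Revenue2014 = scrapy.Field()
    YearFounded = scrapy.Field()
    Employees = scrapy.Field()
    FamilyOwned = scrapy.Field()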
Answer 0 (score: -1)

There are at least two problems with your code. First, allowed_domains must contain a plain domain name (here: cincinnati.com) and nothing more. Second, a CrawlSpider is designed to be used together with rules, and you have not defined any.
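For reference, a rule-based CrawlSpider looks roughly like this (a minimal sketch, not tested against this site; the allow pattern and the callback name are assumptions):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class RuleBasedSpider(CrawlSpider):
    name = "example"
    allowed_domains = ["cincinnati.com"]
    start_urls = ["http://www.cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/"]

    # Follow every link the LinkExtractor matches and pass each response
    # to parse_detail. The allow pattern is a guess at what the company
    # detail URLs have in common.
    rules = (
        Rule(LinkExtractor(allow=r"/story/money/"), callback="parse_detail"),
    )

    def parse_detail(self, response):
        pass  # extract the item fields here

Note that a CrawlSpider must not override parse(), since that method is used internally to drive the rules.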
You can also do without rules entirely. Here is some tested code as a starting point; it uses a plain scrapy.Spider and schedules the detail-page requests itself:
import scrapy


class ProjectItem(scrapy.Item):
    title = scrapy.Field()
    owned = scrapy.Field()


class ProjectSpider(scrapy.Spider):
    name = "cin100"
    allowed_domains = ['cincinnati.com']
    start_urls = ['http://www.cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/']

    def parse(self, response):
        # get a selector for each of the 100 companies
        sel_companies = response.xpath('//p[contains(.,"Here are the companies")]/following-sibling::p/a')

        # create a request for every single company detail page from its href
        for sel_company in sel_companies:
            href = sel_company.xpath('./@href').extract_first()
            url = response.urljoin(href)
            request = scrapy.Request(url, callback=self.parse_company_detail)
            yield request

    def parse_company_detail(self, response):
        # on the detail page, create the item
        item = ProjectItem()

        # get the detail information with specific XPath statements,
        # e.g. the title is the first paragraph
        item['title'] = response.xpath('//div[@role="main"]/p[1]//text()').extract_first()

        # e.g. "family owned" has a label we can select on
        item['owned'] = response.xpath('//div[@role="main"]/p[contains(.,"Family owned")]/text()').extract_first()

        # find clever XPaths for the other fields ...
        # ...

        # finally: yield the item
        yield item
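To run the spider, the two classes need to live in a Scrapy project (the spider module under spiders/). Assuming that setup, the built-in feed export writes the scraped items to a file:

scrapy crawl cin100 -o companies.csv

Swapping the extension (e.g. companies.json) changes the output format accordingly.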