How can I scrape this data from the website?

Posted: 2017-05-12 02:35:37

Tags: web-scraping beautifulsoup scrapy

Here is an example: http://www.cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/

Ideally, the output would be a cleanly scraped and extracted array of records with the following fields:

  • Company name
  • 2016 rank
  • 2015 rank
  • Years in business
  • Business description
  • Website
  • 2015 revenue
  • 2014 revenue
  • HQ city
  • Year founded
  • Employees
  • Family owned?

pulled from each individual company detail page. I am a complete beginner, and I would like to know how to extract the links automatically; in the code below I am feeding them in manually. Can anyone help me?
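For context, the SpyItem class imported below lives in spy/items.py, which is not shown in the question; a definition covering the fields the spider assigns would look roughly like this (a sketch, with the field names copied from the spider code and the module layout assumed):

    import scrapy

    # Hypothetical spy/items.py -- one Field per attribute the spider sets.
    class SpyItem(scrapy.Item):
        title = scrapy.Field()
        Business = scrapy.Field()
        website = scrapy.Field()
        Ranking = scrapy.Field()
        HQ = scrapy.Field()
        Revenue2015 = scrapy.Field()
        Revenue2014 = scrapy.Field()
        YearFounded = scrapy.Field()
        Employees = scrapy.Field()
        FamilyOwned = scrapy.Field()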

    import scrapy
    from spy.items import SpyItem

    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.linkextractors import LinkExtractor

    class ProjectSpider(CrawlSpider):
        name = "project"
        allowed_domains = ["cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/"]
        start_urls = [100Links in here]

        def parse(self, response):
            item = SpyItem()
            item['title'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[1]/strong/text()').extract()
            item['Business'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[4]/text()').extract()
            item['website'] = response.xpath('//p[5]/a/text()').extract()
            item['Ranking'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[2]/text()[1]').extract()
            item['HQ'] = response.css('p:nth-child(12)::text').extract()
            item['Revenue2015'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[7]/text()').extract()
            item['Revenue2014'] = response.css('p:nth-child(10)::text').extract()
            item['YearFounded'] = response.xpath('//p[11]/text()').extract().encode('utf-8')
            item['Employees'] = response.xpath('//article/div[3]/p[12]/text()').extract()
            item['FamilyOwned'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[13]/text()').extract()
            yield item

1 Answer:

Answer 0 (score: -1)

There are at least two problems with your code:

  1. allowed_domains must contain a bare domain name, nothing more.
  2. Rules are what make a CrawlSpider work, and you have not defined any; see the sketch right after this list.
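
For illustration, a minimal, untested sketch of what such rules could look like on a CrawlSpider (the spider name, the restrict_xpaths value, and the parse_item callback are assumptions based on the page structure used in the code below):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class ProjectCrawlSpider(CrawlSpider):
        name = "cin100_crawl"  # hypothetical name
        allowed_domains = ['cincinnati.com']
        start_urls = ['http://www.cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/']

        # Follow links inside the main article body and hand each
        # company detail page to parse_item. CrawlSpider generates the
        # requests itself, so parse() must not be overridden.
        rules = (
            Rule(LinkExtractor(restrict_xpaths='//div[@role="main"]'),
                 callback='parse_item'),
        )

        def parse_item(self, response):
            # Extract the real fields here; the URL is just a placeholder.
            yield {'url': response.url}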
Here is some tested code as a starting point:

    import scrapy

    class ProjectItem(scrapy.Item):
        title = scrapy.Field()
        owned = scrapy.Field()

    class ProjectSpider(scrapy.Spider):
        name = "cin100"
        allowed_domains = ['cincinnati.com']
        start_urls = ['http://www.cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/']

        def parse(self, response):
            # get a selector for all 100 companies
            sel_companies = response.xpath('//p[contains(.,"Here are the companies")]/following-sibling::p/a')

            # create a request for every company detail page from its href
            for sel_company in sel_companies:
                href = sel_company.xpath('./@href').extract_first()
                url = response.urljoin(href)
                request = scrapy.Request(url, callback=self.parse_company_detail)
                yield request

        def parse_company_detail(self, response):
            # on the detail page, create an item
            item = ProjectItem()
            # get detail information with specific XPath statements,
            # e.g. the title is the first paragraph
            item['title'] = response.xpath('//div[@role="main"]/p[1]//text()').extract_first()
            # e.g. family owned has a label we can select
            item['owned'] = response.xpath('//div[@role="main"]/p[contains(.,"Family owned")]/text()').extract_first()
            # find clever XPaths for the other fields ...
            # ...
            # finally: yield the item
            yield item
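
Assuming the code above is saved in a single file, e.g. cin100_spider.py (the filename is just an example), you can run it without a full Scrapy project and export the items to CSV:

    scrapy runspider cin100_spider.py -o companies.csv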