How can I scrape this data from the website?

Posted: 2017-05-12 02:35:37

Tags: web-scraping beautifulsoup scrapy

Here is an example: http://www.cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/

Ideally, the output would be a cleanly scraped and extracted array of records with the following fields:

  • Company name
  • 2016 rank
  • 2015 rank
  • Years in business
  • Business description
  • Website
  • 2015 revenue
  • 2014 revenue
  • HQ city
  • Year founded
  • Employees
  • Family owned?

pulled from each individual company detail page. I am a complete beginner, and I would like to know how to extract the links automatically; in the code below I am feeding them in manually. Can anyone help me?
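For context, the SpyItem class imported below lives in spy/items.py, which is not shown in the question; a definition covering the fields the spider assigns would look roughly like this (a sketch, with the field names copied from the spider code and the module layout assumed):

    import scrapy

    # Hypothetical spy/items.py -- one Field per attribute the spider sets.
    class SpyItem(scrapy.Item):
        title = scrapy.Field()
        Business = scrapy.Field()
        website = scrapy.Field()
        Ranking = scrapy.Field()
        HQ = scrapy.Field()
        Revenue2015 = scrapy.Field()
        Revenue2014 = scrapy.Field()
        YearFounded = scrapy.Field()
        Employees = scrapy.Field()
        FamilyOwned = scrapy.Field()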

    import scrapy
    from spy.items import SpyItem

    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.linkextractors import LinkExtractor

    class ProjectSpider(CrawlSpider):
        name = "project"
        allowed_domains = ["cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/"]
        start_urls = [100Links in here]

        def parse(self, response):
            item = SpyItem()
            item['title'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[1]/strong/text()').extract()
            item['Business'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[4]/text()').extract()
            item['website'] = response.xpath('//p[5]/a/text()').extract()
            item['Ranking'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[2]/text()[1]').extract()
            item['HQ'] = response.css('p:nth-child(12)::text').extract()
            item['Revenue2015'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[7]/text()').extract()
            item['Revenue2014'] = response.css('p:nth-child(10)::text').extract()
            item['YearFounded'] = response.xpath('//p[11]/text()').extract().encode('utf-8')
            item['Employees'] = response.xpath('//article/div[3]/p[12]/text()').extract()
            item['FamilyOwned'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[13]/text()').extract()
            yield item

1 Answer:

Answer 0 (score: -1)

There are at least two problems with your code:

  1. allowed_domains must contain a bare domain name, nothing more.
  2. Rules are what make a CrawlSpider work, and you have not defined any; see the sketch right after this list.
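
For illustration, a minimal, untested sketch of what such rules could look like on a CrawlSpider (the spider name, the restrict_xpaths value, and the parse_item callback are assumptions based on the page structure used in the code below):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class ProjectCrawlSpider(CrawlSpider):
        name = "cin100_crawl"  # hypothetical name
        allowed_domains = ['cincinnati.com']
        start_urls = ['http://www.cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/']

        # Follow links inside the main article body and hand each
        # company detail page to parse_item. CrawlSpider generates the
        # requests itself, so parse() must not be overridden.
        rules = (
            Rule(LinkExtractor(restrict_xpaths='//div[@role="main"]'),
                 callback='parse_item'),
        )

        def parse_item(self, response):
            # Extract the real fields here; the URL is just a placeholder.
            yield {'url': response.url}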
Here is some tested code as a starting point:

    import scrapy

    class ProjectItem(scrapy.Item):
        title = scrapy.Field()
        owned = scrapy.Field()

    class ProjectSpider(scrapy.Spider):
        name = "cin100"
        allowed_domains = ['cincinnati.com']
        start_urls = ['http://www.cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/']

        def parse(self, response):
            # get a selector for all 100 companies
            sel_companies = response.xpath('//p[contains(.,"Here are the companies")]/following-sibling::p/a')

            # create a request for every company detail page from its href
            for sel_company in sel_companies:
                href = sel_company.xpath('./@href').extract_first()
                url = response.urljoin(href)
                request = scrapy.Request(url, callback=self.parse_company_detail)
                yield request

        def parse_company_detail(self, response):
            # on the detail page, create an item
            item = ProjectItem()
            # get detail information with specific XPath statements,
            # e.g. the title is the first paragraph
            item['title'] = response.xpath('//div[@role="main"]/p[1]//text()').extract_first()
            # e.g. family owned has a label we can select
            item['owned'] = response.xpath('//div[@role="main"]/p[contains(.,"Family owned")]/text()').extract_first()
            # find clever XPaths for the other fields ...
            # ...
            # finally: yield the item
            yield item
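
Assuming the code above is saved in a single file, e.g. cin100_spider.py (the filename is just an example), you can run it without a full Scrapy project and export the items to CSV:

    scrapy runspider cin100_spider.py -o companies.csv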