Scrapy spider is not crawling the desired pages

Time: 2016-05-25 08:42:07

Tags: python scrapy-spider

Here is the link to the site I want to scrape: http://search.epfoservices.in/est_search_display_result.php?pageNum_search=1&totalRows_search=72045&old_rg_id=AP&office_name=&pincode=&estb_code=&estb_name=&paging=paging Below is my scraper. This is my first attempt at scraping, so please forgive any silly mistakes. Please take a look and suggest any changes that might get my code running.

Items.py



import scrapy


class EpfoCrawl2Item(scrapy.Item):
    # One field per column of the EPFO search results table
    S_No = scrapy.Field()
    Old_region_code = scrapy.Field()
    Region_code = scrapy.Field()
    Name = scrapy.Field()
    Address = scrapy.Field()
    Pin = scrapy.Field()
    Epfo_office = scrapy.Field()
    Under_Ro = scrapy.Field()
    Under_Acc = scrapy.Field()
    Payment = scrapy.Field()
epfocrawl1_spider.py

import scrapy
from scrapy.selector import HtmlXPathSelector


class EpfoCrawlSpider(scrapy.Spider):
    """Spider for regularly updated search.epfoservices.in"""
    name = "PfData"
    allowed_domains = ["search.epfoservices.in"]
    starturls = ["http://search.epfoservices.in/est_search_display_result.php?pageNum_search=1&totalRows_search=72045&old_rg_id=AP&office_name=&pincode=&estb_code=&estb_name=&paging=paging"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//tr"]')
        items = []
        for val in rows:
            item = Val()
            item['S_no'] = val.select('/td[0]/text()').extract()
            item['Old_region_code'] = val.select('/td[1]/text').extract()
            item['Region_code'] = val.select('/td[2]/text()').extract()
            item['Name'] = val.select('/td[3]/text()').extract()
            item['Address'] = val.select('/td[4]/text()').extract()
            item['Pin'] = val.select('/td[5]/text()').extract()
            item['Epfo_office'] = val.select('/td[6]/text()').extract()
            item['Under_ro'] = val.select('/td[7]/text()').extract()
            item['Under_Acc'] = val.select('/td[8]/text()').extract()
            item['Payment'] = val.select('a/@href').extract()
            items.append(item)
            yield items

After running "scrapy crawl PfData", the spider does not scrape the pages I need. Asking for suggestions.

1 Answer:

Answer 0: (score: 0)

The list of start URLs must be named start_urls, not starturls. Scrapy only reads the start_urls attribute, so with starturls the spider never issues a single request.
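
A few more problems in parse will bite once requests do go out: the loop instantiates Val() instead of the item class, the cell XPaths are absolute and 0-indexed (XPath positions start at 1 and should be relative to the row), one selector is missing the text() parentheses, and the loop yields the whole items list instead of each item. Below is a minimal corrected sketch, assuming the project package is named epfocrawl1 (adjust the import to your actual layout) and using the field names from Items.py:

import scrapy

# Assumed project layout; change this import to match your project
from epfocrawl1.items import EpfoCrawl2Item


class EpfoCrawlSpider(scrapy.Spider):
    """Spider for regularly updated search.epfoservices.in"""
    name = "PfData"
    allowed_domains = ["search.epfoservices.in"]
    start_urls = [
        "http://search.epfoservices.in/est_search_display_result.php?pageNum_search=1&totalRows_search=72045&old_rg_id=AP&office_name=&pincode=&estb_code=&estb_name=&paging=paging"
    ]

    def parse(self, response):
        # response.xpath() replaces the deprecated HtmlXPathSelector
        for row in response.xpath('//tr'):
            item = EpfoCrawl2Item()
            # td positions are 1-based and relative to the current row
            item['S_No'] = row.xpath('td[1]/text()').extract()
            item['Old_region_code'] = row.xpath('td[2]/text()').extract()
            item['Region_code'] = row.xpath('td[3]/text()').extract()
            item['Name'] = row.xpath('td[4]/text()').extract()
            item['Address'] = row.xpath('td[5]/text()').extract()
            item['Pin'] = row.xpath('td[6]/text()').extract()
            item['Epfo_office'] = row.xpath('td[7]/text()').extract()
            item['Under_Ro'] = row.xpath('td[8]/text()').extract()
            item['Under_Acc'] = row.xpath('td[9]/text()').extract()
            item['Payment'] = row.xpath('a/@href').extract()
            yield item  # yield one item per row, not the accumulated list

Note that item keys are case-sensitive: the keys used in parse (S_No, Under_Ro) have to match the Field declarations in Items.py exactly, or Scrapy will raise a KeyError.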