Question

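This is my first attempt at writing a spider, so please forgive me if I have not done it properly. Here is the link to the website I am trying to extract data from: http://www.4icu.org/in/. I want the entire list of colleges displayed on the page. But when I run the following spider, I get back an empty JSON file. My items.py: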

    import scrapy

    class CollegesItem(scrapy.Item):
        # define the fields for your item here like:
        link = scrapy.Field()

Here is the spider, colleges.py:

    import scrapy
    from scrapy.spider import Spider
    from scrapy.http import Request

    class CollegesItem(scrapy.Item):
    # define the fields for your item here like:
        link = scrapy.Field()

    class CollegesSpider(Spider):
        name = 'colleges'
        allowed_domains = ["4icu.org"]
        start_urls = ('http://www.4icu.org/in/',)

        def parse(self, response):
            return Request(
                url = "http://www.4icu.org/in/",
                callback = self.parse_fixtures
            )
        def parse_fixtures(self,response):
            sel = response.selector
            for div in sel.css("col span_2_of_2>div>tbody>tr"):
                item = Fixture()
                item['university.name'] = tr.xpath('td[@class="i"]/span  /a/text()').extract()
                yield item

Answer 1

As stated in the comments on the question, there are some issues with your code.

First, you do not need two methods, because in the parse method you request the same URL that is already in start_urls.

To get some information from the site, try the following code:

    def parse(self, response):
        for tr in response.xpath('//div[@class="section group"][5]/div[@class="col span_2_of_2"][1]/table//tr'):
            if tr.xpath(".//td[@class='i']"):
                name = tr.xpath('./td[1]/a/text()').extract()[0]
                location = tr.xpath('./td[2]//text()').extract()[0]
                print name, location

Then adapt it to fill your item (or items) as needed.

As you can see, your browser displays an additional tbody inside the table, which is not present when you scrape the page with Scrapy. This means you often need to be critical of what you see in the browser.
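The tbody point can be shown with a small stand-alone sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, and the markup is hypothetical (it only mimics a raw response with no tbody; it is not the real 4icu.org page). The assumption is that a path copied from the browser inspector includes tbody, while the raw HTML does not:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup shaped like a raw server response: no <tbody> element.
RAW_HTML = """
<div>
  <table>
    <tr><td class="i"><a href="/a">College A</a></td><td>City A</td></tr>
    <tr><td class="i"><a href="/b">College B</a></td><td>City B</td></tr>
  </table>
</div>
""".strip()

root = ET.fromstring(RAW_HTML)

# Path as a browser inspector would suggest it, with the inserted <tbody>.
rows_with_tbody = root.findall(".//table/tbody/tr")

# Path that matches rows at any depth under the table, with or without <tbody>.
rows_any_depth = root.findall(".//table//tr")

# Read the link text from the first cell of each matched row.
names = [tr.find("td/a").text for tr in rows_any_depth]

print(len(rows_with_tbody), len(rows_any_depth), names)
```

The first path matches nothing on this markup, while the second finds both rows, which is why the XPath above uses table//tr rather than table/tbody/tr.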

Answer 2

Here is the working code:

    import scrapy
    from scrapy.spider import Spider
    from scrapy.http import Request

    class CollegesItem(scrapy.Item):
        # define the fields for your item here like:
        name = scrapy.Field()
        location = scrapy.Field()

    class CollegesSpider(Spider):
        name = 'colleges'
        allowed_domains = ["4icu.org"]
        start_urls = ('http://www.4icu.org/in/',)

        def parse(self, response):
            for tr in response.xpath('//div[@class="section group"][5]/div[@class="col span_2_of_2"][1]/table//tr'):
                if tr.xpath(".//td[@class='i']"):
                    item = CollegesItem()
                    item['name'] = tr.xpath('./td[1]/a/text()').extract()[0]
                    item['location'] = tr.xpath('./td[2]//text()').extract()[0]
                    yield item

After running the spider with the command

    >>scrapy crawl colleges -o mait.json

here is an excerpt of the result:

    [{"name": "Indian Institute of Technology Bombay", "location": "Mumbai"},
    {"name": "Indian Institute of Technology Madras", "location": "Chennai"},
    {"name": "University of Delhi", "location": "Delhi"},
    {"name": "Indian Institute of Technology Kanpur", "location": "Kanpur"},
    {"name": "Anna University", "location": "Chennai"},
    {"name": "Indian Institute of Technology Delhi", "location": "New Delhi"},
    {"name": "Manipal University", "location": "Manipal ..."},
    {"name": "Indian Institute of Technology Kharagpur", "location": "Kharagpur"},
    {"name": "Indian Institute of Science", "location": "Bangalore"},
    {"name": "Panjab University", "location": "Chandigarh"},
    {"name": "National Institute of Technology, Tiruchirappalli", "location": "Tiruchirappalli"}, .........
