Question

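I am trying to scrape the top 100 T20 batsmen from the ICC website, but the CSV file I get is blank. There are no errors in my code (at least none that I know of). Here is my items file: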

import scrapy

class DmozItem(scrapy.Item):
    Ranking = scrapy.Field()
    Rating = scrapy.Field()
    Name = scrapy.Field()
    Nationality = scrapy.Field()
    Carer_Best_Rating = scrapy.Field()

The dmoz_spider file:

import scrapy

from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "espn"
    allowed_domains = ["relianceiccrankings.com"]
    start_urls = ["http://www.relianceiccrankings.com/ranking/t20/batting/"]

    def parse(self, response):
        #sel = response.selector
        #for tr in sel.css("table.top100table>tbody>tr"):
        for tr in response.xpath('//table[@class="top100table"]/tr'):
            item = DmozItem()
            item['Ranking'] = tr.xpath('//td[@class="top100id"]/text()').extract_first()
            item['Rating'] = tr.xpath('//td[@class="top100rating"]/text()').extract_first()
            item['Name'] = tr.xpath('td[@class="top100name"]/a/text()').extract_first()
            item['Nationality'] = tr.xpath('//td[@class="top100nation"]/text()').extract_first()
            item['Carer_Best_Rating'] = tr.xpath('//td[@class="top100cbr"]/text()').extract_first()
            yield item

What is wrong with my code?

Answer 1

The website you are trying to scrape has a frame in it, and that frame is the page you actually want to scrape.

start_urls = [
    "http://www.relianceiccrankings.com/ranking/t20/batting/"
]

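This is the correct URL.

There are a few more mistakes as well.

To select elements, use response itself. You don't need to set up a variable from response.selector; just call response.xpath('//foo/bar') directly.

The CSS selector for the table is wrong. top100table is a class, not an id, so it should be .top100table rather than #top100table.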

Here is just the XPath for it:

response.xpath("//table[@class='top100table']/tr")

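tbody is not part of the page's HTML source. It only shows up when you inspect the page in a modern browser, because the browser inserts it while building the DOM.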
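
As a rough illustration of why dropping tbody matters, here is a small sketch. It uses Python's built-in xml.etree.ElementTree as a stand-in for Scrapy's selectors (an assumption, chosen only so the snippet runs without extra packages), and the markup is made up to resemble a ranking table rather than taken from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw markup as a server might send it: no <tbody> element.
RAW = (
    '<html><body><table class="top100table">'
    "<tr><td>1</td></tr><tr><td>2</td></tr>"
    "</table></body></html>"
)

doc = ET.fromstring(RAW)

# A path copied from the browser inspector includes tbody and matches nothing.
with_tbody = doc.findall(".//table[@class='top100table']/tbody/tr")

# The same path without tbody matches the rows that are really in the source.
without_tbody = doc.findall(".//table[@class='top100table']/tr")

print(len(with_tbody))
print(len(without_tbody))
```

If the rows only show up with the second path, the tbody seen in the browser was added by the browser and is not in the response Scrapy receives.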

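The extract() method always returns a list, not the element itself, so you need to take the first match you found, for example with extract_first():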

item['Ranking'] = tr.xpath('td[@class="top100id"]/a/text()').extract_first()

Hope this helps, have fun!

Answer 2

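To answer your Ranking problem: the XPath for Ranking starts with '//...', which means "search from the start of the page". As written, every row returns the first matching cell in the whole document. You need the path to be relative to tr, so just remove the '//' from each XPath inside the for loop.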

item['Ranking'] = tr.xpath('td[@class="top100id"]/text()').extract_first()
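
To make the difference concrete, here is a minimal sketch. It again uses the standard library's xml.etree.ElementTree instead of Scrapy (an assumption, so that it runs anywhere), with searching from the table root playing the role of a leading '//' on a Scrapy row selector. The two-row table and the player names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-row table shaped like the rankings page (made-up data).
HTML = """
<table class="top100table">
  <tr><td class="top100id">1</td><td class="top100name"><a>Player A</a></td></tr>
  <tr><td class="top100id">2</td><td class="top100name"><a>Player B</a></td></tr>
</table>
""".strip()

table = ET.fromstring(HTML)
rows = table.findall("tr")

# Searching from the top of the document on every iteration:
# each pass finds the first matching cell, whatever the current row is.
from_root = [table.find(".//td[@class='top100id']").text for _ in rows]

# Searching relative to the current row: each row yields its own cell.
from_row = [tr.find("td[@class='top100id']").text for tr in rows]

print(from_root)  # ['1', '1']
print(from_row)   # ['1', '2']
```

The first list repeats the same value for every row, which is the symptom you get when the paths inside the loop start at the document root.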

My Scrapy spider is not scraping anything (blank CSV file)
