Question

The table I want to scrape can be found here: http://www.betdistrict.com/tipsters

I'm after the table titled 'June Stats'.

Here is my spider:

from __future__ import division
from decimal import *

import scrapy
import urlparse

from ttscrape.items import TtscrapeItem 

class BetdistrictSpider(scrapy.Spider):
    name = "betdistrict"
    allowed_domains = ["betdistrict.com"]
    start_urls = ["http://www.betdistrict.com/tipsters"]

    def parse(self, response):
        for sel in response.xpath('//table[1]/tr'):
            item = TtscrapeItem()
            name = sel.xpath('td[@class="tipst"]/a/text()').extract()[0]
            url = sel.xpath('td[@class="tipst"]/a/@href').extract()[0]
            tipster = '<a href="' + url + '" target="_blank" rel="nofollow">' + name + '</a>'
            item['Tipster'] = tipster
            won = sel.xpath('td[2]/text()').extract()[0]
            lost = sel.xpath('td[3]/text()').extract()[0]
            void = sel.xpath('td[4]/text()').extract()[0]
            tips = int(won) + int(void) + int(lost)
            item['Tips'] = tips
            strike = Decimal(int(won) / tips) * 100
            strike = str(round(strike,2))
            item['Strike'] = [strike + "%"]
            profit = sel.xpath('//td[5]/text()').extract()[0]
            if profit[0] in ['+']:
                profit = profit[1:]
            item['Profit'] = profit
            yield_str = sel.xpath('//td[6]/text()').extract()[0]
            yield_str = yield_str.replace(' ','')
            if yield_str[0] in ['+']:
                yield_str = yield_str[1:]
            item['Yield'] = '<span style="color: #40AA40">' + yield_str + '%</span>'
            item['Site'] = 'Bet District'
            yield item

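This gives me a "list index out of range" error on the first variable (name).

However, when I rewrite my XPath selectors to start with //, for example:

name = sel.xpath('//td[@class="tipst"]/a/text()').extract()[0]

the spider runs, but it scrapes the first tipster over and over again.

I think it has something to do with the table having no thead, but containing th tags in the first tr of the tbody.

Any help is much appreciated.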

---------- ---------- EDIT

In response to Lars's suggestion:

I have tried what you suggested, but I still get a list index out of range error:

from __future__ import division
from decimal import *

import scrapy
import urlparse

from ttscrape.items import TtscrapeItem 

class BetdistrictSpider(scrapy.Spider):
    name = "betdistrict"
    allowed_domains = ["betdistrict.com"]
    start_urls = ["http://www.betdistrict.com/tipsters"]

    def parse(self, response):
        for sel in response.xpath('//table[1]/tr[td[@class="tipst"]]'):
            item = TtscrapeItem()
            name = sel.xpath('a/text()').extract()[0]
            url = sel.xpath('a/@href').extract()[0]
            tipster = '<a href="' + url + '" target="_blank" rel="nofollow">' + name + '</a>'
            item['Tipster'] = tipster
            yield item

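Also, I assume that doing things this way would require multiple for loops, since not all of the cells have the same class?

I have also tried doing things without the for loop, but in that case it once again scrapes only the first tipster :s

Thanks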

Answer 1

When you say

name = sel.xpath('td[@class="tipst"]/a/text()').extract()[0]

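the XPath expression starts with td, so it is relative to the context node held in the variable sel (that is, the tr element currently being visited as the for loop iterates over the set of tr elements).

But when you say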

name = sel.xpath('//td[@class="tipst"]/a/text()').extract()[0]

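the XPath expression starts with //td, which selects all td elements anywhere in the document; it is not relative to sel, so the result is the same on every iteration of the for loop. That is why it scrapes the first tipster over and over again.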
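To see the difference between a row-relative lookup and a document-wide one without running a spider, here is a minimal sketch. It assumes a made-up two-row table (the names, URLs, and numbers are hypothetical), and it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors. ElementTree only supports a subset of XPath, so the document-wide search is written as a search from the root element rather than as a leading // on the row.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, loosely modelled on the table described in the question.
PAGE = """
<html><body>
<table>
  <tr><td class="tipst"><a href="/t/alice">Alice</a></td><td>10</td></tr>
  <tr><td class="tipst"><a href="/t/bob">Bob</a></td><td>7</td></tr>
</table>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

relative_names = []
document_names = []
for row in root.findall(".//table/tr"):
    # Relative path: evaluated from the current row,
    # comparable to sel.xpath('td[@class="tipst"]/a/text()').
    relative_names.append(row.find('td[@class="tipst"]/a').text)
    # Search anchored at the document root: it ignores `row` entirely,
    # comparable to sel.xpath('//td[@class="tipst"]/a/text()').
    document_names.append(root.find('.//td[@class="tipst"]/a').text)

print(relative_names)  # ['Alice', 'Bob']
print(document_names)  # ['Alice', 'Alice']
```

The second list repeats the first tipster because the search never depends on which row the loop is currently visiting.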

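Why does the first XPath expression fail with a list index out of range error? Try taking the XPath expression one location step at a time, printing out the results, and you will soon find the problem. In this case it appears to be because the first tr child of table[1] has no td children (only th children). So xpath() selects nothing, extract() returns an empty list, and the attempt to reference the first item of that empty list raises the list index out of range error.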
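The step-by-step check can be sketched like this. Again the markup is a hypothetical stand-in for the real page, and xml.etree.ElementTree stands in for Scrapy's selectors, so treat it as an illustration of the debugging approach rather than the exact calls you would make in the spider.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a header row of th cells followed by one data row.
PAGE = """
<html><body>
<table>
  <tr><th>Tipster</th><th>Won</th></tr>
  <tr><td class="tipst"><a href="/t/alice">Alice</a></td><td>10</td></tr>
</table>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# Step 1: which rows does the loop expression select?
rows = root.findall(".//table/tr")
first_row = rows[0]

# Step 2: what does the next step (td) select inside the first row?
cells = first_row.findall("td")
child_tags = [child.tag for child in first_row]
print(cells)       # [] : nothing selected
print(child_tags)  # ['th', 'th'] : the header row has only th children

# Step 3: indexing the empty result is what raises the error.
raised = False
try:
    cells[0]
except IndexError:
    raised = True
print(raised)  # True
```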

To fix this, you can change the for loop's XPath expression so that it loops only over tr elements that have td children:

for sel in response.xpath('//table[1]/tr[td]'):

You could get fancier and require a td of the right class:

for sel in response.xpath('//table[1]/tr[td[@class="tipst"]]'):
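Putting the pieces together, a rough sketch of the whole loop might look like the following. The markup and the parse_rows helper are hypothetical, and xml.etree.ElementTree is used in place of Scrapy's selectors so the snippet runs on its own. ElementTree does not handle the nested predicate, so the sketch uses the simpler tr[td] form and checks for the link inside the loop instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, loosely modelled on the table described in the question.
PAGE = """
<html><body>
<table>
  <tr><th>Tipster</th><th>Won</th></tr>
  <tr><td class="tipst"><a href="/t/alice">Alice</a></td><td>10</td></tr>
  <tr><td class="tipst"><a href="/t/bob">Bob</a></td><td>7</td></tr>
</table>
</body></html>
""".strip()


def parse_rows(root):
    """Collect one dict per data row (hypothetical helper, not Scrapy API)."""
    items = []
    # Only rows with at least one td child, so the th-only header is skipped.
    for row in root.findall(".//table[1]/tr[td]"):
        # Every lookup below is relative to the current row.
        link = row.find('td[@class="tipst"]/a')
        if link is None:
            continue  # a data row without a tipster link
        items.append({
            "name": link.text,
            "url": link.get("href"),
            "won": int(row.find("td[2]").text),
        })
    return items


items = parse_rows(ET.fromstring(PAGE))
print(items)
```

Note that the link is still reached through its td. Because the loop variable is the tr, a bare a/text() step from the row matches nothing, as a is a child of td rather than of tr; in the spider the row-relative form td[@class="tipst"]/a/text() from the original code should be the one to keep.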
