Web scraping: XPath list index out of range

Time: 2018-11-04 14:01:55

Tags: python list xpath web-scraping

When I run the following code, I get a "list index out of range" error:

import requests
from lxml.html import fromstring

def get_values():
    print('executing get_values...')
    url = 'https://sports.yahoo.com/nba/stats/weekly/?sortStatId=POINTS_PER_GAME&selectedTable=0'
    response = requests.get(url)
    parser = fromstring(response.text)
    for i in parser.xpath('//tbody/tr')[:100]:
        FGM = i.xpath('.//td[4]/span/text()')[0]  # This runs with no error even though it has a similar XPath.
        print('FGM: ' + FGM)
        G = i.xpath('.//td[2]/span/text()')[0]  # This is the line that raises IndexError.
        print(G)

values = get_values()

Running the code produces the following error message:

 G = i.xpath('.//td[2]/span/text()')[0]
 IndexError: list index out of range

I tried debugging with the following statements:

print(parser.xpath('//tbody/tr/td[2]/span/text()')) #Returns list['4', '4', '3', '3', '3', '4', '4', '3', '2', '4', '3']
print(parser.xpath('//tbody/tr/td[2]/span/text()')[0]) #Returns value = 4
print(len(parser.xpath('//tbody/tr/td[2]/span/text()')[0])) # Returns value = 1

The output shows the expected values, so I'm not sure why the loop fails. Any help would be appreciated!

2 answers:

Answer 0 (score: 1)

It fails because the second <td> does not always contain a <span>. This should work:

import requests
from lxml.html import fromstring

def get_values():
    print('executing get_values...')
    url = 'https://sports.yahoo.com/nba/stats/weekly/?sortStatId=POINTS_PER_GAME&selectedTable=0'
    response = requests.get(url)
    parser = fromstring(response.text)
    for i in parser.xpath('//tbody/tr')[:100]:
        FGM = i.xpath('.//td[4]/span/text()')[0]
        print('FGM: ' + FGM)
        # Changed: also accept text that sits directly in the <td>, without a <span>.
        G = i.xpath('.//td[2]/text()|.//td[2]/span/text()')[0]
        print(G)

values = get_values()
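
The union above still indexes with [0], so it would raise the same IndexError if a cell had neither a <span> nor direct text. A minimal sketch of a more defensive variant follows; `first_text` and the `SAMPLE` markup are hypothetical and only stand in for the Yahoo page, whose real structure may differ.

```python
from lxml.html import fromstring

# Hypothetical sample markup (not the real Yahoo page): the second cell
# holds its value in a <span>, as bare text, or not at all.
SAMPLE = """
<table><tbody>
  <tr><td>A</td><td><span>4</span></td></tr>
  <tr><td>B</td><td>3</td></tr>
  <tr><td>C</td><td></td></tr>
</tbody></table>
"""

def first_text(node, path, default=None):
    """Return the first non-empty, stripped text result of `path`, else `default`."""
    for value in node.xpath(path):
        value = value.strip()
        if value:
            return value
    return default

rows = fromstring(SAMPLE).xpath('//tbody/tr')
games = [first_text(r, './td[2]/span/text() | ./td[2]/text()') for r in rows]
print(games)
```

Returning a default instead of indexing keeps the loop running and makes the rows with missing data visible in the output.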

Answer 1 (score: 1)

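Selecting the items that match the query //foo/bar/qux is not the same as querying //foo, iterating over the results, and expecting every one of those elements to match ./bar/qux. There may be many <foo> elements that have no <bar><qux>.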
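
The difference can be seen on a tiny document. This is only an illustration: the XML below is made up to mirror the foo/bar/qux names used above.

```python
from lxml import etree

# Made-up document: three <foo> elements, the middle one has no <bar><qux>.
root = etree.fromstring(
    '<root>'
    '<foo><bar><qux>1</qux></bar></foo>'
    '<foo/>'
    '<foo><bar><qux>2</qux></bar></foo>'
    '</root>'
)

# One query over the whole document: the empty <foo> is silently skipped.
flat = root.xpath('//foo/bar/qux/text()')

# One query per <foo>: every element is visited, so one result list is empty
# and indexing it with [0] would raise IndexError.
per_foo = [foo.xpath('./bar/qux/text()') for foo in root.xpath('//foo')]

print(flat)
print(per_foo)
```

The flat query yields two values while the loop visits three elements, which is why the debugging prints in the question looked fine even though the loop failed.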

For example, the page source contains this <tr>:

<tr class="Bgc(secondary-enhanced):h" data-reactid="1522">
    <th class="Px(cell-padding-x) Py(cell-padding-y) Bd...>

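So this <tr> contains no <td> at all, only <th> cells (it is a header row). Filtering the rows inside the XPath itself avoids the error: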

def get_values():
    print('executing get_values...')
    url = 'https://sports.yahoo.com/nba/stats/weekly/?sortStatId=POINTS_PER_GAME&selectedTable=0'
    response = requests.get(url)
    parser = fromstring(response.text)
    # The predicate keeps only rows whose 2nd and 4th <td> both contain a <span>.
    for i in parser.xpath('//tbody/tr[td[4]/span and td[2]/span]')[:100]:
        FGM = i.xpath('.//td[4]/span/text()')[0]
        print('FGM: ' + FGM)
        G = i.xpath('.//td[2]/span/text()')[0]
        print(G)

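Here the last two rows are left out of the result because their values are not wrapped in a <span> tag, so you will need some extra querying to select the right rows and extract the right values.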
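
One possible way to do that extra querying, sketched under assumptions: select rows on the presence of <td> cells only, and read each cell with the XPath function normalize-space(), which takes the cell's string value whether or not the text is wrapped in a <span>. The `cell_text` helper and the `SAMPLE` markup are hypothetical; the real page may need a different row filter.

```python
from lxml.html import fromstring

# Hypothetical sample markup: a header row, a row with <span>-wrapped
# values, and a row with bare text values.
SAMPLE = """
<table><tbody>
  <tr><th>Player</th><th>G</th><th>Min</th><th>FGM</th></tr>
  <tr><td>A</td><td><span>4</span></td><td>30</td><td><span>9.5</span></td></tr>
  <tr><td>B</td><td>3</td><td>28</td><td>8.0</td></tr>
</tbody></table>
"""

def cell_text(row, index):
    """Return the whitespace-normalized text of the index-th <td> ('' if missing)."""
    return row.xpath(f'normalize-space(./td[{index}])')

# tr[td] keeps only rows that have at least one <td>, dropping the header row.
rows = fromstring(SAMPLE).xpath('//tbody/tr[td]')
stats = [(cell_text(r, 2), cell_text(r, 4)) for r in rows]
print(stats)
```

Because normalize-space() returns an empty string for a missing cell rather than an empty list, there is nothing to index and no IndexError to guard against; empty strings can be filtered out afterwards if needed.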