Question

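I am trying to write code that extracts data from a website, using Python with the urllib2 and BeautifulSoup libraries.

I tried iterating over the rows of the desired table and storing the data from the 'td' cells of each row in a list variable, row_datas. Even though I can print the entire list, I cannot access the list at a particular index; the interpreter throws a "list index out of range" error. Here are my code and the output.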

import urllib2
from bs4 import BeautifulSoup

link = 'http://www.babycenter.in/a25008319/most-popular-indian-baby-names-of-2013'
page = urllib2.urlopen(link)
soup = BeautifulSoup(page)
right_table = soup.find('table', class_= 'contentTable colborders')
name=[]
meaning=[]
alternate=[]

for row in right_table.find_all("tr"):
  row_datas = row.find_all("td")
  print row_datas
  print row_datas[0]

Output:

[]Traceback (most recent call last):
  File "C:\Users\forcehandler\Documents\python\data_scrape.py", line 41, in <module>

print row_datas[0]
IndexError: list index out of range
[Finished in 1.6s]

I tried similar code to spot any obvious mistake, but to no avail. Code:

i = [range(y,10) for y in range(5)]
for j in i:
  print j
  print j[0]

Output:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
0
[1, 2, 3, 4, 5, 6, 7, 8, 9]
1
[2, 3, 4, 5, 6, 7, 8, 9]
2
[3, 4, 5, 6, 7, 8, 9]
3
[4, 5, 6, 7, 8, 9]
4

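I am new to programming and couldn't find help anywhere else. Thanks in advance!

Edit: The '[]' before the Traceback may have slipped into the output by accident while I was copy-pasting. Thank you for the helpful answers and suggestions.

Solution: I was not checking the integrity of the data before using it. It turns out the first row contained only 'th' cells and no 'td' cells, hence the error.

Lesson: always test your data before putting it to any use.

Side note: this is my first question on StackOverflow, and I am overwhelmed by such quick, high-quality and helpful responses.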

Answer 1

Your output shows that at least one of the rows is empty:

[]Traceback (most recent call last):
^^

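[] is an empty list; that output is produced by the print row_datas line. Normally I would expect a newline between it and the Traceback. Perhaps you did not copy your output correctly, or your console uses a fixed-size buffer instead of line buffering, which causes it to mix stdout and stderr.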
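To see that the two pieces of output come from different streams, here is a minimal sketch. It is written for Python 3 (an assumption; the question uses Python 2), and the one-line child script is made up for illustration. It runs the failing pattern in a child process and captures stdout and stderr separately:

```python
import subprocess
import sys

# Hypothetical child script: print an empty list, then index into it.
child_code = "print([])\n[][0]\n"

result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,  # keep stdout and stderr apart
    text=True,
)

# The "[]" is ordinary program output on stdout ...
print("stdout:", repr(result.stdout))
# ... while the traceback is written to stderr by the interpreter.
print("stderr ends with:", result.stderr.strip().splitlines()[-1])
```

Because the streams are independent, how they interleave on screen depends on the console's buffering, which is one way the '[]' could end up glued to the Traceback line.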

The empty list appears because the first of those rows contains th header cells rather than td cells:

>>> rows = soup.select('table.contentTable tr')
>>> rows[0].find('td') is None
True
>>> rows[0].find_all('th')
[<th width="20%">Name</th>, <th>Meaning</th>, <th>Popular <br/>\nalternate spellings</th>]

There is another such row further down, so you have to code defensively:

>>> rows[26]
<tr><th width="20%">Name</th><th>Meaning</th><th>Popular <br/>\nalternate spellings</th></tr>

You can test whether there are any elements with an if statement:

if row_datas:
    print row_datas[0]
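The same guard can be tried offline. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree as a stand-in for BeautifulSoup so that it runs without third-party packages, it is written for Python 3, and the sample table with its placeholder cell values is invented.

```python
import xml.etree.ElementTree as ET

# Invented sample: header rows hold only <th>, data rows hold only <td>.
SAMPLE = """
<table>
  <tr><th>Name</th><th>Meaning</th></tr>
  <tr><td>name-1</td><td>meaning-1</td></tr>
  <tr><th>Name</th><th>Meaning</th></tr>
  <tr><td>name-2</td><td>meaning-2</td></tr>
</table>
"""

table = ET.fromstring(SAMPLE.strip())

first_cells = []
for row in table.iter("tr"):
    cells = row.findall("td")  # empty list for a header-only row
    if not cells:              # an empty list is falsy, so skip the row
        continue
    first_cells.append(cells[0].text)

print(first_cells)
```

Without the `if not cells` check, `cells[0]` would raise IndexError on the very first row, which is the behaviour reported in the question.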

Code to extract all the names, meanings and alternate spellings is then quite simple:

for row in soup.select('table.contentTable tr'):
    cells = row.find_all('td')
    if not cells:
        continue
    name_link = cells[0].find('a')
    name, link = name_link.get_text(strip=True), name_link.get('href')
    meaning, alt = (cell.get_text(strip=True) for cell in cells[1:])
    print '{}: {} ({})'.format(name, meaning, alt)

Answer 2

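You are getting this error because your list has no elements: row.find_all("td") did not find anything. You have to check the HTML structure, or use the select method.

select() returns all elements matched by a CSS selector. It is quite powerful, and your code would look like this: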

row_datas = soup.select("td")  # Note that select() is a method of a BeautifulSoup object.
print row_datas
print row_datas[0]
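One thing to keep in mind with this approach: a document-wide selection returns every cell in one flat list, so index 0 is the first cell of the whole page, not of the current row. The sketch below illustrates the difference. It is an assumption-laden stand-in: it uses the standard library's xml.etree.ElementTree instead of BeautifulSoup, targets Python 3, and the sample markup is invented.

```python
import xml.etree.ElementTree as ET

# Invented sample: one header row followed by two data rows.
SAMPLE = """
<table>
  <tr><th>Name</th><th>Meaning</th></tr>
  <tr><td>name-1</td><td>meaning-1</td></tr>
  <tr><td>name-2</td><td>meaning-2</td></tr>
</table>
"""

table = ET.fromstring(SAMPLE.strip())

# Document-wide: every <td> in one flat list, row boundaries are lost.
all_cells = [td.text for td in table.iter("td")]

# Per row: one list per <tr>; the header row yields an empty list.
per_row = [[td.text for td in tr.findall("td")] for tr in table.iter("tr")]

print(all_cells)
print(per_row)
```

If the goal is to pair each name with its meaning, the per-row form combined with an emptiness check is probably the more convenient one.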

Python: Cannot access a list element even though it exists
