Question

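I am extracting data from multiple URLs and inserting it into a PostgreSQL database. I am having trouble with the following line of code. Any help would be appreciated.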

tbody = soup('table', {"class": "tollinfotbl"})[0].find_all('tr')[1:]
IndexError: list index out of range

Here is my full source code:

import csv
import urllib.request
import psycopg2
from urllib.request import urlopen as uReq

from bs4 import BeautifulSoup as soup

conn = psycopg2.connect(database='--',user='--', password='--', port=--)
cursor = conn.cursor()


for i in range(493):
    my_url = 'http://tis.nhai.gov.in/TollInformation?TollPlazaID={}'.format(i)
    uClient = uReq(my_url)
    page1_html = uClient.read()
    uClient.close()
    # html parsing
    soup = soup(page1_html, 'html.parser')


    tbody = soup('table', {"class": "tollinfotbl"})[0].find_all('tr')[1:]
    for row in tbody:
        cols = row.findChildren(recursive=False)
        cols = [ele.text.strip() for ele in cols]
        if cols:
            vehicle_type = str(cols[0])
            one_time = str(cols[1])
            return_type = str(cols[2])
            monthly_pass = str(cols[3])
            local_vehicle = str(cols[4])

            query = "INSERT INTO toll (vehicle_type, one_time, return_type, monthly_pass, local_vehicle) VALUES (%s, %s, %s, %s, %s);"
            data = (vehicle_type, one_time, return_type, monthly_pass, local_vehicle)
            cursor.execute(query, data)

# Commit the transaction
    conn.commit()

Answer 1

It seems that some of the pages are missing the information you are looking for. You can use error catching, for example:

try:
    tbody = soup('table', {"class": "tollinfotbl"})[0].find_all('tr')[1:]
except IndexError:
    continue  # Skip this page if no items were scraped

If an error occurs, this skips the URL, so make sure you know what you are doing.
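One way to "know what you are doing" when skipping pages is to record which IDs were skipped instead of dropping them silently. The following is a minimal sketch of that idea, not the answerer's code: `scrape_all` and `fetch_rows` are hypothetical names, and `fetch_rows` is assumed to raise IndexError when a page has no matching table.

```python
def scrape_all(plaza_ids, fetch_rows):
    """Call fetch_rows(plaza_id) for each ID, skipping pages that lack the table.

    fetch_rows is a hypothetical callable that returns the table rows for one
    page and raises IndexError when the page has no matching table.
    """
    results = {}
    skipped = []
    for plaza_id in plaza_ids:
        try:
            results[plaza_id] = fetch_rows(plaza_id)
        except IndexError:
            # Only the "no table on this page" case is skipped; any other
            # exception (network failure, typo, ...) still surfaces.
            skipped.append(plaza_id)
    return results, skipped
```

Catching only IndexError keeps unrelated failures visible, and the returned `skipped` list can be printed or logged after the run to see which pages had no data.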

Answer 2

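Adding to @pythonist's answer, since I don't have enough reputation to comment: the table targeted by tbody = soup('table', {"class": "tollinfotbl"})[0].find_all('tr')[1:] is not present on every page. On those pages the lookup returns an empty list, so indexing it with [0] fails with IndexError: list index out of range.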

For example, the page http://tis.nhai.gov.in/TollInformation?TollPlazaID=200 contains the table, but http://tis.nhai.gov.in/TollInformation?TollPlazaID=2 does not, and the same is probably true of many other pages.

You can simply catch the error, as mentioned in the other answer, because there is no point in parsing data that does not exist.
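As an alternative to catching the exception, the missing table can be tested for explicitly. This is a rough sketch under assumptions, not code from either answer: `extract_toll_rows` is a hypothetical helper, and it relies on Beautiful Soup's `find`, which returns None when nothing matches instead of an empty list.

```python
from bs4 import BeautifulSoup


def extract_toll_rows(html):
    """Return the data rows of the first 'tollinfotbl' table, or [] if absent."""
    page = BeautifulSoup(html, "html.parser")
    table = page.find("table", {"class": "tollinfotbl"})
    if table is None:
        # Page has no toll table (as reported for TollPlazaID=2): nothing to parse.
        return []
    rows = []
    for tr in table.find_all("tr")[1:]:  # skip the header row
        cells = [cell.get_text().strip()
                 for cell in tr.find_all(["td", "th"], recursive=False)]
        if cells:
            rows.append(cells)
    return rows
```

Importing `BeautifulSoup` under its own name and binding the parsed document to `page` also avoids the rebinding in the question's loop, where `soup = soup(page1_html, 'html.parser')` replaces the imported alias with a parsed document after the first iteration.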

-----Edit-----

Sharing the code:

page = soup(page1_html, 'html.parser')
try:

    tbody = page('table', {"class": "tollinfotbl"})[0].find_all('tr')[1:]
    for row in tbody:
        cols = row.findChildren(recursive=False)
        cols = [ele.text.strip() for ele in cols]
        if cols:
            vehicle_type = str(cols[0])
            one_time = str(cols[1])
            return_type = str(cols[2])
            monthly_pass = str(cols[3])
            local_vehicle = str(cols[4])
except IndexError:
    continue

Add the rest of your code.
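The "rest of the code" is mainly the insert step from the question. Because the try block in the snippet above also wraps cols[4], a row with fewer than five cells would raise the same IndexError and skip the rest of that page. The sketch below shows one possible way to handle that per row. `insert_toll_rows` and `INSERT_SQL` are hypothetical names; the query text is copied from the question, and `cursor` is assumed to be any DB-API cursor such as the one returned by `conn.cursor()`.

```python
INSERT_SQL = (
    "INSERT INTO toll (vehicle_type, one_time, return_type, monthly_pass, local_vehicle) "
    "VALUES (%s, %s, %s, %s, %s);"
)


def insert_toll_rows(cursor, rows):
    """Insert each row that has at least five columns; return the number inserted."""
    inserted = 0
    for cols in rows:
        if len(cols) < 5:
            # A short row would raise IndexError on cols[4]; skip just that row.
            continue
        # Parameterized query: the driver handles quoting of the values.
        cursor.execute(INSERT_SQL, tuple(cols[:5]))
        inserted += 1
    return inserted
```

Calling `conn.commit()` after each page, as the question does, keeps the rows from earlier pages even if a later page fails.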

IndexError: list index out of range while extracting a table
