Why is my index out of range? IndexError: list index out of range

Date: 2020-07-05 23:54:48

Tags: python python-3.x web-scraping pycharm

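For a while now I have been trying to build a web scraper for Rightmove, but I am stuck on an error message saying that my list index is out of range. The IDE flags no errors in the code, but at runtime the script fails before it exports any data to the CSV file.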

Error message:

HTTP GET request to URL: https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E93917&index=0&propertyTypes=&mustHave=&dontShow=&furnishTypes=&keywords= | Status code: 200
HTTP GET request to URL: https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E93917&index=24&propertyTypes=&mustHave=&dontShow=&furnishTypes=&keywords= | Status code: 200
HTTP GET request to URL: https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E93917&index=48&propertyTypes=&mustHave=&dontShow=&furnishTypes=&keywords= | Status code: 200
HTTP GET request to URL: https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E93917&index=72&propertyTypes=&mustHave=&dontShow=&furnishTypes=&keywords= | Status code: 200
HTTP GET request to URL: https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E93917&index=96&propertyTypes=&mustHave=&dontShow=&furnishTypes=&keywords= | Status code: 200
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Program Files\JetBrains\PyCharm 2020.1.2\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm 2020.1.2\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/Me/PycharmProjects/myrightmove/script.py", line 61, in <module>
    scraper.run()
  File "C:/Users/Me/PycharmProjects/myrightmove/script.py", line 56, in run
    self.to_csv()
  File "C:/Users/Me/PycharmProjects/myrightmove/script.py", line 40, in to_csv
    writer = csv.DictWriter(csv_file, fieldnames=self.results[0].keys())
IndexError: list index out of range

Working example:

import requests
from bs4 import BeautifulSoup
import csv


class RightmoveScraper:
    results = []

    def fetch(self, url):
        print('HTTP GET request to URL: %s' % url, end='')
        response = requests.get(url)
        print(' | Status code: %s' % response.status_code)

        return response

    def parse(self, html):
        content = BeautifulSoup(html, 'lxml')

        titles = [title.text.strip() for title in content.findAll('h2', {'class': 'propertyCard.title'})]
        addresses = [address['content'] for address in content.findAll('meta', {'itemprop': 'streetAddr'})]
        descriptions = [description.text for description in content.findAll('span', {'data-test': 'property-description'})]
        prices = [price.text.strip() for price in content.findAll('div', {'class': 'propertyCard-priceValue'})]
        dates = [date.text.split(' ')[-1] for date in content.findAll('span', {'class': 'propertyCard-branchSummary-addedOrReduced'})]
        sellers = [seller.text.split('by')[-1].strip() for seller in content.findAll('span', {'class': 'propertyCard-branchSummary-branchName'})]
        images = [image['src'] for image in content.findAll('img', {'itemprop': 'image'})]

        for index in range(0, len(titles)):
            self.results.append({
                'title': titles[index],
                'address': addresses[index],
                'description': descriptions[index],
                'price': prices[index],
                'date': dates[index],
                'seller': sellers[index],
                'image': images[index],
            })

    def to_csv(self):
        with open('rightmove.csv', 'w') as csv_file:
            writer = csv.DictWriter(csv_file, fieldnames=self.results[0].keys())
            writer.writeheader()

            for row in self.results:
                writer.writerow(row)

            print('Stored results to "rightmove.csv"')

    def run(self):
        for page in range(0, 5):
            index = page * 24
            url = 'https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E93917&index=' + str(index) + '&propertyTypes=&mustHave=&dontShow=&furnishTypes=&keywords='

            response = self.fetch(url)
            self.parse(response.text)

        self.to_csv()


if __name__ == '__main__':
    scraper = RightmoveScraper()
    scraper.run()

Any ideas on how to solve this kind of problem?

1 answer:

Answer 0: (score: 1)

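If you trace the error back and print the value of self.results inside the parse() function, it becomes clear that, for some reason, nothing is ever appended to self.results. The list is therefore still empty when to_csv() evaluates self.results[0], and that is what raises the IndexError.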
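To illustrate why the traceback ends in to_csv() rather than in parse(), here is a minimal sketch, using only the standard library, of what an empty results list does to the CSV step, together with one possible guard. The function name write_rows and the sample row are hypothetical and are not part of the original script.

```python
import csv
import io


def write_rows(results, out):
    """Write a list of dicts as CSV and return the number of rows written."""
    if not results:
        # results[0] would raise IndexError here, just like in the traceback
        print('No results to write; check the selectors in parse()')
        return 0

    writer = csv.DictWriter(out, fieldnames=list(results[0].keys()))
    writer.writeheader()
    writer.writerows(results)
    return len(results)


# An empty list is handled without touching results[0]
empty_buffer = io.StringIO()
written_empty = write_rows([], empty_buffer)

# A non-empty list is written as usual (made-up sample row)
full_buffer = io.StringIO()
written_full = write_rows([{'title': 'Example flat', 'price': '100'}], full_buffer)
```

A guard like this does not fix the scraper. It only turns the confusing IndexError into a message that points back at parse().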

I checked the titles field, and it looks like you have a typo: you are searching for propertyCard.title, whereas you should probably be searching for propertyCard-title.
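As a rough illustration of why that single character matters, the sketch below counts h2 tags by class name with the standard library's html.parser instead of BeautifulSoup, so it runs without extra packages. The ClassCounter class, the count_h2 helper, and the sample markup are all made up for this example; the point is only that a class name is compared as an exact string, so the dotted spelling matches nothing.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count how many <h2> tags carry a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != 'h2':
            return
        # The class attribute may hold several space-separated names
        classes = (dict(attrs).get('class') or '').split()
        if self.class_name in classes:
            self.count += 1


def count_h2(html, class_name):
    counter = ClassCounter(class_name)
    counter.feed(html)
    return counter.count


# Made-up markup, only loosely modelled on a listing page
sample_html = (
    '<h2 class="propertyCard-title">3 bedroom house</h2>'
    '<h2 class="propertyCard-title">2 bedroom flat</h2>'
)

dotted = count_h2(sample_html, 'propertyCard.title')      # the typo: no matches
hyphenated = count_h2(sample_html, 'propertyCard-title')  # matches both tags
```

With zero matches, titles is an empty list, the range(0, len(titles)) loop never runs, and self.results stays empty.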

Similarly, you should go through the remaining fields that you add to self.results and look for any mistakes in that part of the code (shown below).

(Hint: check the addresses = ... line and make sure you are using the correct itemprop value.)

titles = [title.text.strip() for title in content.findAll('h2', {'class': 'propertyCard-title'})]
addresses = [address['content'] for address in content.findAll('meta', {'itemprop': 'streetAddr'})]
descriptions = [description.text for description in content.findAll('span', {'data-test': 'property-description'})]
prices = [price.text.strip() for price in content.findAll('div', {'class': 'propertyCard-priceValue'})]
dates = [date.text.split(' ')[-1] for date in content.findAll('span', {'class': 'propertyCard-branchSummary-addedOrReduced'})]
sellers = [seller.text.split('by')[-1].strip() for seller in content.findAll('span', {'class': 'propertyCard-branchSummary-branchName'})]
images = [image['src'] for image in content.findAll('img', {'itemprop': 'image'})]

for index in range(0, len(titles)):
    self.results.append({
        'title': titles[index],
        'address': addresses[index],
        'description': descriptions[index],
        'price': prices[index],
        'date': dates[index],
        'seller': sellers[index],
        'image': images[index],
    })
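Once titles is no longer empty, any other field list that comes back shorter will raise the same IndexError inside the loop. One way to catch that early, sketched here under the assumption that every listing should contribute exactly one value per field, is to compare the list lengths before combining them. The helper name combine_fields and the sample values are hypothetical.

```python
def combine_fields(**columns):
    """Zip equally long lists into row dicts; raise if the lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        # Naming every length shows which selector found too few elements
        raise ValueError('Field lists differ in length: %s' % lengths)

    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]


# Lists of equal length combine into one dict per listing
rows = combine_fields(title=['House', 'Flat'], price=['100', '200'])

# A selector that matched nothing is reported by name
mismatch_message = None
try:
    combine_fields(title=['House', 'Flat'], address=[])
except ValueError as error:
    mismatch_message = str(error)
```

The error message then lists the length of each field, which makes it easier to spot which findAll call needs a corrected selector.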