I've been trying to build a web scraper for Rightmove for a while, but I've run into trouble with an error message saying my list index is out of range. There are no syntax errors in the code, but at runtime it refuses to export the data to a CSV file.
Error message:
HTTP GET request to URL: https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E93917&index=0&propertyTypes=&mustHave=&dontShow=&furnishTypes=&keywords= | Status code: 200
HTTP GET request to URL: https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E93917&index=24&propertyTypes=&mustHave=&dontShow=&furnishTypes=&keywords= | Status code: 200
HTTP GET request to URL: https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E93917&index=48&propertyTypes=&mustHave=&dontShow=&furnishTypes=&keywords= | Status code: 200
HTTP GET request to URL: https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E93917&index=72&propertyTypes=&mustHave=&dontShow=&furnishTypes=&keywords= | Status code: 200
HTTP GET request to URL: https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E93917&index=96&propertyTypes=&mustHave=&dontShow=&furnishTypes=&keywords= | Status code: 200
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Program Files\JetBrains\PyCharm 2020.1.2\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "C:\Program Files\JetBrains\PyCharm 2020.1.2\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Users/Me/PycharmProjects/myrightmove/script.py", line 61, in <module>
scraper.run()
File "C:/Users/Me/PycharmProjects/myrightmove/script.py", line 56, in run
self.to_csv()
File "C:/Users/Me/PycharmProjects/myrightmove/script.py", line 40, in to_csv
writer = csv.DictWriter(csv_file, fieldnames=self.results[0].keys())
IndexError: list index out of range
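The failure does not depend on the scraping at all; a minimal sketch (with an in-memory buffer standing in for rightmove.csv) reproduces it whenever the results list is empty:

```python
import csv
import io

results = []  # parse() appended nothing, so the list is empty

try:
    # results[0] raises IndexError before DictWriter even sees the file
    csv.DictWriter(io.StringIO(), fieldnames=results[0].keys())
except IndexError as err:
    error = str(err)

print('IndexError:', error)  # IndexError: list index out of range
```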
My code:
import requests
from bs4 import BeautifulSoup
import csv


class RightmoveScraper:
    results = []

    def fetch(self, url):
        print('HTTP GET request to URL: %s' % url, end='')
        response = requests.get(url)
        print(' | Status code: %s' % response.status_code)
        return response

    def parse(self, html):
        content = BeautifulSoup(html, 'lxml')
        titles = [title.text.strip() for title in content.findAll('h2', {'class': 'propertyCard.title'})]
        addresses = [address['content'] for address in content.findAll('meta', {'itemprop': 'streetAddr'})]
        descriptions = [description.text for description in content.findAll('span', {'data-test': 'property-description'})]
        prices = [price.text.strip() for price in content.findAll('div', {'class': 'propertyCard-priceValue'})]
        dates = [date.text.split(' ')[-1] for date in content.findAll('span', {'class': 'propertyCard-branchSummary-addedOrReduced'})]
        sellers = [seller.text.split('by')[-1].strip() for seller in content.findAll('span', {'class': 'propertyCard-branchSummary-branchName'})]
        images = [image['src'] for image in content.findAll('img', {'itemprop': 'image'})]

        for index in range(0, len(titles)):
            self.results.append({
                'title': titles[index],
                'address': addresses[index],
                'description': descriptions[index],
                'price': prices[index],
                'date': dates[index],
                'seller': sellers[index],
                'image': images[index],
            })

    def to_csv(self):
        with open('rightmove.csv', 'w') as csv_file:
            writer = csv.DictWriter(csv_file, fieldnames=self.results[0].keys())
            writer.writeheader()
            for row in self.results:
                writer.writerow(row)
        print('Stored results to "rightmove.csv"')

    def run(self):
        for page in range(0, 5):
            index = page * 24
            url = 'https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E93917&index=' + str(index) + '&propertyTypes=&mustHave=&dontShow=&furnishTypes=&keywords='
            response = self.fetch(url)
            self.parse(response.text)
        self.to_csv()


if __name__ == '__main__':
    scraper = RightmoveScraper()
    scraper.run()
Any ideas on how to fix this kind of problem?
Answer 0 (score: 1)
If you trace the error back and print out the value of self.results inside the parse() function, it becomes clear that, for some reason, nothing is ever appended to self.results.
I checked the titles field and it looks like you have a typo: you are searching for propertyCard.title when you should probably be searching for propertyCard-title.
Similarly, you should go through the rest of the fields that get appended to self.results and try to find any mistakes in that part of the code (shown below). (Hint: check the addresses = ... line and make sure the itemprop value is correct.)
titles = [title.text.strip() for title in content.findAll('h2', {'class': 'propertyCard-title'})]
addresses = [address['content'] for address in content.findAll('meta', {'itemprop': 'streetAddr'})]
descriptions = [description.text for description in content.findAll('span', {'data-test': 'property-description'})]
prices = [price.text.strip() for price in content.findAll('div', {'class': 'propertyCard-priceValue'})]
dates = [date.text.split(' ')[-1] for date in content.findAll('span', {'class': 'propertyCard-branchSummary-addedOrReduced'})]
sellers = [seller.text.split('by')[-1].strip() for seller in content.findAll('span', {'class': 'propertyCard-branchSummary-branchName'})]
images = [image['src'] for image in content.findAll('img', {'itemprop': 'image'})]

for index in range(0, len(titles)):
    self.results.append({
        'title': titles[index],
        'address': addresses[index],
        'description': descriptions[index],
        'price': prices[index],
        'date': dates[index],
        'seller': sellers[index],
        'image': images[index],
    })
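Independently of the selector fix, you could also make the CSV export refuse to run on an empty result set, so a selector typo fails with a clear message instead of an IndexError. A minimal sketch, written as a standalone function rather than your method (the newline='' argument additionally avoids blank rows on Windows):

```python
import csv


def to_csv(results, path='rightmove.csv'):
    # Guard: an empty scrape means the selectors matched nothing
    if not results:
        print('No results scraped - check your selectors before exporting.')
        return

    with open(path, 'w', newline='') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=results[0].keys())
        writer.writeheader()
        writer.writerows(results)

    print('Stored %d results to "%s"' % (len(results), path))
```

Calling it with an empty list now prints a diagnostic and returns instead of crashing on results[0].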