I've been trying to build a web scraper for Rightmove for a while, but I've run into trouble with an error message saying my list index is out of range. There are no syntax errors in the code, but at runtime it refuses to export the data to a CSV file.
Error message:
HTTP GET request to URL: https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E93917&index=0&propertyTypes=&mustHave=&dontShow=&furnishTypes=&keywords= | Status code: 200
HTTP GET request to URL: https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E93917&index=24&propertyTypes=&mustHave=&dontShow=&furnishTypes=&keywords= | Status code: 200
HTTP GET request to URL: https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E93917&index=48&propertyTypes=&mustHave=&dontShow=&furnishTypes=&keywords= | Status code: 200
HTTP GET request to URL: https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E93917&index=72&propertyTypes=&mustHave=&dontShow=&furnishTypes=&keywords= | Status code: 200
HTTP GET request to URL: https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E93917&index=96&propertyTypes=&mustHave=&dontShow=&furnishTypes=&keywords= | Status code: 200
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Program Files\JetBrains\PyCharm 2020.1.2\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "C:\Program Files\JetBrains\PyCharm 2020.1.2\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Users/Me/PycharmProjects/myrightmove/script.py", line 61, in <module>
scraper.run()
File "C:/Users/Me/PycharmProjects/myrightmove/script.py", line 56, in run
self.to_csv()
File "C:/Users/Me/PycharmProjects/myrightmove/script.py", line 40, in to_csv
writer = csv.DictWriter(csv_file, fieldnames=self.results[0].keys())
IndexError: list index out of range
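The failure does not depend on the scraping at all; a minimal sketch (with an in-memory buffer standing in for rightmove.csv) reproduces it whenever the results list is empty:

```python
import csv
import io

results = []  # parse() appended nothing, so the list is empty

try:
    # results[0] raises IndexError before DictWriter even sees the file
    csv.DictWriter(io.StringIO(), fieldnames=results[0].keys())
except IndexError as err:
    error = str(err)

print('IndexError:', error)  # IndexError: list index out of range
```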
My code:
import requests
from bs4 import BeautifulSoup
import csv


class RightmoveScraper:
    results = []

    def fetch(self, url):
        print('HTTP GET request to URL: %s' % url, end='')
        response = requests.get(url)
        print(' | Status code: %s' % response.status_code)
        return response

    def parse(self, html):
        content = BeautifulSoup(html, 'lxml')
        titles = [title.text.strip() for title in content.findAll('h2', {'class': 'propertyCard.title'})]
        addresses = [address['content'] for address in content.findAll('meta', {'itemprop': 'streetAddr'})]
        descriptions = [description.text for description in content.findAll('span', {'data-test': 'property-description'})]
        prices = [price.text.strip() for price in content.findAll('div', {'class': 'propertyCard-priceValue'})]
        dates = [date.text.split(' ')[-1] for date in content.findAll('span', {'class': 'propertyCard-branchSummary-addedOrReduced'})]
        sellers = [seller.text.split('by')[-1].strip() for seller in content.findAll('span', {'class': 'propertyCard-branchSummary-branchName'})]
        images = [image['src'] for image in content.findAll('img', {'itemprop': 'image'})]

        for index in range(0, len(titles)):
            self.results.append({
                'title': titles[index],
                'address': addresses[index],
                'description': descriptions[index],
                'price': prices[index],
                'date': dates[index],
                'seller': sellers[index],
                'image': images[index],
            })

    def to_csv(self):
        with open('rightmove.csv', 'w') as csv_file:
            writer = csv.DictWriter(csv_file, fieldnames=self.results[0].keys())
            writer.writeheader()
            for row in self.results:
                writer.writerow(row)
        print('Stored results to "rightmove.csv"')

    def run(self):
        for page in range(0, 5):
            index = page * 24
            url = 'https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E93917&index=' + str(index) + '&propertyTypes=&mustHave=&dontShow=&furnishTypes=&keywords='
            response = self.fetch(url)
            self.parse(response.text)
        self.to_csv()


if __name__ == '__main__':
    scraper = RightmoveScraper()
    scraper.run()
Any ideas on how to fix this kind of problem?
Answer 0 (score: 1)
If you trace the error back and print out the value of self.results inside the parse() function, it becomes clear that, for some reason, nothing is ever appended to self.results.
I checked the titles field and it looks like you have a typo: you are searching for propertyCard.title when you should probably be searching for propertyCard-title.
Similarly, you should go through the rest of the fields that get appended to self.results and try to find any mistakes in that part of the code (shown below). (Hint: check the addresses = ... line and make sure the itemprop value is correct.)
titles = [title.text.strip() for title in content.findAll('h2', {'class': 'propertyCard-title'})]
addresses = [address['content'] for address in content.findAll('meta', {'itemprop': 'streetAddr'})]
descriptions = [description.text for description in content.findAll('span', {'data-test': 'property-description'})]
prices = [price.text.strip() for price in content.findAll('div', {'class': 'propertyCard-priceValue'})]
dates = [date.text.split(' ')[-1] for date in content.findAll('span', {'class': 'propertyCard-branchSummary-addedOrReduced'})]
sellers = [seller.text.split('by')[-1].strip() for seller in content.findAll('span', {'class': 'propertyCard-branchSummary-branchName'})]
images = [image['src'] for image in content.findAll('img', {'itemprop': 'image'})]

for index in range(0, len(titles)):
    self.results.append({
        'title': titles[index],
        'address': addresses[index],
        'description': descriptions[index],
        'price': prices[index],
        'date': dates[index],
        'seller': sellers[index],
        'image': images[index],
    })
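Independently of the selector fix, you could also make the CSV export refuse to run on an empty result set, so a selector typo fails with a clear message instead of an IndexError. A minimal sketch, written as a standalone function rather than your method (the newline='' argument additionally avoids blank rows on Windows):

```python
import csv


def to_csv(results, path='rightmove.csv'):
    # Guard: an empty scrape means the selectors matched nothing
    if not results:
        print('No results scraped - check your selectors before exporting.')
        return

    with open(path, 'w', newline='') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=results[0].keys())
        writer.writeheader()
        writer.writerows(results)

    print('Stored %d results to "%s"' % (len(results), path))
```

Calling it with an empty list now prints a diagnostic and returns instead of crashing on results[0].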