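I'm new to Python and just trying to build a web scraper. I can't figure out why I get a "list index out of range" error when the index variable is set to 0, before I have even started building the list.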
import requests
from bs4 import BeautifulSoup


def kijiji_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "http://www.kijiji.ca/b-cars-trucks/alberta/convertible__coupe__hatchback__other+body+type__sedan__wagon/page-" + str(page) + "/c174l9003a138?price=__5000"
        sourcecode = requests.get(url)
        plain_text = sourcecode.text
        soup = BeautifulSoup(plain_text)
        a = 0
        lista = []
        for link in soup.find_all("a", {"class": "title"}):
            if a == 0:
                href = "|http://www.kijiji.ca" + link.get("href")
                lista.append(href)
            elif a != 0:
                href = "http://www.kijiji.ca" + link.get("href")
                lista.append(href)
            a += 1
        i = 0
        listb = []
        for link in soup.find_all("a", {"class": "title"}):
            title = link.string
            listb[i] = listb[i] + "|" + title.strip()
            i += 1
        x = 0
        listc = []
        for other in soup.find_all("td", {"class": "price"}):
            price = other.string
            listc[x] = listc[x] + "|" + price.strip()
            x += 1
        page += 1
    print(lista)
    print(listb)
    print(listc)


kijiji_spider(1)
Answer 0 (score: 2)
Your listb is empty, and then you try to access item 0 in it. Since the list is empty there is nothing to access, so you get an IndexError exception:
i = 0
listb = []
for link in soup.find_all("a", {"class": "title"}):
    title = link.string
    listb[i] = listb[i] + "|" + title.strip()
    i += 1
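The same failure can be reproduced without any scraping. A minimal sketch, where "Some title" is only a placeholder value: the right-hand side of the assignment reads listb[0] before anything has been stored there, and that read is what raises the error.

```python
# Reproduce the error from the loop above with no network access.
listb = []  # empty list: there is no index 0 yet

try:
    # The right-hand side is evaluated first, so reading listb[0] fails
    # before the assignment is even attempted.
    listb[0] = listb[0] + "|" + "Some title"
except IndexError as exc:
    error_message = str(exc)

print(error_message)
```

The listc loop in the question fails for the same reason.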
I think what you are trying to do here is append to the values in the first list you created (lista), so you probably want listb.append(lista[i] + '|' + title.strip()).
You don't need a list counter in Python; just append to the list and it grows automatically.
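As a rough illustration of that point, here is a sketch using made-up placeholder data (the links and titles below are invented, not real listings). It builds the combined strings with append() and shows enumerate() for the cases where a position is actually needed.

```python
# Placeholder data standing in for the scraped values (not real listings).
links = ["/v-cars/a", "/v-cars/b"]
titles = ["  Honda Civic ", " Ford Focus  "]

# append() grows the list one item at a time; no manual index is needed.
rows = []
for link, title in zip(links, titles):
    rows.append(link + "|" + title.strip())

# If a position is genuinely required, enumerate() supplies it.
numbered = ["{}: {}".format(n, row) for n, row in enumerate(rows)]

print(rows)
print(numbered)
```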
I'm not sure why you are adding | in front of the URL, but your whole code can be simplified to the following:
import requests
from bs4 import BeautifulSoup


def kijiji_spider(max_pages):
    page = 1
    collected_urls = []  # one entry per page, each holding that page's results
    while page <= max_pages:
        url = "http://www.kijiji.ca/b-cars-trucks/alberta/convertible__coupe__hatchback__other+body+type__sedan__wagon/page-" + str(page) + "/c174l9003a138?price=__5000"
        sourcecode = requests.get(url)
        plain_text = sourcecode.text
        soup = BeautifulSoup(plain_text, "html.parser")  # name the parser explicitly
        links = [i.get('href') for i in soup.find_all('a', {'class': 'title'})]
        titles = [i.string.strip() for i in soup.find_all('a', {'class': 'title'})]
        prices = [i.string.strip() for i in soup.find_all("td", {"class": "price"})]
        # list() so the rows can be iterated more than once on Python 3
        results = list(zip(links, titles, prices))
        collected_urls.append(results)
        page += 1
    return collected_urls  # without this the function returns None


data = kijiji_spider(5)
for results in data:
    for link, title, price in results:
        print('http://www.kijiji.ca{} | {} | {}'.format(link, title, price))
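One caveat with combining the three lists through zip(): it stops at the shortest input, so a listing without a price cell silently drops rows from the end. The sketch below uses invented sample data to show the difference between zip() and itertools.zip_longest(); whether the real pages ever omit a price is an assumption here, not something confirmed by the question.

```python
from itertools import zip_longest

# Invented sample data: one listing has no price cell, so prices is shorter.
links = ["/v-cars/a", "/v-cars/b", "/v-cars/c"]
titles = ["Honda Civic", "Ford Focus", "Mazda 3"]
prices = ["$4,500.00", "$3,200.00"]

# zip() stops at the shortest input, so the third listing is dropped.
truncated = list(zip(links, titles, prices))

# zip_longest() keeps every listing and fills the gap with a marker value.
padded = list(zip_longest(links, titles, prices, fillvalue="N/A"))

print(len(truncated), len(padded))
```

Padding keeps the row count intact but cannot tell which listing lost its price, so values may still end up next to the wrong title. If that matters, extracting link, title, and price from each listing's own container element is likely the safer design.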