I'm trying to create a simple scraper that extracts metadata from a website and saves the information to a CSV. So far I've been following a few guides, but I'm now stuck on this error:
IndexError: list index out of range
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re
# Copy all of the content from the provided web page
webpage = urlopen('http://www.tidyawaytoday.co.uk/').read()
# Grab everything that lies between the title tags using a REGEX
patFinderTitle = re.compile('<title>(.*)</title>')
# Grab the link to the original article using a REGEX
patFinderLink = re.compile('<link rel.*href="(.*)" />')
# Store all of the titles and links found in 2 lists
findPatTitle = re.findall(patFinderTitle,webpage)
findPatLink = re.findall(patFinderLink,webpage)
# Create an iterator that will cycle through the first 16 articles and skip a few
listIterator = []
listIterator[:] = range(2,16)
# Print out the results to screen
for i in listIterator:
    print findPatTitle[i] # The title
    print findPatLink[i] # The link to the original article
    articlePage = urlopen(findPatLink[i]).read() # Grab all of the content from original article
    divBegin = articlePage.find('<div>') # Locate the div provided
    article = articlePage[divBegin:(divBegin+1000)] # Copy the first 1000 characters after the div
    # Pass the article to the Beautiful Soup Module
    soup = BeautifulSoup(article)
    # Tell Beautiful Soup to locate all of the p tags and store them in a list
    paragList = soup.findAll('p')
    # Print all of the paragraphs to screen
    for i in paragList:
        print i
        print '\n'
# Here I retrieve and print to screen the titles and links with just Beautiful Soup
soup2 = BeautifulSoup(webpage)
print soup2.findAll('title')
print soup2.findAll('link')
titleSoup = soup2.findAll('title')
linkSoup = soup2.findAll('link')
for i in listIterator:
    print titleSoup[i]
    print linkSoup[i]
    print '\n'
Any help would be greatly appreciated.
The error I get is:
File "C:\Users......", line 24, in (module)
print findPatTitle[i] # the title
IndexError:list of index out of range
Thank you.
Answer 0 (score: 0)
It doesn't seem like you're using all the power that bs4 can give you.
You're getting this error because findPatTitle has a length of only one: an HTML document usually contains just a single title element, so the regex finds only one match.
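If you want to confirm this yourself, a quick sanity check (a hypothetical snippet, assuming the variables from your question are in scope) is to print the list lengths before indexing into them:

print len(findPatTitle)  # typically 1 -- one <title> tag per document
print len(findPatLink)   # only a handful of <link rel=... /> matches

Any index at or beyond those lengths raises exactly the IndexError you're seeing.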
A simple way to get the HTML title is to use bs4 itself:
from bs4 import BeautifulSoup
from urllib import urlopen
webpage = urlopen('http://www.tidyawaytoday.co.uk/').read()
soup = BeautifulSoup(webpage)
# get the content of title
title = soup.title.text
If you try to iterate over your findPatLink in the current way, you will probably get the same error, since it has a length of 6. It isn't clear to me whether you want all of the link elements or all of the anchor elements, but sticking with the first idea, you can improve your code using bs4 again:
link_href_list = [link['href'] for link in soup.find_all("link")]
Finally, since you don't want some of those URLs, you can slice link_href_list the way you intended. An improved version of the last expression, excluding the first and second results, could be:
link_href_list = [link['href'] for link in soup.find_all("link")[2:]]
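And since the original goal was to save the metadata to a CSV, here is a minimal sketch of how the results above could be written out with the standard csv module (the file name metadata.csv and the column layout are my own assumptions, not something from your post):

import csv

# Write the page title and each link href as rows of a CSV file.
# 'wb' mode is what the csv module expects on Python 2.
with open('metadata.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'link'])
    for href in link_href_list:
        writer.writerow([title.encode('utf-8'), href.encode('utf-8')])

This assumes title and link_href_list from the snippets above are already defined.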