Python BS4 crawler IndexError

Date: 2014-12-10 11:23:01

Tags: python

I am trying to create a simple scraper that extracts metadata from a website and saves the information to a CSV. I have been following a few guides, but I am now stuck on this error:

IndexError: list index out of range

from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

# Copy all of the content from the provided web page
webpage = urlopen('http://www.tidyawaytoday.co.uk/').read()

# Grab everything that lies between the title tags using a REGEX
patFinderTitle = re.compile('<title>(.*)</title>')

# Grab the link to the original article using a REGEX
patFinderLink = re.compile('<link rel.*href="(.*)" />')

# Store all of the titles and links found in 2 lists
findPatTitle = re.findall(patFinderTitle,webpage)
findPatLink = re.findall(patFinderLink,webpage)

# Create an iterator that will cycle through the first 16 articles and skip a few
listIterator = []
listIterator[:] = range(2,16)

# Print out the results to screen
for i in listIterator:
    print findPatTitle[i] # The title
    print findPatLink[i] # The link to the original article

articlePage = urlopen(findPatLink[i]).read() # Grab all of the content from original article

divBegin = articlePage.find('<div>') # Locate the div provided
article = articlePage[divBegin:(divBegin+1000)] # Copy the first 1000 characters after the div

# Pass the article to the Beautiful Soup Module
soup = BeautifulSoup(article)

# Tell Beautiful Soup to locate all of the p tags and store them in a list
paragList = soup.findAll('p')

# Print all of the paragraphs to screen
for i in paragList:
    print i
    print '\n'

# Here I retrieve and print to screen the titles and links with just Beautiful Soup
soup2 = BeautifulSoup(webpage)

print soup2.findAll('title')
print soup2.findAll('link')

titleSoup = soup2.findAll('title')
linkSoup = soup2.findAll('link')

for i in listIterator:
    print titleSoup[i]
    print linkSoup[i]
    print '\n'

Any help is greatly appreciated.

The error I get is:

File "C:\Users......", line 24, in <module>
   print findPatTitle[i] # the title
IndexError: list index out of range

Thanks.

1 answer:

Answer 0 (score: 0)

It seems you are not using all the power that bs4 can give you.

You are getting this error because findPatTitle has a length of only one: an HTML document usually contains exactly one title element, so the regex finds a single match.
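The failure is easy to reproduce without the network. The snippet below (a minimal illustration, not the asker's actual page) shows that `re.findall` on a typical document yields a one-item list, so indexing it with anything from `range(2, 16)` raises the same IndexError:

```python
import re

# A typical HTML document has exactly one <title> element,
# so findall returns a list with a single match.
html = '<html><head><title>Only Title</title></head><body></body></html>'

titles = re.findall(re.compile('<title>(.*)</title>'), html)
print(len(titles))   # 1

# titles[2] raises IndexError: list index out of range --
# exactly what the loop over range(2, 16) does on the first iteration.
```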

An easy way to get the HTML title is to use bs4 itself:

from bs4 import BeautifulSoup
from urllib import urlopen

webpage = urlopen('http://www.tidyawaytoday.co.uk/').read()
soup = BeautifulSoup(webpage)

# get the content of title
title = soup.title.text

If you try to iterate over findPatLink in the current way, you will probably get the same error, since its length is 6. It is not clear to me whether you want all the link elements or all the anchor elements, but sticking with the first idea, you can again improve your code using bs4:

link_href_list = [link['href'] for link in soup.find_all("link")]

Finally, since you don't want some of the URLs, you can slice link_href_list however you like. An improved version of the last expression that excludes the first and second results would be:

link_href_list = [link['href'] for link in soup.find_all("link")[2:]]
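To round this off with the CSV output the question originally asked for, here is a minimal sketch of the saving step using the standard csv module. The `title` and `link_href_list` values below are placeholders standing in for the results of the bs4 code above, not real output from the site:

```python
import csv

# Placeholders: in the real script these would come from
# soup.title.text and the link_href_list comprehension above.
title = 'Tidy Away Today'
link_href_list = ['http://example.com/a.css', 'http://example.com/b.css']

# Write one row per extracted link, tagged with the page title.
# (On Python 3, pass newline='' to open() to avoid blank rows on Windows.)
with open('links.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'href'])
    for href in link_href_list:
        writer.writerow([title, href])
```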