I wrote a program that scrapes IMDB to save information about the movies I have watched. The list has about 1500 movies. Something is going wrong and I can't figure out what.
Here is the code:
from __future__ import division
import json
import io
from pprint import pprint
import urllib2
from bs4 import BeautifulSoup
import csv
import time
import os
import psutil
links = [u'http://www.imdb.com/title/tt0354899', u'http://www.imdb.com/title/tt1020530', u'http://www.imdb.com/title/tt0099864', u'http://www.imdb.com/title/tt0100157', u'http://www.imdb.com/title/tt0324216', u'http://www.imdb.com/title/tt0054215', u'http://www.imdb.com/title/tt0435625', u'http://www.imdb.com/title/tt0454841', u'http://www.imdb.com/title/tt0450278', u'http://www.imdb.com/title/tt1179904']
# relevant part starts here; "links" is a list of links to IMDB pages
t0 = time.clock()  # reference point for the elapsed-time printout below
information = [["title", "date", "duration", "rating"]]
for number, link in enumerate(links):
    response = urllib2.urlopen(link)
    html = response.read()
    soup = BeautifulSoup(html, "lxml")
    print "\n", link
    print "Percentage completed", number / len(links)
    print "Time elapsed", time.clock() - t0
    try:
        date = soup.find(itemprop="datePublished")["content"]
        title = soup.find(property='og:title')["content"]
        duration = soup.find(itemprop="duration").string.replace(" ", "").replace("\n", "")
        rating = soup.find(itemprop="ratingValue").string
        print date, title, duration, rating
    except KeyError:
        print "There was a problem"
Here is the MemoryError it throws after running for a while:
Traceback (most recent call last):
File "C:/Users/pplsuser/Dropbox/Stupidaggini/IMDBscraper/scraper.py", line 28, in <module>
html = response.read()
File "C:\Python27\lib\socket.py", line 362, in read
buf.write(data)
MemoryError: out of memory
I'm not sure what is happening, so I'm not sure how to fix it. Memory consumption seems to keep growing with every iteration. Since there are many links, maybe something is being retained step by step and eating up more and more memory, but I can't spot what would be doing that.
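To help narrow down where the memory goes, here is a minimal diagnostic sketch (standard library only, separate from the scraper above): count the objects tracked by the garbage collector before and after a loop body. If the count climbs steadily across iterations in the real script, something is holding references alive. The `history` list below is a hypothetical stand-in for whatever accumulates in the scraper, not code from the question.

```python
import gc

def tracked_objects():
    """Return the number of container objects the GC currently tracks."""
    gc.collect()  # discard anything that is already unreachable
    return len(gc.get_objects())

baseline = tracked_objects()

# Hypothetical loop: each appended list stays referenced by 'history',
# so the tracked-object count grows and never comes back down.
history = []
for i in range(100):
    history.append([i])

grown = tracked_objects()
print(grown - baseline)  # a steadily positive delta signals retained objects
```

Logging this delta once per link in the scraping loop would show whether objects pile up inside the loop itself or somewhere else.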