MemoryError when scraping movie data from IMDB

Time: 2017-08-06 10:03:01

Tags: python web-scraping out-of-memory

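I wrote a program that scrapes IMDB to save information about the movies I have watched. There are about 1500 movies in the list. Something is going wrong, and I can't figure out what.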

Here is the code:

from __future__ import division
import json
import io
from pprint import pprint
import urllib2
from bs4 import BeautifulSoup
import csv
import time
import os
import psutil


links = [u'http://www.imdb.com/title/tt0354899', u'http://www.imdb.com/title/tt1020530', u'http://www.imdb.com/title/tt0099864', u'http://www.imdb.com/title/tt0100157', u'http://www.imdb.com/title/tt0324216', u'http://www.imdb.com/title/tt0054215', u'http://www.imdb.com/title/tt0435625', u'http://www.imdb.com/title/tt0454841', u'http://www.imdb.com/title/tt0450278', u'http://www.imdb.com/title/tt1179904']



# relevant part starts here. "Links" is a list of links to IMDB pages

information = [["title", "date", "duration", "rating"]]
t0 = time.clock()  # start time, used below for "Time elapsed"
for number, link in enumerate(links):
    response = urllib2.urlopen(link)
    html = response.read()
    soup = BeautifulSoup(html, "lxml")
    print "\n", link
    print "Percentage completed", number/len(links)
    print "Time elapsed", time.clock() - t0
    try:
        date = soup.find(itemprop="datePublished")["content"]
        title = soup.find(property='og:title')["content"]
        duration = soup.find(itemprop="duration").string.replace(" ", "").replace("\n", "")
        rating = soup.find(itemprop="ratingValue").string
        print date, title, duration, rating
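    # soup.find() returns None when a tag is missing, which raises
    # TypeError or AttributeError rather than KeyError
    except (KeyError, TypeError, AttributeError):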
        print "There was a problem"

This is the MemoryError that is thrown after a while:

Traceback (most recent call last):
  File "C:/Users/pplsuser/Dropbox/Stupidaggini/IMDBscraper/scraper.py", line 28, in <module>
    html = response.read()
  File "C:\Python27\lib\socket.py", line 362, in read
    buf.write(data)
MemoryError: out of memory
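
The traceback points at `response.read()`, which loads the whole body into memory in one call. A minimal sketch of one defensive measure, not a confirmed fix for this error: read the body in chunks and give up once it exceeds a cap. The helper name `read_limited` and the default sizes are hypothetical and chosen only for illustration; the demonstration uses `io.BytesIO` as a stand-in for a real HTTP response so that it runs offline.

```python
import io


def read_limited(response, max_bytes=5 * 1024 * 1024, chunk_size=64 * 1024):
    """Read a file-like response in chunks, refusing bodies over max_bytes.

    Raises ValueError if the body is larger than max_bytes, so an
    unexpectedly huge page fails loudly instead of exhausting memory.
    """
    chunks = []
    total = 0
    while True:
        chunk = response.read(chunk_size)
        if not chunk:
            break  # end of body
        total += len(chunk)
        if total > max_bytes:
            raise ValueError("response body exceeds %d bytes" % max_bytes)
        chunks.append(chunk)
    return b"".join(chunks)


# Offline demonstration with in-memory stand-ins for an HTTP response.
small_body = read_limited(io.BytesIO(b"x" * 1000), max_bytes=2000, chunk_size=256)
try:
    read_limited(io.BytesIO(b"x" * 5000), max_bytes=2000, chunk_size=256)
    too_large_rejected = False
except ValueError:
    too_large_rejected = True
```

In the loop from the question, this would stand in for `html = response.read()`, followed by `response.close()` once the body has been read. If the cap is never hit, the page sizes are probably not the problem and the growth is coming from somewhere else.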

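I'm not sure what is happening, so I'm not sure how to fix it. Memory consumption seems to keep growing. There are many links, so perhaps something is being kept on each iteration and taking up more and more memory, but I can't spot what might be doing that.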
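
One way to check whether each iteration really keeps memory alive is to record the allocated memory after every link. The following is a rough sketch under stated assumptions: it is written for Python 3, because `tracemalloc` is not in the Python 2.7 standard library, and `measure_growth`, `leaky` and `clean` are hypothetical names that stand in for the real per-link work so the example runs without network access.

```python
import tracemalloc


def measure_growth(process_item, items):
    """Run process_item on each item and record traced memory after each call.

    Returns a list of (item, bytes_currently_allocated) tuples.
    """
    tracemalloc.start()
    readings = []
    try:
        for item in items:
            process_item(item)
            current, _peak = tracemalloc.get_traced_memory()
            readings.append((item, current))
    finally:
        tracemalloc.stop()
    return readings


# Hypothetical stand-ins for the real per-link work.
leaked = []


def leaky(link):
    # Keeps every "page" alive by storing it in a module-level list.
    leaked.append(bytearray(100000))


def clean(link):
    # The "page" is a local and is freed when the function returns.
    page = bytearray(100000)
    return len(page)


links = ["link-%d" % i for i in range(5)]
leaky_readings = measure_growth(leaky, links)
clean_readings = measure_growth(clean, links)
```

Readings that climb steadily, as with `leaky`, suggest that something still references each page after its iteration ends. One candidate worth checking, although the code shown never appends to `information`: values taken from `.string` are BeautifulSoup `NavigableString` objects, which carry a reference to the whole parse tree, so storing them without converting to a plain string keeps every parsed page in memory.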

0 answers:

No answers