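I am trying to write code that scrapes data from the IMDb Top 250 page. My code is below. It works and gives the results I expect, but the problem is the number of results it returns. When I run it on my laptop it produces 23 results, i.e. the first 23 movies listed by IMDb. When one of my friends runs it, it produces the correct 250 results. Why does this happen, and what should I do to avoid it?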
from bs4 import BeautifulSoup
import requests
import sys
from StringIO import StringIO
try:
    import cPickle as pickle
except:
    import pickle
url = 'http://www.imdb.com/chart/top'
response = requests.get(url)
soup = BeautifulSoup(response.text)
movies = soup.select('td.titleColumn')
links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value') for b in soup.select('td.titleColumn span[name=ir]')]
votes = [b.attrs.get('data-value') for b in soup.select('td.ratingColumn strong')]
imdb = []
print(len(movies))
for index in range(0, len(movies)):
    data = {"movie": movies[index].get_text(),
            "link": links[index],
            "starCast": crew[index],
            "rating": ratings[index],
            "vote": votes[index]}
    imdb.append(data)
print(imdb)
Test run result from my laptop:
['9.21', '9.176', '9.015', '8.935', '8.914', '8.903', '8.892', '8.889', '8.877', '8.817', '8.786', '8.76', '8.737', '8.733', '8.716', '8.703', '8.7', '8.69', '8.69', '8.678', '8.658', '8.629', '8.619']
23
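One way to narrow down where the rows go missing is to count the title cells in the raw response without going through BeautifulSoup at all. The sketch below does that with the standard library's html.parser; `CellCounter` and `count_cells` are hypothetical names introduced here, and the sample HTML is made up for illustration. If this count is 250 while soup.select returns 23, the server sent the full page and the loss happens during parsing. When no parser is named, BeautifulSoup picks one based on what is installed, so two machines can parse the same HTML differently. That is one plausible cause here, not a confirmed diagnosis.

```python
from html.parser import HTMLParser


class CellCounter(HTMLParser):
    """Hypothetical helper: count <td> start tags carrying a given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and self.class_name in classes:
            self.count += 1


def count_cells(html, class_name="titleColumn"):
    """Return how many <td class="..."> cells with class_name appear in html."""
    counter = CellCounter(class_name)
    counter.feed(html)
    counter.close()
    return counter.count


# Made-up sample standing in for response.text
sample = (
    "<table>"
    "<tr><td class='titleColumn'>1. A (1994)</td></tr>"
    "<tr><td class='titleColumn'>2. B (1972)</td></tr>"
    "</table>"
)
print(count_cells(sample))  # 2
```

Running count_cells(response.text) on both machines and comparing it with len(movies) shows whether the difference is in what was downloaded or in how it was parsed. Passing the parser explicitly, for example BeautifulSoup(response.text, 'html.parser'), removes one machine-dependent variable.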
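Answer 0 (score: 0)
I realize this is a very old question, but I liked the idea enough to make the code work better. It now exposes more of the individual data fields through variables. I fixed it for myself, but I wanted to share it here in the hope that it helps someone else.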
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import re
# Download IMDB's Top 250 data
url = 'http://www.imdb.com/chart/top'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
movies = soup.select('td.titleColumn')
links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value') for b in soup.select('td.posterColumn span[name=ir]')]
votes = [b.attrs.get('data-value') for b in soup.select('td.ratingColumn strong')]
imdb = []
# Store each item in a dictionary (data), then append it to a list (imdb)
for index in range(0, len(movies)):
    # Separate "2. The Godfather (1972)" into 'place', 'title', 'year'
    movie_string = movies[index].get_text()
    movie = ' '.join(movie_string.split())
    # Parse rank, title, and year from the normalized text; the rank's
    # width varies, and titles may themselves contain periods
    match = re.match(r'(\d+)\.\s*(.*?)\s*\((\d{4})\)$', movie)
    place, movie_title, year = match.groups()
    data = {"movie_title": movie_title,
            "year": year,
            "place": place,
            "star_cast": crew[index],
            "rating": ratings[index],
            "vote": votes[index],
            "link": links[index]}
    imdb.append(data)
# Print out some info
for item in imdb:
    print(item['place'], '-', item['movie_title'], '('+item['year']+') -', 'Starring:', item['star_cast'])
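The title-splitting step can be exercised on its own, without downloading the page. The sketch below is one possible way to do it, using a regular expression so that two- and three-digit ranks and titles containing periods come out intact. `split_ranked_title` is a hypothetical helper name, and the "rank. title (year)" input format is assumed from the comment in the code above.

```python
import re

# Assumed cell text format: "<rank>. <title> (<4-digit year>)"
RANKED_TITLE = re.compile(r"^(\d+)\.\s*(.*?)\s*\((\d{4})\)$")


def split_ranked_title(text):
    """Hypothetical helper: split cell text into (place, title, year) strings."""
    # Collapse newlines and runs of spaces left over from the HTML layout
    normalized = " ".join(text.split())
    match = RANKED_TITLE.match(normalized)
    if match is None:
        raise ValueError("unexpected title format: %r" % normalized)
    place, title, year = match.groups()
    return place, title, year


print(split_ranked_title("2.\n   The Godfather\n   (1972)"))
```

Raising ValueError on an unexpected format makes a layout change on the page show up as a clear error instead of silently wrong slices.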