How to navigate into and out of web pages when web scraping

Date: 2018-06-20 20:30:45

Tags: python web-scraping beautifulsoup

This is the URL for all movies starting with the letter "A". I want to go into each movie, collect its data, go back, and work through the list until there are no more movies starting with "A"; then move on to the next page for movies starting with "B", and so on until I reach the end of the alphabet: https://usa.newonnetflix.info/catalog/a2z/all/a

from bs4 import BeautifulSoup
import requests
import pandas as pd

r = requests.get('https://usa.newonnetflix.info/catalog/a2z/all/a')
soup = BeautifulSoup(r.text, 'html.parser')
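The same request pattern extends to every letter: the catalog URLs differ only in the trailing character, so they can be generated up front before any scraping. A minimal sketch (no pages are fetched here):

```python
import string

BASE = 'https://usa.newonnetflix.info/catalog/a2z/all/{}'

# One catalog URL per letter a-z; each page lists the films for that letter.
catalog_urls = [BASE.format(letter) for letter in string.ascii_lowercase]

print(catalog_urls[0])    # https://usa.newonnetflix.info/catalog/a2z/all/a
print(len(catalog_urls))  # 26
```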

The following code gets all the data I need from a specific movie's URL. In other words, at this point I have already clicked through to a movie from the previous URL, and this film page is where I scrape all the data I want: https://usa.newonnetflix.info/info/70290905/s

# Here `soup` is assumed to be built from the film page
# (e.g. https://usa.newonnetflix.info/info/70290905/s), not the catalog page above.
title = soup.find_all('article', attrs={'class': 'post infopage'})

for first in title:
    movie = first.find('a')['title'].split(':')[0]  # title text before the first ':'
    category = first.find(attrs={'class': 'genre'}).text
    rating = first.find(attrs={'class': 'ratingsblock'}).text
    # Each "Label:" <strong> tag is followed by a text node; [1:] drops the leading space.
    year = first.find('strong', text='Year:').next_sibling[1:]
    duration = first.find('strong', text='Duration:').next_sibling[1:]
    audio = first.find('strong', text='Audio:').next_sibling[1:]
    subtitles = first.find('strong', text='Subtitles:').next_sibling[1:]
    # The star-rating elements appear at fixed positions; the slices strip the label prefixes.
    netflix = first.find_all(attrs={'class': 'starrating'})[1]['title'][15:]
    imdb = first.find_all(attrs={'class': 'starrating'})[7]['title'][12:]
    movieDB = first.find_all(attrs={'class': 'starrating'})[13]['title'][26:]
    avgRating = first.find_all(attrs={'class': 'starrating'})[19]['title'][15:]
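One fragility worth noting in lookups like the ones above: `find` returns `None` when a field is absent, so chaining `.next_sibling` or `.text` onto it raises `AttributeError` on any page missing that label. A defensive variant of the label lookup, sketched on a small inline HTML snippet (the snippet is illustrative, not the real page):

```python
from bs4 import BeautifulSoup

html = '<article><strong>Year:</strong> 2010<div class="genre">Drama</div></article>'
first = BeautifulSoup(html, 'html.parser').article

def label_value(node, label):
    """Return the text following a <strong>Label:</strong> tag, or None if absent."""
    tag = node.find('strong', string=label)
    return tag.next_sibling.strip() if tag and tag.next_sibling else None

print(label_value(first, 'Year:'))      # '2010'
print(label_value(first, 'Duration:'))  # None: label not present, no exception
```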

I am not sure how to go into each new URL and then return to the previous one so that I can scrape the data for every movie.

1 answer:

Answer 0: (score: 0)

You can loop over all the letters of the alphabet, build the corresponding URL for each letter, and then scrape the resulting content:

import string
import contextlib
import re

import requests
from bs4 import BeautifulSoup as soup
from collections import namedtuple

film = namedtuple('film', ['title', 'tags', 'rating', 'extras', 'personnel', 'ratings'])

def get_film_data(url):
    current_page = soup(requests.get(url).text, 'html.parser')
    # The first <a> carrying a "title" attribute holds the film title.
    title = [i for i in current_page.find_all('a') if 'title' in getattr(i, 'attrs', {})][0].text
    tags = current_page.find('div', {'class': 'genre'}).text.split(', ')
    rating = current_page.find('span', {'class': 'ratingsblock'}).text
    # Split each "Label: value" paragraph into its label and value parts.
    data = [re.findall(r'[\w\W]+(?=:)|(?<=:\s)[\w\W]+', i.text)
            for i in current_page.find_all('p') if getattr(i, 'strong', None)][:7]
    new_data = [{a: b[0]} if b
                else {a: [i.text for i in current_page.find_all('p') if 'strong' not in i.attrs][6]}
                for a, *b in data]
    # Links titled "View all titles <role>" point at the director and cast members.
    director_and_cast = [[re.findall(r'[a-zA-Z]+', i['href'])[0].capitalize(), i.text]
                         for i in current_page.find_all('a')
                         if re.findall(r'View all titles \w+', i.attrs.get('title', ''))]
    ratings = list(filter(lambda x: re.findall(r'\d', x),
                          {i['title'] for i in current_page.find_all('img', {'class': 'starrating'})}))
    return film(title, tags, rating, new_data, director_and_cast, ratings)

@contextlib.contextmanager
def get_page_films(page='a'):
    data = soup(requests.get('https://usa.newonnetflix.info/catalog/a2z/all/{}'.format(page)).text,
                'html.parser')
    # Each catalog entry links to its film page via an <a class="infopop"> element.
    links = [i['href'] for i in data.find_all('a', {'class': 'infopop'})]
    yield [get_film_data('https://usa.newonnetflix.info{}/s'.format(link)) for link in links]

for letter in string.ascii_lowercase:
    with get_page_films(letter) as results:
        print(results)
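Since the question already imports pandas, the scraped results can be collected into a DataFrame instead of only printed. A minimal sketch, using dummy rows in place of live `get_film_data()` calls and a namedtuple mirroring the one in the answer:

```python
from collections import namedtuple

import pandas as pd

film = namedtuple('film', ['title', 'tags', 'rating', 'extras', 'personnel', 'ratings'])

# Dummy rows standing in for real get_film_data() results.
films = [
    film('Example A', ['Drama'], 'PG', [], [], ['7.1/10']),
    film('Example B', ['Comedy'], 'R', [], [], ['6.3/10']),
]

# namedtuple fields map directly onto DataFrame columns.
df = pd.DataFrame(films, columns=film._fields)
print(df[['title', 'rating']])
```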