I am trying to scrape a website using beautifulsoup4 and the requests library

Time: 2021-05-04 15:18:31

Tags: python beautifulsoup

I want to extract the movie name, release year, and running time from this website.

Here is my code:

import requests
from bs4 import BeautifulSoup

URL = 'https://www4.f2movies.to'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

#Trending Movies
Movies = []
Year = []
Length = []

for a in soup.findAll('a', href=True, attrs={'class':"film-detail film-detail-fix"}):
    name=data.find('div', href=True, attrs={'class':'film-name'})
    year=data.find('span', href=True, attrs={'class':'fdi-item'})
    length=data.find('span', href=True, attrs={'class':'fdi-item fdi-duration'})
    Movies.append(name.text)
    Year.append(year.text)
    Length.append(length.text)

print(Movies)
print(Year)
print(Length)

This is the result I get:

(Projects) anildhage@xxx-MacBook-Air WebScrape % python scrape.py
[]
[]
[]
(Projects) anildhage@xxx-MacBook-Air WebScrape % 

Can anyone suggest where I went wrong? TIA

1 answer:

Answer 0 (score: 0)

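Some of the selectors you pass to findAll() and find() are incorrect. The movie containers are div elements, not a tags, and the title sits in an h3, not a div. Passing href=True only matches tags that carry an href attribute, which div and span elements normally do not have. The loop also binds each match to a but then reads from data. To get all the data, use the following example: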

import requests
from bs4 import BeautifulSoup

URL = "https://www4.f2movies.to"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

# Trending Movies
Movies = []
Year = []
Length = []

# Each movie card is a <div> carrying both classes; find_all is the
# current spelling of the older findAll alias.
for data in soup.find_all("div", attrs={"class": "film-detail film-detail-fix"}):
    # The title lives in an <h3>; href=True is dropped from every lookup.
    name = data.find("h3", attrs={"class": "film-name"})
    year = data.find("span", attrs={"class": "fdi-item"})
    length = data.find("span", attrs={"class": "fdi-item fdi-duration"})
    # Skip cards that have no duration element.
    if not length:
        continue

    Movies.append(name.text.strip())
    Year.append(year.text)
    Length.append(length.text)


print(Movies)
print(Year)
print(Length)

Output:

["Tom Clancy's Without Remorse", 'The Mitchells vs. The Machines', 'Mortal Kombat', 'Things Heard & Seen', 'Demon Slayer the Movie: Mugen Train', 'Voyagers', 'Tom & Jerry', 'Godzilla vs. Kong', 'Justice Society: World War II', 'Nomadland', 'The Virtuoso', 'Shadow in the Cloud', 'Nobody', 'Skylines', "Zack Snyder's Justice League", 'Stowaway', '22 vs. Earth', 'The Marksman', 'The Little Things', 'Wonder Woman 1984', 'Raya and the Last Dragon', 'The Father', 'SAS: Red Notice', 'Come True', 'The Lockdown Hauntings', 'The Bike Thief', 'Generation Por Que', 'Adolescents of Chymera', 'The Darkness', 'The Rise of Sir Longbottom', 'Mexican Moon', "She was the Deputy's Wife", '100m Criminal Conviction', 'Percy', 'The Mitchells vs. The Machines', 'Zombie with a Shotgun', 'Things Heard & Seen', 'Golden Arm', 'Bang! Bang!', 'Colors of Love', 'Three Pints and a Rabbi', 'Eat Wheaties!', "Before I'm Dead", '22 vs. Earth', 'The Outside Story', 'Voyagers', 'Ape vs. Monster', 'Pipeline']
['2021', '2021', '2021', '2021', '2020', '2021', '2021', '2021', '2021', '2020', '2021', '2020', '2021', '2020', '2021', '2021', '2021', '2021', '2021', '2020', '2021', '2020', '2021', '2021', '2021', '2020', '2021', '2021', '2021', '2021', '2021', '2021', '2021', '2021', '2021', '2019', '2021', '2021', '2020', '2021', '2021', '2020', '0000', '2021', '2021', '2021', '2021', '2021']
['109m', '113m', '110m', '121m', '117m', '108m', '90m', '113m', 'N/A', '108m', '105m', '83m', '92m', '110m', '242m', '116m', '5m', '108m', '127m', '151m', '112m', '97m', '120m', '105m', '101m', '79m', 'N/A', '81m', 'N/A', '73m', '84m', '95m', '92m', '109m', '113m', '79m', '121m', '90m', '71m', '110m', '85m', 'N/A', '83m', '5m', '85m', '108m', '90m', '85m']
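
If you want to check the selectors without hitting the site, here is a minimal offline sketch. The HTML in SAMPLE_HTML is made up by me and only modelled on the class names used above (the fd-infor wrapper, the URLs, and the "Some Series" entry are assumptions), so the real page may be structured differently. It shows why the selectors from the question return an empty list, and an alternative way to do the same extraction with CSS selectors via select() and select_one(). The helper name extract_movies is hypothetical, not part of any library.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the class names come from the code above.
SAMPLE_HTML = """
<div class="film-detail film-detail-fix">
  <h3 class="film-name"><a href="/movie/1">Mortal Kombat</a></h3>
  <div class="fd-infor">
    <span class="fdi-item">2021</span>
    <span class="fdi-item fdi-duration">110m</span>
  </div>
</div>
<div class="film-detail film-detail-fix">
  <h3 class="film-name"><a href="/tv/2">Some Series</a></h3>
  <div class="fd-infor">
    <span class="fdi-item">SS 1</span>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
card_class = {"class": "film-detail film-detail-fix"}

# Selector from the question: no <a> tag has the container class.
wrong_tag = soup.find_all("a", href=True, attrs=card_class)

# Right tag, but href=True still rejects it: the <div> has no href attribute.
wrong_href = soup.find_all("div", href=True, attrs=card_class)

# Corrected selector: <div> with the container class, no href filter.
cards = soup.find_all("div", attrs=card_class)

print(len(wrong_tag), len(wrong_href), len(cards))


def extract_movies(html):
    """Return one dict per movie card, using CSS selectors instead of find()."""
    parsed = BeautifulSoup(html, "html.parser")
    movies = []
    for card in parsed.select("div.film-detail"):
        name = card.select_one("h3.film-name")
        year = card.select_one("span.fdi-item")        # first fdi-item span
        length = card.select_one("span.fdi-duration")
        # Skip cards with a missing field instead of failing on None.text.
        if name is None or year is None or length is None:
            continue
        movies.append({
            "name": name.get_text(strip=True),
            "year": year.get_text(strip=True),
            "length": length.get_text(strip=True),
        })
    return movies


print(extract_movies(SAMPLE_HTML))
```

Keeping the three fields together in one dict per movie, rather than in three parallel lists, means a skipped card can never leave the lists out of step with each other.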