I ran similar code on another site and it worked there, but on opensubtitles.org I'm having problems! I don't know why it fails to pick up the href (the link I need) and the title.
import requests
from bs4 import BeautifulSoup

URL = 'https://www.opensubtitles.org/it/search/sublanguageid-eng/searchonlymovies-on/genre-horror/movielanguage-english/moviecountry-usa/subformat-srt/hd-on/offset-4040'

def scarica_pagina(link):
    page = requests.get(link)
    soup = BeautifulSoup(page.text, 'lxml')
    cnt = 0
    for film in soup.find(id="search_results").find_all("td"):
        cnt = cnt + 1
        link = film.find("a")["href"]
        title = film.find("a").text
        #genres = film.find("i").text
        print(link)

if __name__ == '__main__':
    scarica_pagina(URL)
Answer 0 (score: 0)
You just need to follow the DOM correctly:
1 - first select the table with id='search_results'
2 - find all the td tags whose class name is 'sb_star_odd' or 'sb_star_even'
3 - take find_all('a')[0]['href'] for each link you want
4 - take find_all('a')[0].text for each title you want
import re  # needed for the class-name pattern below

import requests
from bs4 import BeautifulSoup

URL = 'https://www.opensubtitles.org/it/search/sublanguageid-eng/searchonlymovies-on/genre-horror/movielanguage-english/moviecountry-usa/subformat-srt/hd-on/offset-4040'

def scarica_pagina(link):
    page = requests.get(link)
    soup = BeautifulSoup(page.text, 'lxml')
    cnt = 0
    # only the result rows: td tags whose class starts with 'sb_star'
    # ('sb_star_odd' or 'sb_star_even')
    for film in soup.find(id="search_results").find_all('td', class_=re.compile('^sb_star')):
        cnt = cnt + 1
        link = film.find_all('a')[0]['href']
        title = film.find_all('a')[0].text
        print(link)

if __name__ == '__main__':
    scarica_pagina(URL)
Your problem was caused by using find where you needed find_all.
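To make the find vs find_all distinction concrete, here is a minimal sketch against a made-up HTML snippet (not the real opensubtitles markup): find() stops at the first match, while find_all() returns every matching tag, which is what a scraping loop needs.

```python
from bs4 import BeautifulSoup

# Toy markup imitating the structure of the search_results table.
html = """
<table id="search_results">
  <tr><td class="sb_star_odd"><a href="/sub/1">Movie One</a></td></tr>
  <tr><td class="sb_star_even"><a href="/sub/2">Movie Two</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns only the FIRST matching tag (or None if nothing matches).
first = soup.find(id="search_results").find("td")
print(first.find("a")["href"])  # only /sub/1

# find_all() returns a list of every match, so the loop sees all rows.
links = [td.find("a")["href"]
         for td in soup.find(id="search_results").find_all("td")]
print(links)  # ['/sub/1', '/sub/2']
```

If find() returns None (no match), chaining .text or ['href'] onto it raises an AttributeError, which is another common symptom of this mistake.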