Web Scraping: href not recognized

Date: 2018-05-30 16:08:25

Tags: python web-scraping

I ran similar code on another site and it worked fine, but on opensubtitles.org I'm running into a problem. I don't understand why it can't find the href (the link I need) or the title.

import requests 
from bs4 import BeautifulSoup
URL = 'https://www.opensubtitles.org/it/search/sublanguageid-eng/searchonlymovies-on/genre-horror/movielanguage-english/moviecountry-usa/subformat-srt/hd-on/offset-4040'

def scarica_pagina(link):
    page = requests.get(link)
    soup = BeautifulSoup(page.text, 'lxml')
    cnt=0
    for film in soup.find(id="search_results").find_all("td"):
        cnt=cnt+1
        link = film.find("a")["href"]
        title = film.find("a").text
        #genres = film.find("i").text
        print(link)

if __name__ == '__main__':
    scarica_pagina(URL)

1 Answer:

Answer 0 (score: 0)

You just need to follow the DOM correctly:

1 - First select the table with id='search_results'.
2 - Find all td tags whose class is 'sb_star_odd' or 'sb_star_even'.
3 - find_all('a')[0]['href'] gives the link you want.
4 - find_all('a')[0].text gives the title you want.

import re
import requests
from bs4 import BeautifulSoup
URL = 'https://www.opensubtitles.org/it/search/sublanguageid-eng/searchonlymovies-on/genre-horror/movielanguage-english/moviecountry-usa/subformat-srt/hd-on/offset-4040'

def scarica_pagina(link):
    page = requests.get(link)
    soup = BeautifulSoup(page.text, 'lxml')
    cnt = 0
    # only cells whose class starts with 'sb_star' (sb_star_odd / sb_star_even)
    # contain the title link we are after
    for film in soup.find(id="search_results").find_all('td', class_=re.compile('^sb_star')):
        cnt = cnt + 1
        link = film.find_all('a')[0]['href']
        title = film.find_all('a')[0].text
        print(link)

if __name__ == '__main__':
    scarica_pagina(URL)

Using find instead of find_all is what caused your problem.
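
More precisely, the search results table contains td cells that hold no link at all; on those, film.find("a") returns None and film.find("a")["href"] raises a TypeError, so filtering by class (or checking for None) is what avoids the crash. Below is a minimal standalone sketch of how find and find_all differ on the same cells; the HTML snippet is made up purely for illustration and is not the real opensubtitles.org markup.

from bs4 import BeautifulSoup

html = """
<table id="search_results">
  <tr>
    <td class="sb_star_odd"><a href="/movie-1">Movie One</a></td>
    <td>no link in this cell</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, 'lxml')

for td in soup.find(id="search_results").find_all("td"):
    first_a = td.find("a")      # first <a> tag, or None if the cell has no link
    all_a = td.find_all("a")    # always a list, possibly empty
    if first_a is not None:
        print(first_a["href"], first_a.text)
    # On the second td, first_a["href"] would raise TypeError (None is not
    # subscriptable) and all_a[0] would raise IndexError (empty list),
    # which is why restricting the loop to the 'sb_star_*' cells works.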