BeautifulSoup doesn't find the full link

Asked: 2019-09-19 00:38:43

Tags: python-3.x web-scraping beautifulsoup

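When I try to grab the links on a web page, bs4 does not capture the whole link; it stops just before the **?ref**... part.
Let me explain the problem with code: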

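import requests
from bs4 import BeautifulSoup

imdb_link = "https://www.imdb.com/chart/top?ref_=nv_mv_250"
site = requests.get(imdb_link)
soup = BeautifulSoup(site.text, 'lxml')

# Print the href of the first <a> inside every "titleColumn" cell of the chart table
for items in soup.find("table", class_="chart").find_all(class_="titleColumn"):
    link = items.find("a").get('href')
    print(link)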

The output is:

/title/tt0111161/
/title/tt0068646/
/title/tt0071562/
/title/tt0468569/
/title/tt0050083/
/title/tt0108052/
/title/tt0167260/
...and so on..

But this is wrong, as you can see by looking at the web page, where the links should be:

/title/tt0111161/?ref_=adv_li_tt
/title/tt0068646/?ref_=adv_li_tt
...and so on...

How can I get the whole link, I mean including the ?ref_=adv_li_tt part?

I am using Python 3.7.4.

1 Answer:

Answer 0: (score: 0)

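Overall, it might be interesting to work out how to get the full link. I think you would need Selenium so that the JavaScript on the page can run; without it, the full links on the page are not rendered. What you already have will serve you fine, as long as you add the prefix https://www.imdb.com.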

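import requests
from bs4 import BeautifulSoup as bs

# Reuse one session for the chart page and every film page
with requests.Session() as s:
    r = s.get('https://www.imdb.com/chart/top?ref_=nv_mv_25')
    soup = bs(r.content, 'lxml')
    # The hrefs are site-relative, so prefix the site root to make them absolute
    links = ['https://www.imdb.com' + i['href'] for i in soup.select('.titleColumn a')]

    # Fetch each film page and print the text of its <title> element
    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        print(soup.select_one('title').text)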
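
To check whether the query strings are present in the HTML the server sends at all, one option is to list the raw href values with the standard library's html.parser, independently of bs4. This is only a minimal sketch: HrefCollector, collect_hrefs and sample_html are hypothetical names, and the sample markup is a made-up stand-in for site.text, not real IMDb output.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag, as written in the markup."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def collect_hrefs(html):
    """Return all <a href="..."> values found in the given HTML string."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Hypothetical stand-in for site.text; one link has a query string, one does not
sample_html = '''
<td class="titleColumn"><a href="/title/tt0111161/">Film A</a></td>
<td class="titleColumn"><a href="/title/tt0068646/?ref_=adv_li_tt">Film B</a></td>
'''

hrefs = collect_hrefs(sample_html)
with_query = [h for h in hrefs if "?" in h]
print(hrefs)
print(with_query)
```

If with_query comes back empty for the real response text, the served HTML simply does not contain the ?ref_ part, which would be consistent with it being added later in the browser.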

You can have Selenium load the page so that the content renders, then pass it to bs4 to get the links on the page:

from selenium import webdriver
from bs4 import BeautifulSoup as bs

d = webdriver.Chrome()
d.get('https://www.imdb.com/chart/top?ref_=nv_mv_25')
soup = bs(d.page_source, 'lxml')
d.quit()
links = ['https://www.imdb.com' + i['href'] for i in soup.select('.titleColumn a')]
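
If the ?ref_=... part is really wanted and its value is already known, another option is to build it yourself from the short hrefs instead of rendering the page. The sketch below assumes that ref_ is an ordinary query parameter and that adv_li_tt (the value quoted in the question) is the one you want; with_ref and BASE_URL are hypothetical names, and only urllib.parse from the standard library is used.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

BASE_URL = "https://www.imdb.com"  # site root used to resolve relative hrefs


def with_ref(href, ref_value, base_url=BASE_URL):
    """Make href absolute and set its ref_ query parameter to ref_value."""
    absolute = urljoin(base_url, href)
    parts = urlsplit(absolute)
    # Keep any existing query parameters, overriding only ref_
    query = dict(parse_qsl(parts.query))
    query["ref_"] = ref_value
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_ref("/title/tt0111161/", "adv_li_tt"))
# prints https://www.imdb.com/title/tt0111161/?ref_=adv_li_tt
```

Using urljoin rather than string concatenation also copes with hrefs that are already absolute.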