When I try to get links from a web page, bs4 does not capture the whole link; it stops just before the ?ref_... part.
Let me explain the problem with code:
import requests
from bs4 import BeautifulSoup

imdb_link = "https://www.imdb.com/chart/top?ref_=nv_mv_250"
site = requests.get(imdb_link)
soup = BeautifulSoup(site.text, 'lxml')
# Print the href of the anchor inside every title cell of the chart table
for items in soup.find("table", class_="chart").find_all(class_="titleColumn"):
    link = items.find("a").get('href')
    print(link)
The output is:
/title/tt0111161/
/title/tt0068646/
/title/tt0071562/
/title/tt0468569/
/title/tt0050083/
/title/tt0108052/
/title/tt0167260/
...and so on..
But this is wrong. As you can see by looking at the web page, the links should be:
/title/tt0111161/?ref_=adv_li_tt
/title/tt0068646/?ref_=adv_li_tt
...and so on...
How can I get the whole link, that is, including ?ref_=adv_li_tt?
I am using Python 3.7.4.
Answer 0 (score: 0)
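Working out how the full link is produced might be interesting in its own right. I think you would need selenium so that the JavaScript on the page can run, but you do not need the fully rendered links. What you already have works perfectly well once you add the prefix https://www.imdb.com.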
import requests
from bs4 import BeautifulSoup as bs

with requests.Session() as s:
    r = s.get('https://www.imdb.com/chart/top?ref_=nv_mv_25')
    soup = bs(r.content, 'lxml')
    # Prefix each relative href with the site root to get an absolute URL
    links = ['https://www.imdb.com' + i['href'] for i in soup.select('.titleColumn a')]
    # Visit each film page and print its <title> to show the links resolve
    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        print(soup.select_one('title').text)
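As a variation on the string concatenation above, the prefix can also be added with `urllib.parse.urljoin` from the standard library, which leaves hrefs that are already absolute untouched. This is only a minimal sketch: the helper name `to_absolute` is hypothetical, and the sample hrefs are copied from the output in the question rather than fetched from the site.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.imdb.com"


def to_absolute(hrefs, base=BASE_URL):
    """Resolve each (possibly relative) href against the site root."""
    return [urljoin(base, href) for href in hrefs]


# Sample relative hrefs taken from the question's output.
relative_links = ["/title/tt0111161/", "/title/tt0068646/"]
absolute_links = to_absolute(relative_links)

for link in absolute_links:
    print(link)
```

In the scraper itself, the list passed to `to_absolute` would be the `href` values collected from `soup.select('.titleColumn a')`.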
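Alternatively, you can let selenium load the page so that the content is rendered, then pass the result to bs4 to get the links on the page: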
from selenium import webdriver
from bs4 import BeautifulSoup as bs
d = webdriver.Chrome()
d.get('https://www.imdb.com/chart/top?ref_=nv_mv_25')
soup = bs(d.page_source, 'lxml')
d.quit()
links = ['https://www.imdb.com' + i['href'] for i in soup.select('.titleColumn a')]
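If the `?ref_=adv_li_tt` suffix from the question is wanted on links that were scraped without it, one option is to append it by hand. The sketch below uses only `urllib.parse`; the helper name `with_ref` is hypothetical, the default value `adv_li_tt` is simply the one quoted in the question, and nothing here confirms that IMDb needs the parameter for the page to load.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_ref(url, ref="adv_li_tt"):
    """Return url with its ref_ query parameter set to ref (assumed value)."""
    parts = urlsplit(url)
    # Keep any existing query parameters and add or overwrite ref_.
    query = dict(parse_qsl(parts.query))
    query["ref_"] = ref
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_ref("https://www.imdb.com/title/tt0111161/"))
print(with_ref("/title/tt0068646/"))
```

The function works on both absolute and relative links, so it can be applied either before or after the `https://www.imdb.com` prefix is added.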