Question

I am using BeautifulSoup4 and requests to scrape an NBA site.

from bs4 import BeautifulSoup
import requests

r = requests.get('http://www.nba.com/games/20111225/BOSNYK/boxscore.html')
soup = BeautifulSoup(r.text)

When that URL is entered into a browser, it actually leads to 'http://www.nba.com/games/20111225/BOSNYK/gameinfo.html#nbaGIboxscore', and I think following that redirect is the correct way to simulate what the browser does.

The problem is that I don't know the keyword for this kind of behavior, and I haven't been able to find a solution online.

Answer 1

You can use a regex or bs4 to find the URL the page redirects to, and then use requests to fetch that URL.

For example:

import bs4
import requests

original_url = 'http://www.nba.com/games/20111225/BOSNYK/'
old_suffix = 'boxscore.html'
r = requests.get(original_url + old_suffix)
site_content = bs4.BeautifulSoup(r.text, 'lxml')
# This assumes the first <meta> tag on the page is the one carrying the redirect.
meta = site_content.find_all('meta')[0]
meta_content = meta.attrs.get('content')
# Skip the 6-character prefix of the content attribute to keep only the new suffix.
new_suffix = meta_content[6:]
new_url_to_scrape = original_url + new_suffix

Then scrape new_url_to_scrape. Enjoy!
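Since a regex was mentioned as an alternative, here is a minimal sketch of that route. It assumes the redirect is done with a meta refresh tag of the usual form (http-equiv="refresh" with content="N;url=TARGET"); I have not checked the actual markup of the page, and the helper name find_meta_refresh_target is made up for this illustration. It relies only on the standard library, so it works on any HTML string you already have.

```python
import re
from urllib.parse import urljoin

# Matches a tag like <meta http-equiv="refresh" content="0;url=TARGET">.
# Assumption: http-equiv comes before content and the target is not
# wrapped in its own quotes; real pages may differ.
META_REFRESH = re.compile(
    r'<meta[^>]*http-equiv=["\']?refresh["\']?[^>]*'
    r'content=["\']\s*\d+\s*;\s*url=([^"\'>]+)["\']',
    re.IGNORECASE,
)


def find_meta_refresh_target(html, base_url):
    """Return the absolute URL a meta refresh tag points to, or None."""
    match = META_REFRESH.search(html)
    if match is None:
        return None
    # urljoin resolves a relative target against the URL of the page itself.
    return urljoin(base_url, match.group(1).strip())
```

With requests, this could be called as find_meta_refresh_target(r.text, r.url), and a non-None result passed to a second requests.get call. Compared with slicing a fixed number of characters, it returns None when no refresh tag is found instead of producing a wrong URL.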

Scraping a redirected site using requests and BeautifulSoup
