I wrote some code that extracts all the links from a given URL. I got the idea from an online video tutorial. It worked fine when I tried it with nytimes.com, but when I tried yell.com it threw an error: "Error: HTTP Error 416: Requested Range Not Satisfiable - http://www.yell.com/". What technique should I use to get around this?
import urllib.parse
import urllib.request
import urllib.error

from bs4 import BeautifulSoup

# url = "http://nytimes.com"
url = "http://www.yell.com/"
urls = [url]       # queue of pages still to crawl
visited = [url]    # pages already seen

while len(urls) > 0:
    try:
        htmltext = urllib.request.urlopen(urls[0]).read()
        soup = BeautifulSoup(htmltext, "html.parser")
        urls.pop(0)
        print(len(urls))
        for tag in soup.find_all('a', href=True):
            # resolve relative links against the base URL
            tag['href'] = urllib.parse.urljoin(url, tag['href'])
            if url in tag['href'] and tag['href'] not in visited:
                urls.append(tag['href'])
                visited.append(tag['href'])
    except urllib.error.HTTPError as e:
        print("Error: " + str(e) + " - " + urls[0])
        urls.pop(0)  # drop the failing URL so the loop does not retry it forever

print(visited)
Answer 0 (score: 0):
What is happening here is that yell.com is detecting irregular activity. If you try scraping it visually with Selenium instead, which loads the JavaScript the way a real browser does:
import time

from selenium import webdriver
from bs4 import BeautifulSoup

url = "http://www.yell.com/"

driver = webdriver.Firefox()
driver.get(url)
driver.set_window_position(0, 0)
driver.set_window_size(100000, 200000)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)  # wait for the page to load

# At this point, if you look at the Firefox window that opened, you will see the blocking message.
# Anyway, if you manage to get past that blocking, you can load BeautifulSoup this way:
soup = BeautifulSoup(driver.page_source, "html.parser")
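From there, the link-extraction logic from the question can be reused on the JavaScript-rendered page. A minimal sketch, assuming urllib.parse is imported and the url, urls and visited variables from the question's code are still in scope:

# Reuse the question's link extraction on the Selenium-rendered page
for tag in soup.find_all('a', href=True):
    absolute = urllib.parse.urljoin(url, tag['href'])  # resolve relative links
    if url in absolute and absolute not in visited:
        urls.append(absolute)
        visited.append(absolute)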