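I am trying to collect every URL in the range whose page contains either the phrase "Recipe adapted from" or "Recipe from". The script below writes matching links to the file up to about ID 7496, then fails with HTTPError 404. What am I doing wrong? I have also tried BeautifulSoup and requests, but I still can't get it to work.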
import urllib2
with open('recipes.txt', 'w+') as f:
    for i in range(14477):
        url = "http://www.tastingtable.com/entry_detail/{}".format(i)
        page_content = urllib2.urlopen(url).read()
        if "Recipe adapted from" in page_content:
            print url
            f.write(url + '\n')
        elif "Recipe from" in page_content:
            print url
            f.write(url + '\n')
        else:
            pass
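Answer 0 (score: 1)
Some of the URLs you are trying to scrape do not exist, so urlopen raises an HTTPError for them and the loop stops. Catch the exception and skip those pages: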
import urllib2
with open('recipes.txt', 'w+') as f:
    for i in range(14477):
        url = "http://www.tastingtable.com/entry_detail/{}".format(i)
        try:
            page_content = urllib2.urlopen(url).read()
        except urllib2.HTTPError as error:
            # 4xx means the client asked for something unavailable
            if 400 <= error.code < 500:
                continue  # not found, unauthorized, etc.
            raise  # other errors we want to know about
        if "Recipe adapted from" in page_content or "Recipe from" in page_content:
            print url
            f.write(url + '\n')
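For anyone on Python 3, where urllib2 is split into urllib.request and urllib.error, the same skip-on-4xx idea might look roughly like the sketch below. The names BASE_URL, MARKERS, fetch, find_recipe_urls and save_recipe_urls are made up for illustration and are not part of the original answer. Decoding the page as UTF-8 is an assumption about the site.

```python
import urllib.error
import urllib.request

# Hypothetical constants, taken from the URL pattern and phrases in the question.
BASE_URL = "http://www.tastingtable.com/entry_detail/{}"
MARKERS = ("Recipe adapted from", "Recipe from")


def fetch(url):
    """Download a page and return its body as text (UTF-8 assumed)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def find_recipe_urls(ids, fetch_page=fetch):
    """Yield URLs whose page contains one of MARKERS, skipping 4xx pages."""
    for i in ids:
        url = BASE_URL.format(i)
        try:
            page_content = fetch_page(url)
        except urllib.error.HTTPError as error:
            if 400 <= error.code < 500:
                continue  # not found, unauthorized, etc.
            raise  # other errors we want to know about
        if any(marker in page_content for marker in MARKERS):
            yield url


def save_recipe_urls(path, ids, fetch_page=fetch):
    """Write each matching URL to `path`, one per line."""
    with open(path, "w") as f:
        for url in find_recipe_urls(ids, fetch_page):
            print(url)
            f.write(url + "\n")
```

Calling save_recipe_urls("recipes.txt", range(14477)) would reproduce the original loop. Passing the fetch function as a parameter is only a convenience so the filtering logic can be tried without hitting the network.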