I am trying to scrape the list of websites in a text file called "tastyrecipes". I currently have a for loop that yields the URLs, but I can't figure out how to pass them into requests.get() without getting a 404 error. Requested individually, the sites return a 200 status code and the HTML looks fine.
I have tried string formatting:
with open('tastyrecipes', 'r') as f:
    for i in f:
        source = requests.get("{0}".format(i))
but this did not change the result.
import requests
from bs4 import BeautifulSoup

with open('tastyrecipes', 'r') as f:
    new_file = open("recipecorpus.txt", "a+")
    for i in f:
        source = requests.get(i)
        content = source.content
        soup = BeautifulSoup(content, 'lxml')
        list_object = soup.find('ol', class_='prep-steps list-unstyled xs-text-3')
        method = list_object.text
        new_file.write(method)
    new_file.close()
I expected this to scrape each URL in the text file in turn, but it returns a 404 error.
Answer 0 (score: 1)
The lines i in the file f carry trailing newlines, which are not part of any normal URL. You need to strip the newline with i = i.rstrip('\r\n') before passing i to requests.get().
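A minimal sketch of that fix, with the URLs inlined as plain strings so no file or network access is needed:

```python
# Lines read from a text file keep their trailing line ending; a URL
# with '\n' appended is a different URL (typically a 404).
# rstrip('\r\n') removes the line ending without touching the URL itself.
lines = ["https://tasty.co/recipe/brigadeiros\n",
         "https://tasty.co/recipe/deep-fried-ice-cream-dogs\r\n"]

cleaned = [i.rstrip('\r\n') for i in lines]
print(cleaned)
# ['https://tasty.co/recipe/brigadeiros',
#  'https://tasty.co/recipe/deep-fried-ice-cream-dogs']
```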
Answer 1 (score: 0)
I could not find any problem with requests.get itself.
import requests

recipes = ['https://tasty.co/recipe/deep-fried-ice-cream-dogs',
           'https://tasty.co/recipe/fried-shrimp-and-mango-salsa-hand-rolls',
           'https://tasty.co/recipe/brigadeiros']

print(list(map(requests.get, recipes)))
# [<Response [200]>, <Response [200]>, <Response [200]>]

for recipe in recipes:
    print(requests.get(recipe))
# <Response [200]>
# <Response [200]>
# <Response [200]>
If the URLs were incorrect, that would be a reasonable explanation: stray \n characters and spaces in the tastyrecipes file, which is what @jwodder suggested.
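To see why the hand-written list above works while the file loop fails, the tastyrecipes file can be simulated with io.StringIO (a sketch; the URLs are taken from the list above):

```python
import io

# Simulate the tastyrecipes file: one URL per line.
fake_file = io.StringIO(
    "https://tasty.co/recipe/brigadeiros\n"
    "https://tasty.co/recipe/fried-shrimp-and-mango-salsa-hand-rolls\n"
)

for i in fake_file:
    # Iterating a file yields each line ending too, so i is not a clean URL.
    assert i.endswith('\n')
    url = i.strip()   # strip() removes the newline and any stray spaces
    print(repr(i), '->', repr(url))
```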
Answer 2 (score: 0)
First, check whether the URL is valid:
from urllib.parse import urlsplit  # Python 2: from urlparse import urlsplit

def is_valid_url(url=''):
    url_parts = urlsplit(url)
    return url_parts.scheme and url_parts.netloc and url_parts.path
with open('tastyrecipes', 'r') as f:
    new_file = open("recipecorpus.txt", "a+")
    for i in f:
        if is_valid_url(i):
            source = requests.get(i)
            content = source.content
            soup = BeautifulSoup(content, 'lxml')
            list_object = soup.find('ol', class_='prep-steps list-unstyled xs-text-3')
            method = list_object.text
            new_file.write(method)
    new_file.close()
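A quick standalone check of the helper (assuming Python 3, where urlsplit lives in urllib.parse; note it only tests the URL's structure, not whether the page is reachable):

```python
from urllib.parse import urlsplit

def is_valid_url(url=''):
    # A usable URL needs at least a scheme, a host, and a path.
    parts = urlsplit(url)
    return bool(parts.scheme and parts.netloc and parts.path)

print(is_valid_url('https://tasty.co/recipe/brigadeiros'))  # True
print(is_valid_url('not a url'))                            # False
```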