I am trying to scrape the list of websites in a text file called "tastyrecipes". I currently have a for loop that yields the URLs, but I can't figure out how to pass them into requests.get() without getting a 404 error. Requested individually, the sites return a 200 status code and the HTML looks fine.
I have tried string formatting:
with open('tastyrecipes', 'r') as f:
    for i in f:
        source = requests.get("{0}".format(i))
but this did not change the result.
import requests
from bs4 import BeautifulSoup

with open('tastyrecipes', 'r') as f:
    new_file = open("recipecorpus.txt", "a+")
    for i in f:
        source = requests.get(i)
        content = source.content
        soup = BeautifulSoup(content, 'lxml')
        list_object = soup.find('ol', class_='prep-steps list-unstyled xs-text-3')
        method = list_object.text
        new_file.write(method)
    new_file.close()
I expected this to scrape each URL in the text file in turn, but it returns a 404 error.
Answer 0 (score: 1)
The lines i in the file f carry trailing newlines, which are not part of any normal URL. You need to strip the newline with i = i.rstrip('\r\n') before passing i to requests.get().
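A minimal sketch of that fix, with the URLs inlined as plain strings so no file or network access is needed:

```python
# Lines read from a text file keep their trailing line ending; a URL
# with '\n' appended is a different URL (typically a 404).
# rstrip('\r\n') removes the line ending without touching the URL itself.
lines = ["https://tasty.co/recipe/brigadeiros\n",
         "https://tasty.co/recipe/deep-fried-ice-cream-dogs\r\n"]

cleaned = [i.rstrip('\r\n') for i in lines]
print(cleaned)
# ['https://tasty.co/recipe/brigadeiros',
#  'https://tasty.co/recipe/deep-fried-ice-cream-dogs']
```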
Answer 1 (score: 0)
I could not find any problem with requests.get itself.
import requests

recipes = ['https://tasty.co/recipe/deep-fried-ice-cream-dogs',
           'https://tasty.co/recipe/fried-shrimp-and-mango-salsa-hand-rolls',
           'https://tasty.co/recipe/brigadeiros']

print(list(map(requests.get, recipes)))
# [<Response [200]>, <Response [200]>, <Response [200]>]

for recipe in recipes:
    print(requests.get(recipe))
# <Response [200]>
# <Response [200]>
# <Response [200]>
If the URLs were incorrect, that would be a reasonable explanation: stray \n characters and spaces in the tastyrecipes file, which is what @jwodder suggested.
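To see why the hand-written list above works while the file loop fails, the tastyrecipes file can be simulated with io.StringIO (a sketch; the URLs are taken from the list above):

```python
import io

# Simulate the tastyrecipes file: one URL per line.
fake_file = io.StringIO(
    "https://tasty.co/recipe/brigadeiros\n"
    "https://tasty.co/recipe/fried-shrimp-and-mango-salsa-hand-rolls\n"
)

for i in fake_file:
    # Iterating a file yields each line ending too, so i is not a clean URL.
    assert i.endswith('\n')
    url = i.strip()   # strip() removes the newline and any stray spaces
    print(repr(i), '->', repr(url))
```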
Answer 2 (score: 0)
First, check whether the URL is valid:
from urllib.parse import urlsplit  # Python 2: from urlparse import urlsplit

def is_valid_url(url=''):
    url_parts = urlsplit(url)
    return url_parts.scheme and url_parts.netloc and url_parts.path
with open('tastyrecipes', 'r') as f:
    new_file = open("recipecorpus.txt", "a+")
    for i in f:
        if is_valid_url(i):
            source = requests.get(i)
            content = source.content
            soup = BeautifulSoup(content, 'lxml')
            list_object = soup.find('ol', class_='prep-steps list-unstyled xs-text-3')
            method = list_object.text
            new_file.write(method)
    new_file.close()
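A quick standalone check of the helper (assuming Python 3, where urlsplit lives in urllib.parse; note it only tests the URL's structure, not whether the page is reachable):

```python
from urllib.parse import urlsplit

def is_valid_url(url=''):
    # A usable URL needs at least a scheme, a host, and a path.
    parts = urlsplit(url)
    return bool(parts.scheme and parts.netloc and parts.path)

print(is_valid_url('https://tasty.co/recipe/brigadeiros'))  # True
print(is_valid_url('not a url'))                            # False
```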