How do I read and open each link with BeautifulSoup, then print out certain data?

Asked: 2017-03-05 09:02:21

Tags: python web-scraping beautifulsoup

Code:

from bs4 import BeautifulSoup
import urllib.request
import sys
import time
import re



for num in range(680):
    address = ('http://www.diabetes.org/mfa-recipes/recipes/recipes-archive.html?page=' + str(num))
    html = urllib.request.urlopen(address).read()
    soup = BeautifulSoup((html), "html.parser")

    for link in soup.findAll('a', attrs={'href': re.compile("/recipes/20")}):
        find = re.compile('/recipes/20(.*?)"')
        searchRecipe = re.search(find, str(link))
        recipe = searchRecipe.group(1)
        urllinks = ('http://www.diabetes.org/mfa-recipes/recipes/20' + str(recipe))
        urllinks = urllinks.replace(" ","")
        outfile = open('C:/recipes/recipe.txt', 'a')
        outfile.write(str(urllinks) + '\n')


f = open('recipe.txt', 'r')
for line in f.readlines():
    id = line.strip('\n')
    url = "urllinks".format(id)

    html_two = urllib.request.urlopen(url).read()
    soup_two = BeautifulSoup((html_two), "html.parser")
    for div in soup.find_all('div', class_='ingredients'):
        print(div.text)
    for div in soup.find_all('div', class_='nutritional_info'):
        print(div.text)
    for div in soup.find_all('div', class_='instructions'):
        print(div.text)

The first part (up to the outfile writes) does work, but the second part does not. I know this because when I run the program it stores all the links, but after that it does not do anything else. In the second part I am trying to open the file "recipe.txt", visit each link, and scrape certain data (ingredients, nutritional_info and instructions).

2 Answers:

Answer 0 (score: 0):

f = open('C:/recipes/recipe.txt', 'r')
for line in f.readlines():
    wholeline = line.strip()
    # url = "urllinks".format(wholeline)  # don't know what this was supposed to do

    html_two = urllib.request.urlopen(wholeline).read()
    soup_two = BeautifulSoup((html_two), "html.parser")
    for div in soup_two.find_all('div', class_='ingredients'):
        print(div.text)
    for div in soup_two.find_all('div', class_='nutritional_info'):
        print(div.text)
    for div in soup_two.find_all('div', class_='instructions'):
        print(div.text)

In your original code you used the same variable twice: soup where it should have been soup_two. And since you had already stripped the line, there was no need to format it.
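
To see why the original second part went nowhere: "urllinks".format(id) formats a template that contains no {} placeholders, so it always returns the literal string "urllinks", never a real URL. A quick sketch (the line value is just a placeholder):

line = 'http://example.com/recipe\n'   # placeholder for a line read from recipe.txt
url = line.strip()                     # already the full URL; nothing left to format
broken = "urllinks".format(url)        # no {} placeholders, so the argument is ignored
print(broken)                          # -> urllinks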

Answer 1 (score: 0):

So I've modified your code a bit. First of all, I suggest using requests instead of urllib, since it is much simpler to use (What are the differences between the urllib, urllib2, and requests module?).
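
For comparison, a minimal sketch of the same GET request in both libraries, using the archive URL from the question:

import urllib.request
import requests

url = 'http://www.diabetes.org/mfa-recipes/recipes/recipes-archive.html?page=0'

# urllib: returns raw bytes; errors surface as urllib.error.HTTPError
html_bytes = urllib.request.urlopen(url).read()

# requests: one call, with convenient status checking and decoding
response = requests.get(url)
response.raise_for_status()   # raise an exception on a 4xx/5xx response
html = response.text          # body decoded to str using the declared charset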

Second, open files with a with statement; then you don't have to worry about closing the file in the right place (What is the python "with" statement designed for?).
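
A minimal sketch of the difference (the file name and content are placeholders):

# with: the file is closed automatically when the block exits,
# even if an exception is raised inside it
with open('recipe.txt', 'a') as outfile:
    outfile.write('http://example.com/recipe' + '\n')

# manual version: close() must be called explicitly; the question's code
# never calls it, so buffered writes may be lost
outfile = open('recipe.txt', 'a')
outfile.write('http://example.com/recipe' + '\n')
outfile.close()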

Third, I think some method names have changed in bs4, so use find_all instead of findAll (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#method-names). I have left these as they were; you can change them yourself.
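
Both spellings still work, because bs4 keeps the old names as backwards-compatibility aliases, but find_all is the documented form. A sketch against the soup object built in the code below:

# these two calls return the same list of tags
links = soup.find_all('a', attrs={'href': re.compile('/recipes/20')})
links = soup.findAll('a', attrs={'href': re.compile('/recipes/20')})   # legacy alias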

One more note: don't use special names like id or find as variable names, because they already have a meaning in Python (for example, id is a built-in function and find is a common string method).
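
For example, rebinding the name id shadows the built-in for the rest of the scope (the string value below is just a placeholder):

print(id('abc'))      # fine: calls the built-in id() function

id = 'recipe-123'     # rebinds the name, shadowing the built-in
# print(id('abc'))    # would now raise TypeError: 'str' object is not callable

Here is the modified code: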

from bs4 import BeautifulSoup
import requests
import sys
import time
import re


with open('file_with_links', 'w+') as f:
    for num in range(680):
        address = 'http://www.diabetes.org/mfa-recipes/recipes/recipes-archive.html?page=' + str(num)
        html = requests.get(address).content
        soup = BeautifulSoup(html, "html.parser")

        for link in soup.findAll('a', attrs={'href': re.compile("/recipes/20")}):
            print(link)
            find_urls = re.compile('/recipes/20(.*?)"')
            searchRecipe = re.search(find_urls, str(link))
            recipe = searchRecipe.group(1)
            urllinks = 'http://www.diabetes.org/mfa-recipes/recipes/20' + str(recipe)
            urllinks = urllinks.replace(" ", "")
            f.write(urllinks + '\n')

with open('file_with_links', 'r') as f:
    for line in f:
        url = line.strip()
        print(url)
        html_two = requests.get(url).content
        soup_two = BeautifulSoup(html_two, "html.parser")
        for div in soup_two.find_all('div', class_='ingredients'):
            print(div.text)
        for div in soup_two.find_all('div', class_='nutritional_info'):
            print(div.text)
        for div in soup_two.find_all('div', class_='instructions'):
            print(div.text)

One more important piece of advice for the future: try to understand every single line of your code and exactly what it is doing.