How do I read and open each link with BeautifulSoup, then print out certain data?

Asked: 2017-03-05 09:02:21

Tags: python web-scraping beautifulsoup

Code:

from bs4 import BeautifulSoup
import urllib.request
import sys
import time
import re



for num in range(680):
    address = ('http://www.diabetes.org/mfa-recipes/recipes/recipes-archive.html?page=' + str(num))
    html = urllib.request.urlopen(address).read()
    soup = BeautifulSoup((html), "html.parser")

    for link in soup.findAll('a', attrs={'href': re.compile("/recipes/20")}):
        find = re.compile('/recipes/20(.*?)"')
        searchRecipe = re.search(find, str(link))
        recipe = searchRecipe.group(1)
        urllinks = ('http://www.diabetes.org/mfa-recipes/recipes/20' + str(recipe))
        urllinks = urllinks.replace(" ","")
        outfile = open('C:/recipes/recipe.txt', 'a')
        outfile.write(str(urllinks) + '\n')


f = open('recipe.txt', 'r')
for line in f.readlines():
    id = line.strip('\n')
    url = "urllinks".format(id)

    html_two = urllib.request.urlopen(url).read()
    soup_two = BeautifulSoup((html_two), "html.parser")
    for div in soup.find_all('div', class_='ingredients'):
        print(div.text)
    for div in soup.find_all('div', class_='nutritional_info'):
        print(div.text)
    for div in soup.find_all('div', class_='instructions'):
        print(div.text)

The first part (up to the outfile writes) does work, but the second part does not. I know this because when I run the program it stores all the links, but after that it does not do anything else. In the second part I am trying to open the file "recipe.txt", visit each link, and scrape certain data (ingredients, nutritional_info and instructions).

2 Answers:

Answer 0 (score: 0):

f = open('C:/recipes/recipe.txt', 'r')
for line in f.readlines():
    wholeline = line.strip()
    # url = "urllinks".format(wholeline)  # don't know what this was supposed to do

    html_two = urllib.request.urlopen(wholeline).read()
    soup_two = BeautifulSoup((html_two), "html.parser")
    for div in soup_two.find_all('div', class_='ingredients'):
        print(div.text)
    for div in soup_two.find_all('div', class_='nutritional_info'):
        print(div.text)
    for div in soup_two.find_all('div', class_='instructions'):
        print(div.text)

In your original code you used the same variable twice: soup where it should have been soup_two. And since you had already stripped the line, there was no need to format it.
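
To see why the original second part went nowhere: "urllinks".format(id) formats a template that contains no {} placeholders, so it always returns the literal string "urllinks", never a real URL. A quick sketch (the line value is just a placeholder):

line = 'http://example.com/recipe\n'   # placeholder for a line read from recipe.txt
url = line.strip()                     # already the full URL; nothing left to format
broken = "urllinks".format(url)        # no {} placeholders, so the argument is ignored
print(broken)                          # -> urllinks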

Answer 1 (score: 0):

So I've modified your code a bit. First of all, I suggest using requests instead of urllib, since it is much simpler to use (What are the differences between the urllib, urllib2, and requests module?).
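
For comparison, a minimal sketch of the same GET request in both libraries, using the archive URL from the question:

import urllib.request
import requests

url = 'http://www.diabetes.org/mfa-recipes/recipes/recipes-archive.html?page=0'

# urllib: returns raw bytes; errors surface as urllib.error.HTTPError
html_bytes = urllib.request.urlopen(url).read()

# requests: one call, with convenient status checking and decoding
response = requests.get(url)
response.raise_for_status()   # raise an exception on a 4xx/5xx response
html = response.text          # body decoded to str using the declared charset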

Second, open files with a with statement; then you don't have to worry about closing the file in the right place (What is the python "with" statement designed for?).
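
A minimal sketch of the difference (the file name and content are placeholders):

# with: the file is closed automatically when the block exits,
# even if an exception is raised inside it
with open('recipe.txt', 'a') as outfile:
    outfile.write('http://example.com/recipe' + '\n')

# manual version: close() must be called explicitly; the question's code
# never calls it, so buffered writes may be lost
outfile = open('recipe.txt', 'a')
outfile.write('http://example.com/recipe' + '\n')
outfile.close()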

Third, I think some method names have changed in bs4, so use find_all instead of findAll (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#method-names). I have left these as they were; you can change them yourself.
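
Both spellings still work, because bs4 keeps the old names as backwards-compatibility aliases, but find_all is the documented form. A sketch against the soup object built in the code below:

# these two calls return the same list of tags
links = soup.find_all('a', attrs={'href': re.compile('/recipes/20')})
links = soup.findAll('a', attrs={'href': re.compile('/recipes/20')})   # legacy alias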

One more note: don't use special names like id or find as variable names, because they already have a meaning in Python (for example, id is a built-in function and find is a common string method).
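
For example, rebinding the name id shadows the built-in for the rest of the scope (the string value below is just a placeholder):

print(id('abc'))      # fine: calls the built-in id() function

id = 'recipe-123'     # rebinds the name, shadowing the built-in
# print(id('abc'))    # would now raise TypeError: 'str' object is not callable

Here is the modified code: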

from bs4 import BeautifulSoup
import requests
import sys
import time
import re


with open('file_with_links', 'w+') as f:
    for num in range(680):
        address = 'http://www.diabetes.org/mfa-recipes/recipes/recipes-archive.html?page=' + str(num)
        html = requests.get(address).content
        soup = BeautifulSoup(html, "html.parser")

        for link in soup.findAll('a', attrs={'href': re.compile("/recipes/20")}):
            print(link)
            find_urls = re.compile('/recipes/20(.*?)"')
            searchRecipe = re.search(find_urls, str(link))
            recipe = searchRecipe.group(1)
            urllinks = 'http://www.diabetes.org/mfa-recipes/recipes/20' + str(recipe)
            urllinks = urllinks.replace(" ", "")
            f.write(urllinks + '\n')

with open('file_with_links', 'r') as f:
    for line in f:
        url = line.strip()
        print(url)
        html_two = requests.get(url).content
        soup_two = BeautifulSoup(html_two, "html.parser")
        for div in soup_two.find_all('div', class_='ingredients'):
            print(div.text)
        for div in soup_two.find_all('div', class_='nutritional_info'):
            print(div.text)
        for div in soup_two.find_all('div', class_='instructions'):
            print(div.text)

One more important piece of advice for the future: try to understand every single line of your code and exactly what it is doing.