How do I parse recipe titles from a website and write them to a text file?

Asked: 2017-03-11 17:11:23

Tags: python web-scraping beautifulsoup

Main code:

from bs4 import BeautifulSoup
import urllib.request
import sys
import time
import re
import bs4 as bs



for num in range(680):
    address = ('http://www.diabetes.org/mfa-recipes/recipes/recipes-archive.html?page=' + str(num))
    html = urllib.request.urlopen(address).read()
    soup = BeautifulSoup((html), "html.parser")

    for link in soup.findAll('a', attrs={'href': re.compile("/recipes/20")}):
        find = re.compile('/recipes/20(.*?)"')
        searchRecipe = re.search(find, str(link))
        recipe = searchRecipe.group(1)
        urllinks = ('http://www.diabetes.org/mfa-recipes/recipes/20' + str(recipe))
        urllinks = urllinks.replace(" ","")
        outfile = open('C:/Users/cody/Desktop/python files/Projects/Scraper/Diabetes/recipe.txt', 'a')
        outfile.write(str(urllinks) + '\n')
        time.sleep(.1)


        f = open('recipe.txt', 'r')
        for line in f.readlines():
            wholeline = line.strip()
            sauce = urllib.request.urlopen(wholeline).read()
            soup = bs.BeautifulSoup(sauce, 'html.parser')
            body = soup.body
            for div in soup.find_all(True, {'class':['recipe_col_2','ingredients','instructions']}):
                outfile = open('C:/Users/cody/Desktop/python files/Projects/Scraper/Diabetes/recipe info.txt', 'a')
                outfile.write(div.text)   

Title code:

import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('http://www.diabetes.org/mfa-recipes/recipes/2017-02-dijon-chicken-and-broccoli-and-noodles.html')
soup = bs.BeautifulSoup(sauce, 'lxml')

body = soup.body
for title in body.find_all('h1'):
    print(title.text)

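How can I take title.text from the title code and integrate it into my main code, so that the title is written to the same text file? I have also run into a second problem: each recipe is written to the text file twice. I don't want any duplicates, so how can I fix that?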

Sample recipe page: http://www.diabetes.org/mfa-recipes/recipes/2017-02-dijon-chicken-and-broccoli-and-noodles.html

1 Answer:

Answer 0: (score: 0)

If you only want to add the h1 title at the start of each recipe entry in recipe_info.txt, this works:

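# select the h1 and get its text once per recipe page
title = soup.select_one('h1').getText()
# open the output file once; the with block closes it when done
with open('recipe_info.txt', 'a') as outfile:
    # write the title first, so it appears once at the start of the recipe
    outfile.write(title + '\n')
    # then write each matching section of the page
    for div in soup.find_all(True, {'class':['recipe_col_2','ingredients','instructions']}):
        outfile.write(div.text)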

This worked when I ran it, but after the first recipe I hit another error: "UnicodeEncodeError: 'ascii' codec can't encode character '\u2013' in position 206: ordinal not in range(128)"

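If you run into this problem as well, you will probably need to encode the text as 'utf-8' when writing and decode it again when reading the file, which I won't go into here.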
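As a rough sketch of that encoding step, one option is to pass encoding='utf-8' to open() so Python handles the encoding and characters such as '\u2013' no longer depend on the platform's default codec. The helper names unique_in_order and append_recipe are hypothetical, not part of the original code. The de-duplication helper assumes the doubled recipes come from the same URL being collected more than once, which the question does not confirm.

```python
def unique_in_order(urls):
    """Return urls with repeats dropped, keeping first-seen order.

    Hypothetical helper: assumes the duplicates are caused by the same
    link being collected more than once.
    """
    seen = set()
    result = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


def append_recipe(path, title, sections):
    """Append one recipe (title first, then its sections) as UTF-8 text.

    Hypothetical helper: path, title and sections are supplied by the caller.
    """
    # An explicit encoding avoids falling back to an ASCII default,
    # which is what fails on non-ASCII characters such as '\u2013'.
    with open(path, 'a', encoding='utf-8') as outfile:
        outfile.write(title.strip() + '\n')
        for text in sections:
            outfile.write(text.strip() + '\n')
```

In the scraper, the second argument could be soup.select_one('h1').getText() and the third a list of div.text values from the find_all call above. Reading the file back with open(path, encoding='utf-8') takes care of the decoding.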