Main code:
from bs4 import BeautifulSoup
import urllib.request
import sys
import time
import re
import bs4 as bs

# Step 1: walk the archive pages and save every recipe URL to recipe.txt
for num in range(680):
    address = ('http://www.diabetes.org/mfa-recipes/recipes/recipes-archive.html?page=' + str(num))
    html = urllib.request.urlopen(address).read()
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.findAll('a', attrs={'href': re.compile("/recipes/20")}):
        find = re.compile('/recipes/20(.*?)"')
        searchRecipe = re.search(find, str(link))
        recipe = searchRecipe.group(1)
        urllinks = ('http://www.diabetes.org/mfa-recipes/recipes/20' + str(recipe))
        urllinks = urllinks.replace(" ", "")
        outfile = open('C:/Users/cody/Desktop/python files/Projects/Scraper/Diabetes/recipe.txt', 'a')
        outfile.write(str(urllinks) + '\n')
    time.sleep(.1)

# Step 2: open each saved URL and append the recipe sections to a second file
f = open('recipe.txt', 'r')
for line in f.readlines():
    wholeline = line.strip()
    sauce = urllib.request.urlopen(wholeline).read()
    soup = bs.BeautifulSoup(sauce, 'html.parser')
    body = soup.body
    for div in soup.find_all(True, {'class': ['recipe_col_2', 'ingredients', 'instructions']}):
        outfile = open('C:/Users/cody/Desktop/python files/Projects/Scraper/Diabetes/recipe info.txt', 'a')
        outfile.write(div.text)
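The duplicates mentioned in the question below most likely come from each archive page containing more than one link per recipe (an assumption about the page layout, not something verified here). A minimal sketch of the link-collection step that keeps a set of URLs already written, so each link is saved only once (it reuses the imports from the main code above):

seen = set()  # URLs already written to recipe.txt
with open('recipe.txt', 'a') as outfile:
    for num in range(680):
        address = 'http://www.diabetes.org/mfa-recipes/recipes/recipes-archive.html?page=' + str(num)
        html = urllib.request.urlopen(address).read()
        soup = BeautifulSoup(html, "html.parser")
        for link in soup.findAll('a', attrs={'href': re.compile("/recipes/20")}):
            match = re.search('/recipes/20(.*?)"', str(link))
            if match is None:
                continue
            url = ('http://www.diabetes.org/mfa-recipes/recipes/20' + match.group(1)).replace(" ", "")
            if url not in seen:  # skip links that were already collected
                seen.add(url)
                outfile.write(url + '\n')
        time.sleep(.1)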
Title code:
import bs4 as bs
import urllib.request
sauce = urllib.request.urlopen('http://www.diabetes.org/mfa-recipes/recipes/2017-02-dijon-chicken-and-broccoli-and-noodles.html')
soup = bs.BeautifulSoup(sauce, 'lxml')
body = soup.body
for title in body.find_all('h1'):
    print(title.text)
How do I take title.text from the title code and integrate it into my main code so that it is written to the same text file? I have also run into another problem: in the text file each recipe is written down twice, and I don't want any duplicates. How can I fix that?
Sample recipe page:
http://www.diabetes.org/mfa-recipes/recipes/2017-02-dijon-chicken-and-broccoli-and-noodles.html
Answer 0 (score: 0)
If you just want to add the h1 title to recipe_info.txt at the start of each recipe item, this works:
for div in soup.find_all(True, {'class':['recipe_col_2','ingredients','instructions']}):
    # select and get text from h1
    title = soup.select_one('h1').getText()
    outfile = open('recipe_info.txt', 'a')
    # write it first
    outfile.write(title)
    outfile.write(div.text)
When I run this it works, but after the first recipe I get another error: "UnicodeEncodeError: 'ascii' codec can't encode character '\u2013' in position 206: ordinal not in range(128)"
If you also run into this, you may need to encode to 'utf-8' and then decode when reading the file, which I won't go into here.
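For completeness, a minimal sketch of the fix hinted at above (assuming Python 3, as in the rest of the code): pass an explicit encoding to open() so characters such as '\u2013' can be written, and read the file back with the same encoding. Here title and div are the variables from the answer's loop.

# Write with an explicit encoding so non-ASCII characters don't raise UnicodeEncodeError
with open('recipe_info.txt', 'a', encoding='utf-8') as outfile:
    outfile.write(title)
    outfile.write(div.text)

# Read the file back with the same encoding
with open('recipe_info.txt', 'r', encoding='utf-8') as infile:
    contents = infile.read()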