Code:
from bs4 import BeautifulSoup
import urllib.request
import sys
import time
import re

for num in range(680):
    address = ('http://www.diabetes.org/mfa-recipes/recipes/recipes-archive.html?page=' + str(num))
    html = urllib.request.urlopen(address).read()
    soup = BeautifulSoup((html), "html.parser")
    for link in soup.findAll('a', attrs={'href': re.compile("/recipes/20")}):
        find = re.compile('/recipes/20(.*?)"')
        searchRecipe = re.search(find, str(link))
        recipe = searchRecipe.group(1)
        urllinks = ('http://www.diabetes.org/mfa-recipes/recipes/20' + str(recipe))
        urllinks = urllinks.replace(" ","")
        outfile = open('C:/recipes/recipe.txt', 'a')
        outfile.write(str(urllinks) + '\n')

f = open('recipe.txt', 'r')
for line in f.readlines():
    id = line.strip('\n')
    url = "urllinks".format(id)
    html_two = urllib.request.urlopen(url).read()
    soup_two = BeautifulSoup((html_two), "html.parser")
    for div in soup.find_all('div', class_='ingredients'):
        print(div.text)
    for div in soup.find_all('div', class_='nutritional_info'):
        print(div.text)
    for div in soup.find_all('div', class_='instructions'):
        print(div.text)
The first part (everything up to and including the outfile writes) works, but the second part does not. I know this because when I run the program it stores all the links but does nothing after that. In the second part I am trying to open the file "recipe.txt", visit each link, and scrape certain data (ingredients, nutritional_info, and instructions).
Answer 0 (score: 0)
f = open('C:/recipes/recipe.txt', 'r')
for line in f.readlines():
    wholeline = line.strip()
    # url = "urllinks".format(wholeline)  # Don't know what this was supposed to do?
    html_two = urllib.request.urlopen(wholeline).read()
    soup_two = BeautifulSoup((html_two), "html.parser")
    for div in soup_two.find_all('div', class_='ingredients'):
        print(div.text)
    for div in soup_two.find_all('div', class_='nutritional_info'):
        print(div.text)
    for div in soup_two.find_all('div', class_='instructions'):
        print(div.text)
In your original code you used the same variable twice: soup where it should have been soup_two. And since you had already stripped the line, there was no need to format it.
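To see why the original line `url = "urllinks".format(id)` never produced a usable URL: str.format only substitutes its arguments into {} placeholders, and a string with no placeholders is returned unchanged, so the argument is silently dropped. A small illustration (the sample URL below is made up):

```python
# str.format substitutes arguments into {} placeholders only.
line = "http://www.diabetes.org/mfa-recipes/recipes/2016-01-example.html\n"

broken = "urllinks".format(line)   # no {} placeholder, so the argument is ignored
print(broken)                      # prints: urllinks

url = line.strip()                 # the stripped line is already the full URL
print(url)
```

This is why the answer simply uses the stripped line directly instead of formatting it.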
Answer 1 (score: 0)
So I have modified your code a bit. First, I suggest using requests instead of urllib, because it is much simpler to use (What are the differences between the urllib, urllib2, and requests module?). Second, open files with a with statement; then you don't have to worry about closing the file at the right place (What is the python "with" statement designed for?). Third, I believe some method names changed in bs4, so use find_all instead of findAll (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#method-names). I have left those as they were; you can change them yourself. One more note: don't use special names like id or find as variable names, because they are reserved for special uses in Python (for example, find is a function).
from bs4 import BeautifulSoup
import requests
import sys
import time
import re

with open('file_with_links', 'w+') as f:
    for num in range(860):
        address = 'http://www.diabetes.org/mfa-recipes/recipes/recipes-archive.html?page=' + str(num)
        html = requests.get(address).content
        soup = BeautifulSoup(html, "html.parser")
        for link in soup.findAll('a', attrs={'href': re.compile("/recipes/20")}):
            print(link)
            find_urls = re.compile('/recipes/20(.*?)"')
            searchRecipe = re.search(find_urls, str(link))
            recipe = searchRecipe.group(1)
            urllinks = 'http://www.diabetes.org/mfa-recipes/recipes/20' + str(recipe)
            urllinks = urllinks.replace(" ", "")
            f.write(urllinks + '\n')

with open('file_with_links', 'r') as f:
    for line in f:
        url = line.strip()
        print(url)
        html_two = requests.get(url).content
        soup_two = BeautifulSoup(html_two, "html.parser")
        for div in soup_two.find_all('div', class_='ingredients'):
            print(div.text)
        for div in soup_two.find_all('div', class_='nutritional_info'):
            print(div.text)
        for div in soup_two.find_all('div', class_='instructions'):
            print(div.text)
One more important suggestion for the future: try to understand every single line of your code and exactly what it is doing.
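In that spirit, one easy robustness improvement (not part of the answers above; the helper name `fetch` is my own) is to guard each download so that a single malformed line in the links file does not abort the whole scrape. A minimal standard-library sketch:

```python
import urllib.request
import urllib.error

def fetch(url, timeout=10):
    """Return the page body as bytes, or None if the URL is bad or unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except (urllib.error.URLError, ValueError) as err:
        # ValueError covers malformed URLs, URLError covers network failures
        print("skipping", url, "->", err)
        return None

# A malformed line is skipped with a warning instead of raising:
print(fetch("not-a-real-url"))   # prints the warning, then None
```

The same try/except shape works with requests (catching requests.exceptions.RequestException) if you follow answer 1's suggestion.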