Here is what I have so far for collecting the links:
from bs4 import BeautifulSoup
import urllib.request
import re
diabetesFile = urllib.request.urlopen("http://www.diabetes.org/mfa-recipes/recipes/recipes-archive.html?referrer=http://www.diabetes.org/mfa-recipes/recipes/")
diabetesHtml = diabetesFile.read()
diabetesFile.close()
soup = BeautifulSoup((diabetesHtml), "html.parser")
for link in soup.findAll('a', attrs={'href': re.compile("/recipes/20")}):
    find = re.compile('/recipes/20(.*?)"')
    searchRecipe = re.search(find, str(link))
    recipe = searchRecipe.group(1)
    print(recipe)
And here is an example for one of the individual pages:
import bs4 as bs
import urllib.request
sauce = urllib.request.urlopen('http://www.diabetes.org/mfa-recipes/recipes/2017-02-dijon-chicken-and-broccoli-and-noodles.html').read()
soup = bs.BeautifulSoup(sauce, 'html.parser')
for div in soup.find_all('div', class_='ingredients'):
    print(div.text)
for div in soup.find_all('div', class_='nutritional_info'):
    print(div.text)
for div in soup.find_all('div', class_='instructions'):
    print(div.text)
My main goal is to use the site from the first piece of code to get all the links from all 680 pages, then visit each one and collect the information extracted in the second piece of code. Finally, I want to write all of this information to a text file. Thanks in advance!
Answer 0 (score: 0)
I'm not going to write the whole scraper for you, but here is an outline of what you can do:
Here are the packages:
from bs4 import BeautifulSoup
import requests  # I use this one instead of urllib
import re
Code to fetch a page:
req = requests.Session()
sauce = req.get('http://www.diabetes.org/mfa-recipes/recipes/2017-02-dijon-chicken-and-broccoli-and-noodles.html').content  # requests responses expose .content/.text, not .read()
soup = BeautifulSoup(sauce, 'lxml')
search_link = re.compile("/recipes/20")  # forward slashes don't need escaping in a regex
all_links_find = soup.find_all("a", href=search_link)
all_links_get = [link.get("href") for link in all_links_find]  # collect the href attribute, not the visible link text
Depending on the href values, you may need to prepend the base URL: if a link already starts with http://thebaseurl/therestOftheLink you don't need to do anything, otherwise do something like this:
all_links = [baseurl + link for link in all_links_get]
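A minimal sketch of that step, assuming the hrefs are either absolute or site-relative; the standard-library urljoin handles both cases, so you don't need to check the prefix yourself (the baseurl value here is an assumption for illustration):

from urllib.parse import urljoin

baseurl = "http://www.diabetes.org"  # assumed base URL for this site
# urljoin leaves absolute links ("http://...") untouched and
# prepends the base URL to relative ones ("/recipes/20...").
all_links = [urljoin(baseurl, link) for link in all_links_get]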
For the recipe pages themselves, you can repeat the same logic with the find method for the divs you already identified, but use get_text("\n", strip=True) instead of get_text(strip=True) so the output prints more readably.
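To tie these hints back to the original goal, here is a minimal sketch that visits each collected recipe link, pulls the three divs named in the question (ingredients, nutritional_info, instructions), and appends their text to a file. It assumes the req session and all_links list from the snippets above already exist; the filename recipes.txt is an assumption for illustration, and covering all 680 archive pages would additionally require following the site's own pagination links, which isn't shown here.

# Sketch only: assumes `req` (the requests.Session) and `all_links`
# (the absolute recipe URLs) from the snippets above.
with open("recipes.txt", "w", encoding="utf-8") as out:
    for url in all_links:
        page = BeautifulSoup(req.get(url).content, "lxml")
        out.write(url + "\n")
        for class_name in ("ingredients", "nutritional_info", "instructions"):
            for div in page.find_all("div", class_=class_name):
                # get_text("\n", strip=True) joins the div's text pieces with
                # newlines instead of running them together
                out.write(div.get_text("\n", strip=True) + "\n")
        out.write("\n")

Writing each recipe as soon as it is scraped (rather than collecting everything in memory first) keeps the script simple and means a crash partway through still leaves you with the pages already processed.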