I am trying to figure out how to extract several pieces of information from the https://www.fda.gov/Safety/Recalls/ website.
import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.fda.gov/Safety/Recalls/")
soup = BeautifulSoup(res.text, "lxml")
for item in soup.select("table td"):
    # Keep only cells whose text mentions an undeclared ingredient
    if "Undeclared" in item.text:
        # The nearest parent is the table row; the brand is its second cell
        brand = item.find_parents()[0].select("td")[1].text
        reason = item.text
        print(brand, reason)
How can I get brand_link from the HTML?
Answer 0 (score: 1)
I think this is the output you expect:
import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.fda.gov/Safety/Recalls/")
soup = BeautifulSoup(res.text, "lxml")
for item in soup.select("table td"):
    # Keep only cells whose text mentions an undeclared ingredient
    if "Undeclared" in item.text:
        # The nearest parent is the table row; the brand is its second cell
        brand = item.find_parents()[0].select("td")[1].text
        reason = item.text
        print(brand, reason)
Partial output:
N/A Undeclared Milk
Colorado Nut Company and various other private labels Undeclared milk
All Natural, Weis, generic Undeclared milk
Dilettante Chocolates Undeclared almonds
Hot Pockets Undeclared egg, milk, soy, and wheat
Figi’s Undeclared Milk
Germack Undeclared Milk
If you also want the link attached to the brand name, you can do the following:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://www.fda.gov/Safety/Recalls/"
res = requests.get(url)
soup = BeautifulSoup(res.text, "lxml")
for item in soup.select("table td"):
    if "Undeclared" in item.text:
        # The nearest parent is the table row; the brand is its second cell
        brand_cell = item.find_parents()[0].select("td")[1]
        brand = brand_cell.text
        # Resolve the href of the first <a> in the brand cell against the page URL
        brand_link = urljoin(url, brand_cell.select("a")[0]["href"])
        reason = item.text
        print("Brand: {}\nBrand_link: {}\nReason: {}\n".format(brand, brand_link, reason))
Output:
Brand: N/A
Brand_link: https://www.fda.gov/Safety/Recalls/ucm587012.htm
Reason: Undeclared Milk
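
Note that `select("a")[0]` raises an IndexError for any brand cell that has no link. Below is a minimal sketch of a more defensive variant, not a definitive implementation. It runs on a small inline sample instead of the live page, so `SAMPLE_HTML`, its three-column layout, and the helper name `extract_recalls` are assumptions made for illustration. It uses the built-in "html.parser" rather than "lxml" only to avoid the extra dependency. Unlike the loop above, it walks rows rather than cells, so each row is reported at most once.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.fda.gov/Safety/Recalls/"

# Hypothetical markup standing in for the live page (assumed columns:
# date, brand, reason), so the sketch runs without network access.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>Brand</th><th>Reason</th></tr>
  <tr><td>01/01/2018</td><td><a href="/Safety/Recalls/ucm587012.htm">N/A</a></td><td>Undeclared Milk</td></tr>
  <tr><td>01/02/2018</td><td>Example Brand</td><td>Undeclared almonds</td></tr>
  <tr><td>01/03/2018</td><td><a href="/Safety/Recalls/example.htm">Other Brand</a></td><td>Salmonella</td></tr>
</table>
"""


def extract_recalls(html, base_url, keyword="Undeclared"):
    """Return one dict per table row that has a cell containing `keyword`."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.select("table tr"):
        cells = row.find_all("td")
        if len(cells) < 2:
            continue  # header rows (only <th>) or rows too short to hold a brand
        # First cell in the row whose text contains the keyword, if any
        reason_cell = next((c for c in cells if keyword in c.get_text()), None)
        if reason_cell is None:
            continue
        brand_cell = cells[1]  # same position the original code relies on
        anchor = brand_cell.find("a", href=True)
        # Fall back to None instead of raising when the brand has no link
        link = urljoin(base_url, anchor["href"]) if anchor else None
        results.append({
            "brand": brand_cell.get_text(strip=True),
            "brand_link": link,
            "reason": reason_cell.get_text(strip=True),
        })
    return results


for record in extract_recalls(SAMPLE_HTML, BASE_URL):
    print(record)
```

To try it against the real page, pass `requests.get(BASE_URL).text` as the first argument; whether the live table still matches this assumed layout would need to be checked.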