How to use BeautifulSoup to get multiple pieces of information from a table on a website

Time: 2017-11-29 19:04:02

Tags: python-3.x web-scraping beautifulsoup

I am trying to figure out how to extract several pieces of information from the page at https://www.fda.gov/Safety/Recalls/. This is what I have so far:

import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.fda.gov/Safety/Recalls/")
soup = BeautifulSoup(res.text, "lxml")

for item in soup.select("table td"):
    if "Undeclared" in item.text:
        brand = item.find_parents()[0].select("td")[1].text
        reason = item.text
        print(brand,reason)

How can I get the brand_link from the HTML?

1 answer:

Answer 0 (score: 1)

I assume this is the output you expect from your script:

import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.fda.gov/Safety/Recalls/")
soup = BeautifulSoup(res.text, "lxml")

for item in soup.select("table td"):
    if "Undeclared" in item.text:
        brand = item.find_parents()[0].select("td")[1].text
        reason = item.text
        print(brand,reason)

Partial output:

N/A   Undeclared Milk
Colorado Nut Company and various other private labels   Undeclared milk
All Natural, Weis, generic   Undeclared milk
Dilettante Chocolates   Undeclared almonds
Hot Pockets   Undeclared egg, milk, soy, and wheat
Figi's   Undeclared Milk
Germack   Undeclared Milk

If you also want the link behind each brand name, you can do the following:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://www.fda.gov/Safety/Recalls/"
res = requests.get(url)
soup = BeautifulSoup(res.text, "lxml")

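for item in soup.select("table td"):
    if "Undeclared" in item.text:
        # The brand name is the second cell of the row that contains the matching cell.
        brand_cell = item.parent.select("td")[1]
        brand = brand_cell.text
        # Resolve the (possibly relative) href against the page URL; None if the cell has no link.
        link = brand_cell.select_one("a[href]")
        brand_link = urljoin(url, link["href"]) if link else None
        reason = item.text
        print("Brand: {}\nBrand_link: {}\nReason: {}\n".format(brand, brand_link, reason))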

Output:

Brand: N/A  
Brand_link: https://www.fda.gov/Safety/Recalls/ucm587012.htm
Reason: Undeclared Milk
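
Since the live page may change or be unavailable, the same idea can be tried offline. The following is only a minimal sketch: the SAMPLE_HTML markup, its brand names and ucm numbers, and the extract_recalls helper are hypothetical, and it assumes (as the code above does) that the brand sits in the second cell of each row. It walks the table row by row instead of cell by cell and uses the built-in "html.parser" so that lxml is not required.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.fda.gov/Safety/Recalls/"

# Hypothetical markup that only imitates the assumed layout of the recalls table.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>Brand</th><th>Description</th><th>Reason</th></tr>
  <tr>
    <td>11/28/2017</td>
    <td><a href="/Safety/Recalls/ucm000001.htm">Example Brand</a></td>
    <td>Cookies</td>
    <td>Undeclared Milk</td>
  </tr>
  <tr>
    <td>11/27/2017</td>
    <td>No Link Brand</td>
    <td>Candy</td>
    <td>Undeclared almonds</td>
  </tr>
  <tr>
    <td>11/26/2017</td>
    <td><a href="ucm000003.htm">Other Brand</a></td>
    <td>Salad</td>
    <td>Listeria</td>
  </tr>
</table>
"""


def extract_recalls(html, base_url, keyword="Undeclared"):
    """Return one dict per table row that has a cell containing `keyword`."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.select("table tr"):
        cells = row.find_all("td")
        if len(cells) < 2:
            continue  # header rows (th only) or rows that are too short
        # First cell in this row whose text mentions the keyword, if any.
        reason_cell = next((c for c in cells if keyword in c.get_text()), None)
        if reason_cell is None:
            continue
        brand_cell = cells[1]  # assumption: the brand is the second column
        anchor = brand_cell.find("a", href=True)
        results.append({
            "brand": brand_cell.get_text(strip=True),
            # urljoin turns a relative href into an absolute URL.
            "brand_link": urljoin(base_url, anchor["href"]) if anchor else None,
            "reason": reason_cell.get_text(strip=True),
        })
    return results


if __name__ == "__main__":
    for recall in extract_recalls(SAMPLE_HTML, BASE_URL):
        print("Brand: {brand}\nBrand_link: {brand_link}\nReason: {reason}\n".format(**recall))
```

To run it against the real page, pass requests.get(url).text as the html argument. Iterating over rows yields one result per recall and leaves brand_link as None when a brand cell has no link, instead of raising an IndexError.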