How to extract the first instance of a nested class with BeautifulSoup

Time: 2017-08-04 07:59:23

Tags: python beautifulsoup

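There are several elements that share the class name "row", and inside each row there are several elements that share the class name "column".

I am trying to iterate over the rows and collect only the first column of each row.

Then I print the contents of the link in that column.

What is the correct way to do this? I tried building a list, but once the list is created I can no longer call BeautifulSoup methods on the object.

Here is the URL: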

https://www.theherbarium.com/products/?category=Essential%20Oils%20And%20Accessories&subcategory=Superior%20Quality%20Essential%20Oils

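rows = soup.find_all('div', attrs={'class': 'row'})

for row in rows:
    # find() returns only the first matching descendant, or None.
    col = row.find('div', attrs={'class': 'column'})
    link = col.find('a') if col is not None else None
    if link is not None:
        print(link.contents)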

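1 answer:

Answer 0: (score: 1)

It looks like a cookie has to be set before you can see the content on the subcategory page. So, if I understand the question correctly: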

import requests
from bs4 import BeautifulSoup
# You need to store cookies so use a session.
s = requests.Session()
# Request a page to get the cookie.
s.get("https://www.theherbarium.com/products/?category=Essential%20Oils%20And%20Accessories")
# Make the real request.
page = s.get("https://www.theherbarium.com/products/?category=Essential%20Oils%20And%20Accessories&subcategory=Superior%20Quality%20Essential%20Oils")
soup = BeautifulSoup(page.content,'html.parser') 
# Get the div.
divs = soup.find_all('div', attrs={'class': 'col-sm-4 column-spacer'})
# Get the a element text.
for div in divs:
    print (div.find('a').text)

Output:

Balsam Fir 15 ml
Balsam Fir 30 ml
Balsam Fir 5 ml
Basil Essential Oil  15ml
Basil Essential Oil  30ml
Basil Essential Oil  3ml
Basil Essential Oil  5ml
Bergamot Essential Oil  15ml
...
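
For the original question (only the first column of each row), a minimal offline sketch may help. The HTML below is made up for illustration and does not reflect the real page's markup, and `first_column_links` is a hypothetical helper name. It relies on `find()` returning only the first match, and it keeps the results as `Tag` objects so BeautifulSoup methods can still be called on them afterwards:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (an assumption).
html = """
<div class="row">
  <div class="column"><a href="/a1">First A</a></div>
  <div class="column"><a href="/a2">Second A</a></div>
</div>
<div class="row">
  <div class="column"><a href="/b1">First B</a></div>
  <div class="column"><a href="/b2">Second B</a></div>
</div>
<div class="row"></div>
"""


def first_column_links(soup):
    """Return the <a> Tag from the first 'column' div of every 'row' div."""
    links = []
    for row in soup.find_all('div', class_='row'):
        # find() stops at the first matching descendant and returns None
        # when the row has no column at all.
        col = row.find('div', class_='column')
        if col is None:
            continue
        link = col.find('a')
        if link is not None:
            links.append(link)  # store the Tag itself, not a string
    return links


soup = BeautifulSoup(html, 'html.parser')
links = first_column_links(soup)
for link in links:
    # Each item is still a Tag, so BeautifulSoup methods keep working.
    print(link.get_text(strip=True), link.get('href'))
```

Storing the `Tag` objects rather than their text is what keeps methods such as `get_text()` and `get()` available after the list is built.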

If you just want to strip the sizes with a regular expression, add the names to a set:

import requests
from bs4 import BeautifulSoup
import re
# You need to store cookies so use a session.
s = requests.Session()
# Request a page to get the cookie.
s.get("https://www.theherbarium.com/products/?category=Essential%20Oils%20And%20Accessories")
# Make the real request.
page = s.get("https://www.theherbarium.com/products/?category=Essential%20Oils%20And%20Accessories&subcategory=Superior%20Quality%20Essential%20Oils")
soup = BeautifulSoup(page.content,'html.parser') 
# Get the div.
divs = soup.find_all('div', attrs={'class': 'col-sm-4 column-spacer'})
# Get the a element text.
a = set()
for div in divs:
    text = div.find('a').text
    a.add(re.sub(r'\s*\d+\s*ml$', '', text))
print (a)

Output:

    {'Lavender, Bulgarian Essential Oil', 'Thyme, White', 'Mandarin, Red Essential Oil', 'Pine Needle Essential Oil', 'Lemongrass Essential Oil', 'Fir Needle, Siberian', 'Spruce', 'Peppermint', 'Lime Essential Oil', 'Myrrh', 'Juniper Essential Oil', 'Petitgrain', 'Wintergreen', 'Lemon Essential Oil', 'Palmarosa', 'Balsam Fir', 'Chamomile, Roman', 'Cypress', 'Citronella', 'Rosemary', 'Lemon myrtle Essential Oil', 'Clary Sage', 'Cinnamon Bark', 'Frankincense', 'Tangerine', 'Cocoa, Absolute', 'Spearmint', 'Ravensara Essential Oil', 'Spike Lavender Essential Oil', 'Hyssop', 'Ylang Ylang', 'Basil Essential Oil', 'Bergamot Essential Oil', 'Fir Needle, Siberian1', 'Geranium Bourbon', 'Patchouli', 'Black Pepper Essential Oil', 'Fennel', 'Grapefruit Essential Oil', 'Eucalyptus', 'Carrot Seed Essential Oil', 'Chamomile, German', 'Vetiver', 'Tea Tree', 'Ginger', 'Marjoram, Sweet', 'Clove Bud'}
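
The size-stripping step can be tried without any network access. This is a small sketch using a few product names copied from the first output above; `strip_size` and `SIZE_SUFFIX` are names introduced here for illustration, and the case-insensitive flag is an added assumption rather than part of the answer's pattern:

```python
import re

# Trailing size such as " 15 ml" or "  5ml"; a raw string avoids
# invalid-escape warnings for \s and \d.
SIZE_SUFFIX = re.compile(r'\s*\d+\s*ml$', re.IGNORECASE)


def strip_size(name):
    """Remove a trailing '<number> ml' size from a product name."""
    return SIZE_SUFFIX.sub('', name.strip())


samples = [
    'Balsam Fir 15 ml',
    'Balsam Fir 30 ml',
    'Basil Essential Oil  15ml',
    'Basil Essential Oil  3ml',
]

# A set removes the duplicates left once the sizes are gone;
# sorting makes the result order predictable.
names = sorted({strip_size(s) for s in samples})
print(names)
```

A set has no defined order, which is why the output above is not alphabetical; sorting is only needed if a stable order matters.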