How do I scrape a text node with BeautifulSoup?

Time: 2020-10-08 08:46:55

Tags: python html beautifulsoup

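I'm new to BS4 and web scraping, so I apologize in advance for such a basic question.

I'm scraping the Beer Advocate site (https://www.beeradvocate.com/beer/?view=recent), but I can't work out how to get the ABV value, mainly because I'm not sure which tag I should use. According to the HTML inspector, the node is labeled #text, and I'm not sure how to handle that.

Does anyone know how to extract this information?

Thanks.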


2 answers:

Answer 0: (score: 0)

To get the alcohol content (ABV) and the beer name, you can use this example:

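import re

import requests
from bs4 import BeautifulSoup

url = 'https://www.beeradvocate.com/beer/?view=recent'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Match text nodes that end in e.g. "6.8% ABV" and capture the number.
r = re.compile(r'([\d.]+)% ABV$')
# `string=` is the current name for the older `text=` argument (bs4 4.4.0+).
for t in soup.find_all(string=r):
    # The beer name is in the closest preceding <h6> element.
    name = t.find_previous('h6').text
    amount = r.search(t).group(1)
    print('{:<50} {}%'.format(name, amount))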

Prints:

Granát (BrouCzech Dark)                            5%
HopTime Harvest Ale                                6%
Direct Current                                     6.8%
Hopzilla Double IPA                                8.7%
Dankful                                            7.4%
Cancun Commie                                      11.5%
Welcome Young One                                  8.2%
Lick the Spoon                                     12%
Split Open and Melt                                8%
Speedway Stout                                     12%
What Mask?                                         8.4%
Switch Lanes                                       7%
Hella Juice Bag                                    8.2%
Down By The River                                  4.9%
Road Town                                          7.5%
Manhattan Social Club                              12.5%
Flash Kick                                         8.2%
Naked Brunch                                       8.5%
Tiki Breeze                                        7%
Oberon - Mango                                     5.8%
Eldest Brother                                     11%
Bliss                                              8%
Watou Tripel                                       7.5%
Respect Your Elders                                7.25%
Braxton Labs Smoothie Sour: Tropical               4.8%
Heaven Scent                                       5.5%
Oktoberfest                                        6.5%
Phaser                                             6.5%
Mark It Zero!                                      12%
Lake George IPA                                    6.8%
Triangled IPA (⟁)                                  8%
Broo Doo                                           7%
Porter                                             6.5%
Imperial Porter - Rum Barrel Aged w/ Coconut       7.2%
Willow                                             7.1%
State of the Art - Orange DIPA                     8.7%
Fest-Beer                                          5.9%
Boskeun                                            10%
Smuttlabs Baja Hoodie                              8.4%
Trappist Achel 8° Bruin                            8%
Double Dry Hopped Double Mosaic Dream              8.5%
Falcon Smash                                       7.4%
Hazy Wonder                                        6%
Mango Wango                                        7.5%
North Park                                         5%
The Tomb                                           10.2%
Cashmere Hammer                                    6.5%
Chonk Sundae Sour (Peanut Butter and Jelly)        4.3%
The Tearing Of Flesh From Bone                     8.2%
Oktoberfest                                        6.1%
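
To see why this works without hitting the live site, here is a minimal offline sketch. The `SAMPLE_HTML` markup, the `ABV_RE` name, and the `extract_abv` helper are assumptions made for illustration; the real page structure may differ. What the browser inspector shows as `#text` is a bare text node, which BeautifulSoup exposes as a `NavigableString`, so it can be matched with `string=` and then related to nearby tags with `find_previous`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on the listing; not copied from the site.
SAMPLE_HTML = """
<div>
  <h6><a href="#">Direct Current</a></h6>
  <span>Some Brewery</span><br>
  American IPA · 6.8% ABV
</div>
<div>
  <h6><a href="#">Speedway Stout</a></h6>
  <span>Another Brewery</span><br>
  Imperial Stout · 12% ABV
</div>
"""

# Whole or decimal number followed by "% ABV" at the end of the text node.
ABV_RE = re.compile(r'(\d+(?:\.\d+)?)% ABV\s*$')


def extract_abv(html):
    """Return (beer name, ABV as float) pairs found in the given HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    # string=<regex> returns the matching text nodes, not their parent tags.
    for node in soup.find_all(string=ABV_RE):
        # Walk backwards in document order to the nearest <h6> heading.
        name = node.find_previous('h6').get_text(strip=True)
        abv = float(ABV_RE.search(node).group(1))
        results.append((name, abv))
    return results


results = extract_abv(SAMPLE_HTML)
for name, abv in results:
    print(name, abv)
```

Converting the captured value to `float` is optional; it is only convenient if you want to sort or filter by ABV afterwards.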

Answer 1: (score: 0)

Here, you can use bs4 to get the page text and then use a regular expression to extract all the matching ABV strings.

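from bs4 import BeautifulSoup
import re

webpage = "YOUR_WEBPAGE_STRING"  # placeholder for the page's HTML source

soup = BeautifulSoup(webpage, features="html.parser")
txt = soup.text  # all text on the page, with the tags stripped

# Raw string avoids an invalid-escape warning for \d.
# Note: the `^` alternative adds one empty match at the start of the text,
# and ` \d+` only matches whole numbers, so values like 6.8% are skipped.
x = re.findall(r"^| \d+% ABV", txt)

print(x)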

For the given link, you get output like this:

['', ' 5% ABV', ' 6% ABV', ' 12% ABV', ' 8% ABV', ' 12% ABV', ' 7% ABV', ' 7% ABV', ' 11% ABV', ' 8% ABV', ' 12% ABV', ' 8% ABV', ' 7% ABV', ' 10% ABV', ' 8% ABV', ' 6% ABV', ' 5% ABV']
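
The output above starts with an empty string and contains only whole-number values, because the `^` alternative matches once at the start of the text and `\d+` does not match decimals such as 6.8%. A possible variant of the pattern that also captures decimal values is sketched below. The `sample_text` string and the `ABV_PATTERN` name are assumptions for illustration, standing in for whatever `soup.text` returns.

```python
import re

# Hypothetical text, standing in for the page text returned by soup.text.
sample_text = "Stout · 12% ABV\nIPA · 6.8% ABV\nLager · 5% ABV"

# One or more digits, optionally followed by a decimal part, then "% ABV".
# The capture group makes findall return only the numeric portion.
ABV_PATTERN = re.compile(r"(\d+(?:\.\d+)?)% ABV")

abv_values = [float(value) for value in ABV_PATTERN.findall(sample_text)]
print(abv_values)
```

Note that this approach only returns the ABV numbers; it loses the link to the beer name, which the first answer keeps by searching from each matched text node.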