Scraping text from <p> tags

Time: 2021-05-25 08:19:01

Tags: python web-scraping beautifulsoup

This is text that I want to scrape

This is the HTML code for it

soup = BeautifulSoup(html_text, 'html.parser')
p_tags = soup.find_all('p')[15:24]
for p_tag in p_tags:
    for b in p_tags.find_all('b'):
        data = b.string
        print(data)

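The code above returns nothing, but it does not raise an error either. What changes are needed?

3 answers:

Answer 0 (score: 2)

To get the desired list, you can use the following example: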

import requests
from bs4 import BeautifulSoup


url = "https://www.the-future-of-commerce.com/2020/03/20/brands-with-the-best-customer-service/"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0"
}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

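# Locate the heading that introduces the list, then walk its <p> siblings.
# `string=` is the current name of the older `text=` argument.
h2 = soup.find("h2", string="Top 10 brands with the best customer service")
for row in h2.find_next_siblings(
    # Keep only paragraphs whose descendants are exactly one <b> and one <span>.
    lambda tag: tag.name == "p"
    and [t.name for t in tag.find_all()] == ["b", "span"]
):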
    b = row.b.get_text(strip=True)
    span = row.span.get_text(strip=True)
    print("{:<30} {}".format(b, span))

Prints:

1. Disney Cruise Line:         Service Score –– 9.59 out of 10
2. See’s Candies:              Service Score –– 9.38 out of 10
3. Justice:                    Service Score –– 9.24 out of 10
4. Lands’ End:                 Service Score –– 9.18 out of 10
5. Chick-fil-a:                Service Score –– 9.11 out of 10
6. Publix:                     Service Score –– 9.07 out of 10
7. Vitacost:                   Service Score –– 9.04 out of 10
8. Avon:                       Service Score –– 9.02 out of 10
9. Morton’s The Steakhouse:    Service Score –– 9.02 out of 10
10. Cracker Barrel:            Service Score –– 9.01 out of 10

Or:

for span in soup.select("b + span"):
    if "Service Score" not in span.text:
        continue
    print(
        span.find_previous("b").text, span.text.replace("Service Score –– ", "")
    )

Prints:

1. Disney Cruise Line:  9.59 out of 10
2. See’s Candies:  9.38 out of 10
3. Justice:  9.24 out of 10
4. Lands’ End:  9.18 out of 10
5. Chick-fil-a:  9.11 out of 10
6. Publix:  9.07 out of 10
7. Vitacost:  9.04 out of 10
8. Avon:  9.02 out of 10
9. Morton’s The Steakhouse:  9.02 out of 10
10. Cracker Barrel:  9.01 out of 10
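
The selector variant can also be tried without a network request. The following is a minimal sketch, assuming a small hand-written HTML string (the `html` value and the "Note:" row are made up for illustration and are not taken from the real page). It shows that `b + span` matches a span that directly follows a b, and that the "Service Score" check filters out unrelated pairs.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the middle row is deliberately unrelated.
html = (
    "<p><b>1. Disney Cruise Line:</b><span> Service Score –– 9.59 out of 10</span></p>"
    "<p><b>Note:</b><span> unrelated text</span></p>"
    "<p><b>3. Justice:</b><span> Service Score –– 9.24 out of 10</span></p>"
)
soup = BeautifulSoup(html, "html.parser")

rows = []
for span in soup.select("b + span"):  # a <span> immediately after a <b>
    if "Service Score" not in span.text:
        continue  # skip pairs that are not score rows
    name = span.find_previous("b").get_text(strip=True)
    score = span.get_text(strip=True).replace("Service Score –– ", "")
    rows.append((name, score))

for name, score in rows:
    print(name, score)
```

Collecting the pairs in `rows` before printing keeps the extraction separate from the output formatting, which makes the result easier to reuse.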

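Answer 1 (score: 0)

The second loop extracts unnecessary b tags. You have a list of p tags, each containing only one b and one span tag. You can run a single loop over the p tags and extract the b and span tags with p.find('b') and p.find('span').

Here is an example with a small payload.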

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="post-single-content selectionShareable"> <p><b>1. Disney Cruise Line:</b><span style="font-weight:400;"> Service Score –– 9.59 out of 10</span></p><p><b>2. See’s Candies: </b><span style="font-weight:400;">Service Score –– 9.38 out of 10</span></p><p><b>3. Justice:</b><span style="font-weight:400;"> Service Score –– 9.24 out of 10</span></p></div>', "html.parser")

p_tags = list(soup.find_all('p'))


for p in p_tags:
    b_tags = p.find('b')
    span_tags = p.find('span')

    b_text = b_tags.getText() if b_tags else ""
    span_text = span_tags.getText() if span_tags else ""

    print(b_text + span_text)
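
For comparison, here is a minimal sketch of the loop from the question with the smallest possible change, assuming the same kind of markup as the payload above (the `html_text` value is a shortened, hypothetical sample). The inner loop in the question calls `find_all` on `p_tags`, the whole result list, rather than on `p_tag`, the current paragraph. A result list has no `find_all`, so that call would normally raise an error; since no error was reported, one plausible explanation is that the `[15:24]` slice was empty on the real page and the loop body never ran. That is an assumption, not something the question confirms.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, shortened from the payload above.
html_text = (
    "<div><p><b>1. Disney Cruise Line:</b>"
    "<span> Service Score –– 9.59 out of 10</span></p>"
    "<p><b>2. See’s Candies: </b>"
    "<span>Service Score –– 9.38 out of 10</span></p></div>"
)

soup = BeautifulSoup(html_text, "html.parser")

names = []
for p_tag in soup.find_all("p"):
    # Search inside the current <p>, not the whole result list.
    for b in p_tag.find_all("b"):
        names.append(b.get_text(strip=True))

print(names)
```

The slice is left out here on purpose: checking `len(soup.find_all("p"))` on the real page first would show whether indexes 15 to 23 exist at all.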

Answer 2 (score: 0)

This prints the matching paragraphs:

for p_tag in (p_tags := soup.find_all(lambda tag: tag.name == "p" and "Service Score" in tag.text)):
    print(p_tag.text.replace(" Service Score ––", ""))