Web scraping with bs4

Date: 2021-06-29 08:06:08

Tags: python web-scraping beautifulsoup

Here is my code:

import requests
import bs4

r = requests.get(base_url)
all_para = ""
soup = bs4.BeautifulSoup(r.text, 'html.parser')
for iteri in range(len(headers)):
    deet = soup.find('h3', text=headers[iteri])  # Find the h3 tag whose text matches this header
    for para in deet.find_next_siblings():  # Walk the tags that follow the h3
        if para.name == "h2" or para.name == "h3":
            break
        elif para.name == "p":
            all_para += para.get_text()
            all_para += '\n'

But I am getting this error:

AttributeError                            Traceback (most recent call last)
<ipython-input-17-ed21c6e3d415> in <module>
      4 for iteri in range(len(headers)):
      5     deet = soup.find('h3', text = headers[iteri]) # Search for div tags of class 'entry-content content'
----> 6     for para in deet.find_next_siblings(): # Within these tags, find all p tags
      7         if para.name == "h2" or para.name == "h3":
      8             break

AttributeError: 'NoneType' object has no attribute 'find_next_siblings'

But I don't understand why this error occurs.

2 answers:

Answer 0 (score: 0)

deet appears to be None; check for that before the loop.

deet = soup.find('h3', text = headers[iteri]) # None returned
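
soup.find returns None when no <h3> tag has exactly that text. A minimal sketch of the guard, reusing the variable names from the question (base_url, headers and soup are assumed to be defined as in the original code):

for heading in headers:
    deet = soup.find('h3', text=heading)
    if deet is None:
        # heading text not present in the page; skip it instead of crashing
        continue
    for para in deet.find_next_siblings():
        if para.name in ("h2", "h3"):
            break
        if para.name == "p":
            all_para += para.get_text() + '\n'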

Answer 1 (score: 0)

To get every <h3> header and the <p> tags under it, you can combine .find_next_siblings with .find_previous (the .find_previous check keeps only paragraphs whose nearest preceding <h3> is the current one, since paragraphs of later sections are also following siblings):

import requests
from bs4 import BeautifulSoup
from textwrap import wrap

url = "https://en.wikipedia.org/wiki/Wikipedia"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

out = []
for h3 in soup.select("h3"):
    ps = [
        p.get_text(strip=True, separator=" ")
        for p in h3.find_next_siblings("p")
        if p.find_previous("h3") == h3
    ]
    if ps:
        out.append((h3.get_text(strip=True), "\n".join(ps)))

# print the headers and paragraphs:
for h, p in out:
    print(h)
    print()
    print("\n".join(wrap(p)))
    print("-" * 80)

Prints:

Nupedia

Other collaborative online encyclopedias were attempted before
Wikipedia, but none were as successful. [15] Wikipedia began as a
complementary project for Nupedia , a free online English-language
encyclopedia project whose articles were written by experts and
reviewed under a formal process. [16] It was founded on March 9, 2000,
under the ownership of Bomis , a web portal company. Its main figures
were Bomis CEO Jimmy Wales and Larry Sanger , editor-in-chief for
Nupedia and later Wikipedia. [1] [17] Nupedia was initially licensed
under its own Nupedia Open Content License, but even before Wikipedia
was founded, Nupedia switched to the GNU Free Documentation License at
the urging of Richard Stallman . [18] Wales is credited with defining
the goal of making a publicly editable encyclopedia, [19] [20] while
Sanger is credited with the strategy of using a wiki to reach that
goal. [21] On January 10, 2001, Sanger proposed on the Nupedia mailing
list to create a wiki as a "feeder" project for Nupedia. [22]
--------------------------------------------------------------------------------
Launch and early growth

The domains wikipedia.com (redirecting to wikipedia.org ) and
wikipedia.org were registered on January 12, 2001, [23] and January
13, 2001, [24] respectively, and Wikipedia was launched on January 15,


...and so on.