Why doesn't beautifulsoup4's find_all() function capture all <h3> tags?

Asked: 2020-06-14 23:33:35

Tags: python web-scraping beautifulsoup

import requests
import pprint as pp
from bs4 import BeautifulSoup as soup
headers = {
    'User-Agent': 'some_name',
    'From': 'some_email'
}
URL = 'https://www.reddit.com/r/wallstreetbets/'
page = requests.get(URL, headers=headers)
page_html = page.content

page_soup = soup(page_html, "html.parser")

print(page_soup.find_all('h3'))


print(page.status_code)
page.close()

This is my first time using beautifulsoup and I'm trying to learn how to use it. For some reason, when I try to grab the tags, it only gets the first 8 and then stops. I don't understand how to get every tag. I tried specifying the class, but that didn't solve the problem.
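For what it's worth, find_all() itself does return every matching tag present in the markup it is given, as a minimal offline check with made-up HTML shows, so the posts beyond the first 8 were likely never in page.content to begin with (new Reddit renders most posts with JavaScript after the initial page load):

```python
from bs4 import BeautifulSoup

# Made-up HTML: find_all() returns every <h3> present in the parsed markup.
html = "<div><h3>a</h3><h3>b</h3><h3>c</h3></div>"
soup = BeautifulSoup(html, "html.parser")
print(len(soup.find_all("h3")))  # → 3
```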

2 Answers:

Answer 0 (score: 2)

To get all the links, you can use the old version of Reddit (old.reddit.com), which serves fully server-rendered HTML.

For example:

import requests
from bs4 import BeautifulSoup as soup


URL = 'https://old.reddit.com/r/wallstreetbets/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',}
page_soup = soup(requests.get(URL, headers=headers).content, "html.parser")

for p in page_soup.select('p.title'):
    print(p.get_text(strip=True, separator=' '))

Prints:

What Are Your Moves Tomorrow, June 15, 2020 Daily Discussion ( self.wallstreetbets )
They are getting ready for Monday. Meme ( v.redd.it )
Chill Session incoming this week Meme ( v.redd.it )
Just a bull huntin for some calls Meme ( v.redd.it )
this does not feel bullish Meme ( i.imgur.com )
I'm from the past. Here's what's going to happen. Discussion ( self.wallstreetbets )
Bulls tread lightly we're in for a gong show Discussion ( self.wallstreetbets )
I've been workin' on this meme for a while...It's about Friendship Meme ( v.redd.it )
I've got a great idea to fix my portfolio ( sound on ) OC Meme ( v.redd.it )
Welcome to the Kang Gang OC Meme ( i.redd.it )
DDDD - Retail Investors, Bankruptcies, Dark Pools and Beauty Contests OC DD ( self.wallstreetbets )
We made WSJ lol Discussion ( wsj.com )
The Great Gay Bear Trade Fundamentals ( self.wallstreetbets )
US Important news this week (est) Discussion ( self.wallstreetbets )
How George Floyd Cured COVID (and why we're never locking down again) DD ( self.wallstreetbets )
The Kang Gang Manifesto - A 2-month journey from $120k to $210k Gain ( self.wallstreetbets )
The unofficial wallstreetbets alignment chart Meme ( i.redd.it )
Bigly expirations this Friday, watch out Discussion ( self.wallstreetbets )
Amazon Set to Face Antitrust Charges in European Union Stocks ( nytimes.com )
The Convergence of Retardation and Philanthropy......Autists United, Inc. DD ( self.wallstreetbets )
Ending the Kangaroo Market (Sound On) Meme ( v.redd.it )
Hey Dontsweatit32 - hold my beer and take a ban Options ( i.redd.it )
Hewooo Retards, Carebear here warning you about the incoming Monday's rug pull. DD ( self.wallstreetbets )
DGLY Sympathy Plays Discussion ( self.wallstreetbets )
Is Apple going going to another new All Time High??? Discussion ( self.wallstreetbets )
I'm all in on spce YOLO ( self.wallstreetbets )

EDIT: If you want to use the new version, you can try the following example (it uses the re/json modules to parse the JavaScript embedded in the page):

import re
import json
import requests
from bs4 import BeautifulSoup as soup


URL = 'https://www.reddit.com/r/wallstreetbets/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',}
page_soup = soup(requests.get(URL, headers=headers).content, "html.parser")

txt = page_soup.select_one('script#data').contents[0]

data = json.loads(re.search(r'window\.___r = (.*?});', txt).group(1))

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for v in data['posts']['models'].values():
    print(v['title'])

Prints:

What Are Your Moves Tomorrow, June 15, 2020
They are getting ready for Monday.
Chill Session incoming this week
Just a bull huntin for some calls
this does not feel bullish
I'm from the past. Here's what's going to happen.
Bulls tread lightly we're in for a gong show
I've been workin' on this meme for a while...It's about Friendship
DDDD - Retail Investors, Bankruptcies, Dark Pools and Beauty Contests
I've got a great idea to fix my portfolio ( sound on )
Welcome to the Kang Gang
We made WSJ lol
The Great Gay Bear Trade
US Important news this week (est)
How George Floyd Cured COVID (and why we're never locking down again)
The Kang Gang Manifesto - A 2-month journey from $120k to $210k
The unofficial wallstreetbets alignment chart
Bigly expirations this Friday, watch out
Amazon Set to Face Antitrust Charges in European Union
The Convergence of Retardation and Philanthropy......Autists United, Inc.
Ending the Kangaroo Market (Sound On)
Hey Dontsweatit32 - hold my beer and take a ban
Hewooo Retards, Carebear here warning you about the incoming Monday's rug pull.
We did it again. The second wave is coming soon and I am all in with PUTs in everything!
I'm all in on spce
DGLY Sympathy Plays
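The `window.___r` extraction used above can be sanity-checked offline against a made-up stand-in for the inline script contents (the JSON payload here is invented, not real Reddit data):

```python
import re
import json

# Invented stand-in for the contents of the <script id="data"> tag.
txt = 'window.___r = {"posts": {"models": {"abc": {"title": "Hello"}}}}; window.___prefetches = [];'

# The non-greedy match up to the first "};" captures just the JSON object literal.
data = json.loads(re.search(r'window\.___r = (.*?});', txt).group(1))

for v in data['posts']['models'].values():
    print(v['title'])  # → Hello
```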

Answer 1 (score: 0)

I couldn't find an error in your code, but this did work for me:

import requests
from bs4 import BeautifulSoup

url = "https://www.reddit.com/r/wallstreetbets/"
headers = {"User-Agent": "wswp"}

with requests.Session() as session:
    response = session.get(url, headers=headers)
    content = response.content

soup = BeautifulSoup(content, "html.parser")
titles = soup.find_all("h3")
for h3 in titles:
    print(h3.text)

What Are Your Moves Tomorrow, June 15, 2020

They are getting ready for Monday.

Chill Session incoming this week

Just a bull huntin for some calls

this does not feel bullish

I'm from the past. Here's what's going to happen.

Bulls tread lightly we're in for a gong show

I've got a great idea to fix my portfolio ( sound on )

I would only suggest changing the User-Agent, because Reddit blocks User-Agents that send too many requests.
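One way to do that is to set a browser-like User-Agent once on a requests.Session, so every request made through the session carries it (the UA string below is just an example; substitute any realistic browser string):

```python
import requests

# Example only: any realistic browser User-Agent string can be used here.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0",
})

# The header is now sent with every session.get()/session.post() call.
print(session.headers["User-Agent"])
```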
