How to scrape popular posts

Time: 2021-01-21 10:30:47

Tags: python web-scraping web-crawler

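I want to extract the popular posts from the following URL: https://healthunlocked.com/positivewellbeing/posts. There is a button defined for popular posts. To extract the posts under this button, I used the code below, but nothing is returned.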

def parse(self, response):

    popular_posts = response.css('button.postFilterInline__RoundedButton-cftabs-1 eYSfqD')

    listtitles = []
    #listpost=[]
    #listreplies=[]
    #listpost_link=[]
    #listauthor=[]

    for populars in popular_posts:

        all_div_posts = populars.css('.results-post')

        for posts in all_div_posts:

            for title in posts.css('.results-post .results-post__title::text').extract():

                listtitles.append(title)

        yield {"title": listtitles}

3 Answers:

Answer 0 (score: 0)

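That site is built with JavaScript, so going through the HTML with selectors is not a good idea. If you open the Network tab in your browser and click the "Popular" button, you will see that an HTTP request is sent and the response is JSON.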

Try scraping https://solaris.healthunlocked.com/posts/positivewellbeing/popular directly and processing the JSON :)
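Since the question uses Scrapy, here is a minimal sketch of how the same idea might look there. It assumes the endpoint returns a JSON array of post objects that each carry a "title" key; the helper name titles_from_json and the spider name are made up for illustration. The Scrapy import is guarded only so the helper can be tried on its own.

```python
import json


def titles_from_json(body):
    """Return the titles found in a JSON array of post objects."""
    posts = json.loads(body)
    # Skip entries without a "title" key instead of raising KeyError.
    return [post["title"] for post in posts if "title" in post]


try:
    import scrapy
except ImportError:
    scrapy = None  # the helper above can still be used without Scrapy

if scrapy is not None:

    class PopularPostsSpider(scrapy.Spider):
        # Placeholder spider name chosen for this sketch.
        name = "popular_posts"
        # Request the JSON endpoint instead of the rendered HTML page.
        start_urls = [
            "https://solaris.healthunlocked.com/posts/positivewellbeing/popular"
        ]

        def parse(self, response):
            # response.text holds the raw JSON body of the response.
            for title in titles_from_json(response.text):
                yield {"title": title}
```

This yields one item per post rather than one growing list, which is usually easier to export from Scrapy.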

Answer 1 (score: 0)

Just in case

As Maximo mentioned, you can get your information as a JSON string, which you can parse with the json.loads() method.

Example

import requests, json
headers = {"user-agent": "Mozilla/5.0"}
url = 'https://solaris.healthunlocked.com/posts/positivewellbeing/popular'
r = requests.get(url,headers=headers)

posts = json.loads(r.text)

for post in posts:
    print(post['title'])

Output

Ok my Pilker baiters name  these four younger photos  of celebrities ? 
Ok, I'm  reminiscing for our UK  members 
Sonya refused to get out of bed today ???? 
Walk early this morning LOW SELF ESTEEM CHAT  
Saturday Night “Plenty of fish” ????

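Edit

To get more posts, you have to call the URL with the pageNumber parameter and loop. The example uses a depth of 2 pages.

There is also a second loop that goes to each post's URL and fetches the text of the post. Note that, as mentioned above, you will not get the whole post, because you would have to be logged in for that.

Please be gentle with the server and use sleep to delay your requests.

Example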

import requests, json
from bs4 import BeautifulSoup
from time import sleep

headers = {"user-agent": "Mozilla/5.0"}
pages = 2
data = []

# range() excludes its end value, so use pages + 1 to fetch both pages
for page in range(1, pages + 1):

    url = 'https://solaris.healthunlocked.com/posts/positivewellbeing/popular?pageNumber={0}'.format(page)
    r = requests.get(url, headers=headers)

    posts = json.loads(r.text)

    for post in posts:
        sleep(3.5)  # pause before each post request
        url = 'https://healthunlocked.com/positivewellbeing/posts/{0}'.format(post['postId'])
        r = requests.get(url, headers=headers)
        soup = BeautifulSoup(r.text, 'lxml')
        body = soup.select_one('div.post-body')
        # select_one() returns None when the element is missing
        post['postText'] = body.get_text('|', strip=True) if body else None
        data.append(post)  # append every post, not only the last one of each page

    sleep(0.8)

print(data)
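If you do not know the number of pages in advance, one option is to keep requesting pages until one comes back empty. The sketch below assumes that pageNumber starts at 1 and that the endpoint returns an empty JSON array past the last page; neither is confirmed here. The names page_url and collect_posts are made up for illustration, and fetch is any callable that takes a URL and returns the response body as text.

```python
import json
from time import sleep
from urllib.parse import urlencode

BASE_URL = "https://solaris.healthunlocked.com/posts/positivewellbeing/popular"


def page_url(page):
    """Build the URL for one page of popular posts."""
    return "{0}?{1}".format(BASE_URL, urlencode({"pageNumber": page}))


def collect_posts(fetch, max_pages, delay=1.0):
    """Collect posts page by page, stopping early at the first empty page."""
    collected = []
    for page in range(1, max_pages + 1):
        posts = json.loads(fetch(page_url(page)))
        if not posts:
            # Assumption: an empty list means there are no more pages.
            break
        collected.extend(posts)
        sleep(delay)  # be gentle with the server between page requests
    return collected
```

With requests this could be called as collect_posts(lambda url: requests.get(url, headers=headers).text, max_pages=10). The max_pages cap keeps the loop bounded even if the endpoint never returns an empty page.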

Answer 2 (score: -1)

I am using Selenium

Install Selenium:

pip install selenium webdriver_manager

Code:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import time


# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://healthunlocked.com/positivewellbeing/posts")
time.sleep(2)
driver.find_element(By.ID, "ccc-notify-accept").click()  # click "I'm ok with that" on the cookies prompt

for button in driver.find_elements(By.TAG_NAME, "button"):
    if button.text == "Popular":
        button.click()
        time.sleep(0.5)
        break
# the find_element_by_* helpers were removed in Selenium 4; use find_element(By.<strategy>, value)
for post in driver.find_element(By.CLASS_NAME, "results-posts").find_elements(By.CLASS_NAME, "results-post"):
    print(post.text)