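I want to extract the popular posts from the following URL: https://healthunlocked.com/positivewellbeing/posts. The page has a button labelled "Popular". To extract the posts shown under this button I wrote the code below, but nothing is returned.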
def parse(self, response):
    popular_posts = response.css('button.postFilterInline__RoundedButton-cftabs-1 eYSfqD')
    listtitles = []
    #listpost=[]
    #listreplies=[]
    #listpost_link=[]
    #listauthor=[]
    for populars in popular_posts:
        all_div_posts = populars.css('.results-post')
        for posts in all_div_posts:
            for title in posts.css('.results-post .results-post__title::text').extract():
                listtitles.append(title)
    yield {"title": listtitles}
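Answer 0 (score: 0)
That site is built with JavaScript, so walking the HTML with selectors is not a good approach. If you open the Network tab in your browser's developer tools and click the "Popular" button, you will see that it sends an HTTP request and gets JSON back.
Try scraping https://solaris.healthunlocked.com/posts/positivewellbeing/popular directly and processing the JSON :)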
Answer 1 (score: 0)
As Maximo mentioned, you can get your information as a JSON string, which you can parse with the json.loads() method.
Example
import requests, json

headers = {"user-agent": "Mozilla/5.0"}
url = 'https://solaris.healthunlocked.com/posts/positivewellbeing/popular'
r = requests.get(url, headers=headers)
posts = json.loads(r.text)
for post in posts:
    print(post['title'])
Output
Ok my Pilker baiters name these four younger photos of celebrities ?
Ok, I'm reminiscing for our UK members
Sonya refused to get out of bed today ????
Walk early this morning LOW SELF ESTEEM CHAT
Saturday Night “Plenty of fish” ????
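If the response shape ever differs from what the loop above expects, `post['title']` raises a `KeyError`. A slightly more defensive version of the parsing step might look like the sketch below. The function name `extract_titles` and the sample payload are hypothetical, and the sketch assumes the endpoint returns a JSON array of objects.

```python
import json


def extract_titles(payload_text):
    """Return the titles found in a JSON array of post objects."""
    posts = json.loads(payload_text)
    if not isinstance(posts, list):
        raise ValueError("expected a JSON array of posts")
    # Skip entries that are not objects or that lack a 'title' key.
    return [p["title"] for p in posts if isinstance(p, dict) and "title" in p]


# Hypothetical sample payload, shaped like the response is assumed to be.
sample = '[{"postId": 1, "title": "First"}, {"postId": 2}, {"postId": 3, "title": "Third"}]'
print(extract_titles(sample))  # prints ['First', 'Third']
```

With requests, `r.text` from the example above can be passed straight to this function.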
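To get more posts, you have to call the URL with the pageNumber parameter and loop over the pages. The example uses a depth of 2 pages.
There is also a second loop that goes to each post's URL and fetches the text of the post. Note that you will not get the whole post this way, because that requires being logged in.
Please be gentle with the server and use sleep to delay your requests.
Example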
import requests, json
from bs4 import BeautifulSoup
from time import sleep

headers = {"user-agent": "Mozilla/5.0"}
pages = 2
data = []

# range() excludes its end value, so pages + 1 is needed to cover pages 1..pages
for page in range(1, pages + 1):
    url = 'https://solaris.healthunlocked.com/posts/positivewellbeing/popular?pageNumber={0}'.format(page)
    r = requests.get(url, headers=headers)
    posts = json.loads(r.text)
    for post in posts:
        sleep(3.5)  # pause before fetching each post page
        url = 'https://healthunlocked.com/positivewellbeing/posts/{0}'.format(post['postId'])
        r = requests.get(url, headers=headers)
        soup = BeautifulSoup(r.text, 'lxml')
        body = soup.select_one('div.post-body')
        # select_one() returns None when nothing matches
        post['postText'] = body.get_text('|', strip=True) if body else None
        data.append(post)
    sleep(0.8)  # pause between listing pages

print(data)
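The paging part of the example above could also be pulled out into a small generator, which keeps the delay and the pageNumber handling in one place. This is only a sketch: `iter_posts`, `BASE_URL` and the `fetch` callable are names invented here, and stopping on an empty page is an assumption about how the endpoint behaves past the last page.

```python
import json
from time import sleep
from urllib.parse import urlencode

BASE_URL = "https://solaris.healthunlocked.com/posts/positivewellbeing/popular"


def iter_posts(fetch, pages, delay=1.0):
    """Yield posts from pages 1..pages, pausing after each page request.

    `fetch` is any callable that takes a URL and returns the body as text.
    """
    for page in range(1, pages + 1):
        url = "{0}?{1}".format(BASE_URL, urlencode({"pageNumber": page}))
        posts = json.loads(fetch(url))
        if not posts:
            # Assumption: an empty list means there are no more pages.
            break
        for post in posts:
            yield post
        sleep(delay)  # be gentle with the server
```

With requests it could be called as `iter_posts(lambda u: requests.get(u, headers=headers).text, pages=2)`. Passing the fetch function in also makes the loop easy to try out with canned data before hitting the real server.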
Answer 2 (score: -1)
I'm using Selenium.
Install Selenium:
pip install selenium webdriver_manager
Code:
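from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import time

# Selenium 4 style: the driver path is passed through a Service object and
# elements are located with find_element(By..., ...); the older
# find_element_by_* helpers were removed in Selenium 4.3.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://healthunlocked.com/positivewellbeing/posts")
time.sleep(2)
driver.find_element(By.ID, "ccc-notify-accept").click()  # click "I'm ok with that" on the cookies prompt
for button in driver.find_elements(By.TAG_NAME, "button"):
    if button.text == "Popular":
        button.click()
        time.sleep(0.5)
        break
results = driver.find_element(By.CLASS_NAME, "results-posts")
for post in results.find_elements(By.CLASS_NAME, "results-post"):
    print(post.text)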