I am using bs4 to scrape Product Hunt.
Taking this post as an example, when I scrape it with the code below, the "Discussion" section is missing entirely.
import pprint
import bs4
import requests
res = requests.get('https://producthunt.com/posts/weights-biases')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
pprint.pprint(soup.prettify())
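I suspect this has to do with lazy loading (when you open the page, the "Discussion" section takes an extra second or two to appear).
How do I scrape lazy-loaded components? Or is this something else entirely?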
Answer 0: (score: 1)
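Some elements of the page appear to be loaded dynamically through JavaScript requests.
The requests library lets you send those requests manually and then parse the updated page content with bs4.
However, in my experience with dynamic web pages, this approach can get tedious if you have many requests to send.
In these cases it is usually better to use a library that integrates real browser emulation. The emulator then handles the client-server communication and updates the page itself; you only have to wait for the elements to load, and then you can analyze them safely.
So I suggest you take a look at selenium, or even selenium-requests if you prefer to keep the requests 'philosophy'.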
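Answer 1: (score: 0)
This is how you can get the comments in the discussion. You can always modify the script to get the replies for each thread as well.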
import json
import requests
from pprint import pprint
url = 'https://www.producthunt.com/frontend/graphql'
payload = {"operationName":"PostPageCommentsSection","variables":{"commentsListSubjectThreadsCursor":"","commentsThreadRepliesCursor":"","slug":"weights-biases","includeThreadForCommentId":None,"commentsListSubjectThreadsLimit":10},"query":"query PostPageCommentsSection($slug:String$commentsListSubjectThreadsCursor:String=\"\"$commentsListSubjectThreadsLimit:Int!$commentsThreadRepliesCursor:String=\"\"$commentsListSubjectFilter:ThreadFilter$includeThreadForCommentId:ID$excludeThreadForCommentId:ID){post(slug:$slug){id canManage ...PostPageComments __typename}}fragment PostPageComments on Post{_id id slug name ...on Commentable{_id id canComment __typename}...CommentsSubject ...PostReviewable ...UserSubscribed meta{canonicalUrl __typename}__typename}fragment PostReviewable on Post{id slug name canManage featuredAt createdAt disabledWhenScheduled ...on Reviewable{_id id reviewsCount reviewsRating isHunter isMaker viewerReview{_id id sentiment comment{id body __typename}__typename}...on Commentable{canComment commentsCount __typename}__typename}meta{canonicalUrl __typename}__typename}fragment CommentsSubject on Commentable{_id id ...CommentsListSubject __typename}fragment CommentsListSubject on Commentable{_id id threads(first:$commentsListSubjectThreadsLimit after:$commentsListSubjectThreadsCursor filter:$commentsListSubjectFilter include_comment_id:$includeThreadForCommentId exclude_comment_id:$excludeThreadForCommentId){edges{node{_id id ...CommentThread __typename}__typename}pageInfo{endCursor hasNextPage __typename}__typename}__typename}fragment CommentThread on Comment{_id id isSticky replies(first:5 after:$commentsThreadRepliesCursor allForCommentId:$includeThreadForCommentId){edges{node{_id id ...Comment __typename}__typename}pageInfo{endCursor hasNextPage __typename}__typename}...Comment __typename}fragment Comment on Comment{_id id badges body bodyHtml canEdit canReply canDestroy createdAt isHidden path repliesCount subject{_id id ...on Commentable{_id id __typename}__typename}user{_id id headline name firstName username headline ...UserSpotlight __typename}poll{...PollFragment __typename}review{id sentiment __typename}...CommentVote ...FacebookShareButtonFragment __typename}fragment CommentVote on Comment{_id id ...on Votable{_id id hasVoted votesCount __typename}__typename}fragment FacebookShareButtonFragment on Shareable{id url __typename}fragment UserSpotlight on User{_id id headline name username ...UserImage __typename}fragment UserImage on User{_id id name username avatar headline isViewer ...KarmaBadge __typename}fragment KarmaBadge on User{karmaBadge{kind score __typename}__typename}fragment PollFragment on Poll{id answersCount hasAnswered options{id text imageUuid answersCount answersPercent hasAnswered __typename}__typename}fragment UserSubscribed on Subscribable{_id id isSubscribed __typename}"}
r = requests.post(url,json=payload)
for item in r.json()['data']['post']['threads']['edges']:
    pprint(item['node']['body'])
Output at this moment:
('Looks like such a powerful tool for extracting performance insights! '
'Absolutely love the documentation feature, awesome work!')
('This is awesome and so Any discounts or special pricing for '
'researchers/students/non-professionals?')
'Amazing. I think this is very helpful tools for us. Keep it up & go ahead.'
('<p>This simple system of record automatically saves logs from every '
'experiment, making it easy to look over the history of your progress and '
'compare new models with existing baselines.</p>\n'
'Pros: <p>Easy, fast, and lightweight experiment tracking</p>\n'
'Cons: <p>Only available for Python projects</p>')
('Very cool! I hacked together something similar but much more basic for '
"personal use and always wondered why TensorBoard didn't solve this problem. "
'I just wish this was open source! :) P.S. awesome use of the parallel '
'co-ordinates d3.js chart - great idea to apply it to experiment '
'configurations!')
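To get the replies of each thread as well, a sketch along the following lines may help. It assumes the response keeps the shape requested by the query above (threads and replies as edges/node lists, each with a pageInfo block); iter_comments, next_cursor and the sample data are made up for illustration and do not come from Product Hunt.

```python
def iter_comments(response_json):
    """Yield (depth, body) pairs: depth 0 for a thread's top-level comment,
    depth 1 for each reply included in the response."""
    threads = response_json["data"]["post"]["threads"]
    for edge in threads["edges"]:
        node = edge["node"]
        yield 0, node["body"]
        # "replies" follows the CommentThread fragment of the query
        for reply_edge in node.get("replies", {}).get("edges", []):
            yield 1, reply_edge["node"]["body"]


def next_cursor(response_json):
    """Return the cursor for the next page of threads, or None on the last page."""
    page_info = response_json["data"]["post"]["threads"]["pageInfo"]
    return page_info["endCursor"] if page_info["hasNextPage"] else None


# Hand-written sample mirroring the assumed response shape (not real data).
sample = {
    "data": {"post": {"threads": {
        "edges": [
            {"node": {"body": "first thread", "replies": {"edges": [
                {"node": {"body": "a reply"}},
            ]}}},
            {"node": {"body": "second thread", "replies": {"edges": []}}},
        ],
        "pageInfo": {"endCursor": "abc", "hasNextPage": True},
    }}}
}

for depth, body in iter_comments(sample):
    print("    " * depth + body)
```

With the script above, iter_comments(r.json()) would replace the loop over edges, and a non-None next_cursor(r.json()) could be written into payload["variables"]["commentsListSubjectThreadsCursor"] before posting again to fetch the next page.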