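However, scraping with requests failed.
The scraped result contains an empty <div id="social-comment">.
I want to scrape the page's full HTML (all elements).
This is the source code I tried in Python: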
#-*- coding: utf-8 -*-
import requests
url = 'http://m.entertain.naver.com/comment/list?page=2&gno=news117%2C0002600716&sort=newest&aid=0002600716&oid=117'
response = requests.get(url)
print(response.content)
The HTML I get (failed):
...
<div id="social-comment">
<ul class="cmt_lst">
<li class="ld"><span>로딩중입니다.</span></li>
</ul>
</div>
...
The HTML I expected:
...
<div id="social-comment">
<div class="sc_cmt_wrp" queryid="C1431107741291317890" style="display: block;">
<div id="tabArea">
<ul class="cmt_tab">
<li style="width:50%" class="on">
<a href="#" class="_tab(commentListPage) _nclicks(rpt.list)">댓글 <span class="_count">22</span></a>
</li>
<li><a href="#" class="_tab(commentWritingPage) _nclicks(rpt.write)">댓글쓰기</a></li>
</ul> </div>
<div id="sortOptionArea" style="display: block;">
<div class="cmt_choice">
<input type="radio" name="scmt-sort" id="newest" class="_scmt_sort(newest) _nclicks(rpt.rct)">
<label for="newest" title="최신순" class="_scmt_sort(newest) _nclicks(rpt.rct) on" onclick="javascript:;">최신순</label>
<input type="radio" name="scmt-sort" id="oldest" class="_scmt_sort(oldest) _nclicks(rpt.old)">
<label for="oldest" title="과거순" class="_scmt_sort(oldest) _nclicks(rpt.old)" onclick="javascript:;">과거순</label>
<input type="radio" name="scmt-sort" id="likability" class="_scmt_sort(likability) _nclicks(rpt.rcm)">
<label for="likability" title="호감순" class="_scmt_sort(likability) _nclicks(rpt.rcm)" onclick="javascript:;">호감순</label>
<input type="radio" name="scmt-sort" id="replycount" class="_scmt_sort(replycount) _nclicks(rpt.rpl)">
<label for="replycount" title="답글많은순" class="_scmt_sort(replycount) _nclicks(rpt.rpl)" onclick="javascript:;">답글많은순</label>
</div> </div>
<div id="noticeArea"></div>
<div id="commentListPaginationArea"> <div class="_noComments _refreshable"></div> <div class="_commentClosed _refreshable"></div> <div id="commentItemArea" class="_refreshable"><ul class="cmt_lst"> <li id="scmt-item-1149361" class=""> <a href="#" class="name _scmt_item(link, userCommentListPage, gno, news117,0002600716, commentNo,1149361) _nclicks(rpt.prf)">ldo1****</a> <p> 서인영 아직도 존예...하....자야는데..</p> <div class="func"> <span class="time">2015.04.28 오후 11:49</span> <span class="mobile">모바일에서 작성</span> | <a href="#" class="action _scmt_item(report,commentNo,1149361) _nclicks(rpt.report)">신고</a> </div> <div class="btn_area2"> <div> <a href="#" class="sc_btn _scmt_item(reply,parentCommentNo,1149361) _nclicks(rpt.reply)">답글 <strong>0</strong></a> </div> <div> <a href="#" id="scmt-good-comment-1149361" class="sc_btn recomm _scmt_item(good,commentNo,1149361) _nclicks(rpt.sym)">10</a> <a href="#" id="scmt-bad-comment-1149361" class="sc_btn recomm2 _scmt_item(bad,commentNo,1149361) _nclicks(rpt.opp)">7</a> </div> </div> </li> <li id="scmt-item-1149360" class=""> <a href="#" class="name _scmt_item(link, userCommentListPage, gno, news117,0002600716, commentNo,1149360) _nclicks(rpt.prf)">dbgh****</a> <p> 1빠 댓글수채우기.</p> <div class="func"> <span class="time">2015.04.28 오후 11:49</span> <span class="mobile">모바일에서 작성</span> | <a href="#" class="action _scmt_item(report,commentNo,1149360) _nclicks(rpt.report)">신고</a> </div> <div class="btn_area2"> <div> <a href="#" class="sc_btn _scmt_item(reply,parentCommentNo,1149360) _nclicks(rpt.reply)">답글 <strong>0</strong></a> </div> <div> <a href="#" id="scmt-good-comment-1149360" class="sc_btn recomm _scmt_item(good,commentNo,1149360) _nclicks(rpt.sym)">1</a> <a href="#" id="scmt-bad-comment-1149360" class="sc_btn recomm2 _scmt_item(bad,commentNo,1149360) _nclicks(rpt.opp)">5</a> </div> </div> </li> </ul></div> <div id="paginationArea" style="display: block;"> <div class="cmt_pg"> <a href="#" class="cmt_pg_btn scmt-page-prev 
_nclicks(rpt.prev)" style="display: inline-block;"><span class="cmt_pg_prev">이전</span></a> <span class="cmt_pg_btn uc_vh scmt-page-prev-off" style="display: none;"><span class="cmt_pg_prev">이전</span></span> <em class="cmt_pg_pg _pageInfo">21 - 22 <span class="u_vc">페이지 </span><span class="cmt_pg_total">/ <span class="u_vc">총 </span>22<span class="u_vc"> 페이지</span></span></em> <span class="cmt_pg_btn uc_vh scmt-page-next-off" style="display: inline-block;"><span class="cmt_pg_next">다음</span></span> <a href="#" class="cmt_pg_btn scmt-page-next _nclicks(rpt.next)" style="display: none;"><span class="cmt_pg_next">다음</span></a> </div></div> </div></div></div>
...
Answer 0 (score: 0)
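The page you are trying to read loads its comments in an XHR request after the page itself has loaded. So unless you use a tool that emulates a full browser (one that executes JavaScript and loads external resources), you will not get the comments.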
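One quick way to confirm this from the static response is to look for the loading placeholder from the question's output. The following is a minimal sketch using only the standard library; the class and function names are made up for illustration, and the check assumes the placeholder is always an `<li class="ld">` item, which is all the question shows.

```python
from html.parser import HTMLParser


class LoadingPlaceholderFinder(HTMLParser):
    """Sets `found` when an <li> whose class list contains "ld" is seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "ld" in classes:
            self.found = True


def has_loading_placeholder(html):
    """Return True if the HTML still contains the 'loading' list item."""
    finder = LoadingPlaceholderFinder()
    finder.feed(html)
    return finder.found


# The fragment returned by requests.get() in the question.
static_html = """<div id="social-comment">
<ul class="cmt_lst">
<li class="ld"><span>로딩중입니다.</span></li>
</ul>
</div>"""

print(has_loading_placeholder(static_html))
```

If this reports the placeholder, the comments have not been rendered into the HTML and have to be fetched some other way.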
The comments are loaded by a POST request sent to http://m.entertain.naver.com/api/comment/list.json, which returns a JSON object containing all the data you are looking for.
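Because it is a POST request, it expects data in the request body. In my tests, the minimum information you need to provide seems to be the fields gno, page, sort, pageSize and serviceId.
Encoded as a URL string (which is how the data is actually sent in a POST request when it is sent urlencoded), this becomes gno=news117%2C0002600716&page=2&sort=newest&pageSize=20&serviceId=news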
For how to make a POST request in Python, see here and here.
For how to parse the returned JSON, see here.
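The steps above can be sketched roughly as follows. This is an illustration, not a tested client: the helper names `build_payload` and `fetch_comments` are made up, the `serviceId` value `news` is an assumption, and the endpoint may have changed or may require extra headers. The structure of the returned JSON is not described here, so the sketch only decodes it and leaves inspection to the caller.

```python
from urllib.parse import urlencode

# Endpoint named in the answer; it may no longer behave as described.
API_URL = "http://m.entertain.naver.com/api/comment/list.json"


def build_payload(gno, page=1, sort="newest", page_size=20, service_id="news"):
    """Return the form fields the answer found to be the minimum required.

    service_id="news" is an assumed value.
    """
    return {
        "gno": gno,
        "page": page,
        "sort": sort,
        "pageSize": page_size,
        "serviceId": service_id,
    }


def fetch_comments(gno, page=1, timeout=10):
    """POST the urlencoded form data and return the decoded JSON object."""
    # Imported here so build_payload can be used without requests installed.
    import requests

    response = requests.post(
        API_URL, data=build_payload(gno, page=page), timeout=timeout
    )
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.json()


# Show the body that requests would send for page 2 of the question's article.
print(urlencode(build_payload("news117,0002600716", page=2)))
```

Calling `fetch_comments("news117,0002600716", page=2)` would then perform the actual request; printing the returned object is an easy way to find where the comment text sits.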