I want to crawl a web page's HTML (all elements) in Python

Asked: 2015-05-08 18:02:17

Tags: python html curl web-crawler wget

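I want to crawl news comments (http://m.entertain.naver.com/comment/list?page=2&gno=news117%2C0002600716&sort=newest&aid=0002600716&oid=117).

However, crawling with requests fails.

The result contains only a placeholder inside <div id="social-comment">, with no comments.

I want to crawl the full HTML of the page (all elements).

This is the source code I tried in Python: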

#-*- coding: utf-8 -*-


import requests

url = 'http://m.entertain.naver.com/comment/list?page=2&gno=news117%2C0002600716&sort=newest&aid=0002600716&oid=117'

response = requests.get(url)

print(response.content)

HTML I actually get (failure):

...
<div id="social-comment">    
<ul class="cmt_lst">
   <li class="ld"><span>로딩중입니다.</span></li>
</ul>    
</div>
...

Expected HTML:

...
<div id="social-comment">
<div class="sc_cmt_wrp" queryid="C1431107741291317890" style="display: block;">    
<div id="tabArea">  
<ul class="cmt_tab">  
<li style="width:50%" class="on">
<a href="#" class="_tab(commentListPage) _nclicks(rpt.list)">댓글 <span class="_count">22</span></a>
</li>  
<li><a href="#" class="_tab(commentWritingPage) _nclicks(rpt.write)">댓글쓰기</a></li>  
</ul> </div> 
<div id="sortOptionArea" style="display: block;">  
<div class="cmt_choice">   
<input type="radio" name="scmt-sort" id="newest" class="_scmt_sort(newest) _nclicks(rpt.rct)">
<label for="newest" title="최신순" class="_scmt_sort(newest) _nclicks(rpt.rct) on" onclick="javascript:;">최신순</label>   
<input type="radio" name="scmt-sort" id="oldest" class="_scmt_sort(oldest) _nclicks(rpt.old)">
<label for="oldest" title="과거순" class="_scmt_sort(oldest) _nclicks(rpt.old)" onclick="javascript:;">과거순</label>    
<input type="radio" name="scmt-sort" id="likability" class="_scmt_sort(likability) _nclicks(rpt.rcm)">
<label for="likability" title="호감순" class="_scmt_sort(likability) _nclicks(rpt.rcm)" onclick="javascript:;">호감순</label>   
<input type="radio" name="scmt-sort" id="replycount" class="_scmt_sort(replycount) _nclicks(rpt.rpl)">
<label for="replycount" title="답글많은순" class="_scmt_sort(replycount) _nclicks(rpt.rpl)" onclick="javascript:;">답글많은순</label>  
</div> </div>
<div id="noticeArea"></div>
<div id="commentListPaginationArea">  <div class="_noComments _refreshable"></div>  <div class="_commentClosed _refreshable"></div>  <div id="commentItemArea" class="_refreshable"><ul class="cmt_lst">                 <li id="scmt-item-1149361" class="">         <a href="#" class="name _scmt_item(link, userCommentListPage, gno, news117,0002600716, commentNo,1149361) _nclicks(rpt.prf)">ldo1****</a>      <p>   서인영 아직도 존예...하....자야는데..</p>   <div class="func">    <span class="time">2015.04.28 오후 11:49</span>         <span class="mobile">모바일에서 작성</span>             | <a href="#" class="action _scmt_item(report,commentNo,1149361) _nclicks(rpt.report)">신고</a>            </div>    <div class="btn_area2">    <div>     <a href="#" class="sc_btn _scmt_item(reply,parentCommentNo,1149361) _nclicks(rpt.reply)">답글 <strong>0</strong></a>         </div>    <div>     <a href="#" id="scmt-good-comment-1149361" class="sc_btn recomm _scmt_item(good,commentNo,1149361) _nclicks(rpt.sym)">10</a>     <a href="#" id="scmt-bad-comment-1149361" class="sc_btn recomm2 _scmt_item(bad,commentNo,1149361) _nclicks(rpt.opp)">7</a>    </div>   </div>    </li>                 <li id="scmt-item-1149360" class="">         <a href="#" class="name _scmt_item(link, userCommentListPage, gno, news117,0002600716, commentNo,1149360) _nclicks(rpt.prf)">dbgh****</a>      <p>   1빠 댓글수채우기.</p>   <div class="func">    <span class="time">2015.04.28 오후 11:49</span>         <span class="mobile">모바일에서 작성</span>             | <a href="#" class="action _scmt_item(report,commentNo,1149360) _nclicks(rpt.report)">신고</a>            </div>    <div class="btn_area2">    <div>     <a href="#" class="sc_btn _scmt_item(reply,parentCommentNo,1149360) _nclicks(rpt.reply)">답글 <strong>0</strong></a>         </div>    <div>     <a href="#" id="scmt-good-comment-1149360" class="sc_btn recomm _scmt_item(good,commentNo,1149360) _nclicks(rpt.sym)">1</a>     <a href="#" id="scmt-bad-comment-1149360" class="sc_btn recomm2 
_scmt_item(bad,commentNo,1149360) _nclicks(rpt.opp)">5</a>    </div>   </div>    </li>  </ul></div>  <div id="paginationArea" style="display: block;">             <div class="cmt_pg">  <a href="#" class="cmt_pg_btn scmt-page-prev _nclicks(rpt.prev)" style="display: inline-block;"><span class="cmt_pg_prev">이전</span></a>  <span class="cmt_pg_btn uc_vh scmt-page-prev-off" style="display: none;"><span class="cmt_pg_prev">이전</span></span>  <em class="cmt_pg_pg _pageInfo">21 - 22 <span class="u_vc">페이지 </span><span class="cmt_pg_total">/ <span class="u_vc">총 </span>22<span class="u_vc"> 페이지</span></span></em>  <span class="cmt_pg_btn uc_vh scmt-page-next-off" style="display: inline-block;"><span class="cmt_pg_next">다음</span></span>  <a href="#" class="cmt_pg_btn scmt-page-next _nclicks(rpt.next)" style="display: none;"><span class="cmt_pg_next">다음</span></a> </div></div> </div></div></div>
...

1 answer:

Answer 0 (score: 0)

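The page you are trying to read loads its comments in an XHR request after the page itself has loaded. So unless you use a tool that emulates a full browser (one that executes JavaScript and loads external resources), you will not get the comments. A plain requests.get only returns the initial HTML, which is why you see the "로딩중입니다." ("Loading") placeholder.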
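
One way to confirm this from the static response is to check whether the comment list holds only the placeholder item. The sketch below uses only the standard library and relies on the class names shown in the question (cmt_lst for the list, ld for the "Loading" item); the names CommentListProbe and only_placeholder are hypothetical helpers, not part of any library.

```python
from html.parser import HTMLParser


class CommentListProbe(HTMLParser):
    """Collect the class attribute of every <li> inside <ul class="cmt_lst">."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # how many <ul> levels deep we are inside the comment list
        self.li_classes = []  # class attribute of each <li> found there

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "ul" and (self.depth or "cmt_lst" in classes):
            self.depth += 1
        elif tag == "li" and self.depth:
            self.li_classes.append(attrs.get("class") or "")

    def handle_endtag(self, tag):
        if tag == "ul" and self.depth:
            self.depth -= 1


def only_placeholder(html):
    """Return True if the comment list contains just the 'ld' (loading) item."""
    probe = CommentListProbe()
    probe.feed(html)
    return probe.li_classes == ["ld"]
```

Calling only_placeholder(response.text) on the response from the question would be expected to return True, since that HTML contains only the loading item.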

The comments are loaded by a POST request sent to http://m.entertain.naver.com/api/comment/list.json, which returns a JSON object containing all the data you are looking for.

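Because it is a POST request, it expects you to send data with it. In my tests, the minimum information you need to provide appears to be:

  • gno: news117,0002600716
  • pageSize: 2000 (20 is the default. In most cases 2,000 will probably return all the comments, but adjust it as needed.)
  • sort: newest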

Encoded as a URL string (which is how the data is actually sent in a urlencoded POST request), this becomes gno=news117%2C0002600716&page=2&sort=newest&pageSize=20&serviceId=news
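
The request described above could be sketched roughly as follows. The endpoint and parameter names are taken from this answer. Whether the server accepts exactly these fields, or needs extra headers or cookies, is an assumption. build_payload and fetch_comments are hypothetical helper names; requests is imported inside fetch_comments so that the payload helper can be used without it installed.

```python
from urllib.parse import urlencode

# Endpoint quoted in the answer; it may have changed since.
API_URL = "http://m.entertain.naver.com/api/comment/list.json"


def build_payload(gno, page=1, sort="newest", page_size=20, service_id="news"):
    """Build the form fields for the comment-list request."""
    return {
        "gno": gno,
        "page": page,
        "sort": sort,
        "pageSize": page_size,
        "serviceId": service_id,
    }


def fetch_comments(gno, **kwargs):
    """POST the form fields and return the decoded JSON response."""
    import requests  # third-party: pip install requests

    # Passing a dict as data= makes requests send it urlencoded.
    response = requests.post(API_URL, data=build_payload(gno, **kwargs), timeout=10)
    response.raise_for_status()
    return response.json()


# The urlencoded form of the payload matches the string shown above:
encoded = urlencode(build_payload("news117,0002600716", page=2))
# encoded == "gno=news117%2C0002600716&page=2&sort=newest&pageSize=20&serviceId=news"
```

A call such as fetch_comments("news117,0002600716", page_size=2000) would then request up to 2,000 comments in one response, assuming the server honors that page size.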

For how to make a POST request in Python, see here and here.

For how to parse the returned JSON, see here.
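
The structure of the returned JSON is not shown in this answer, so a reasonable first step is to decode it and inspect its shape before extracting fields. The sketch below uses only the standard json module. The describe helper and the sample document are hypothetical and do not reflect the real response format.

```python
import json


def describe(value, indent=0, max_depth=3):
    """Return lines outlining the keys and types of a decoded JSON value."""
    pad = "  " * indent
    if indent >= max_depth:
        return [pad + "..."]
    lines = []
    if isinstance(value, dict):
        for key, child in value.items():
            lines.append(f"{pad}{key}: {type(child).__name__}")
            if isinstance(child, (dict, list)):
                lines.extend(describe(child, indent + 1, max_depth))
    elif isinstance(value, list):
        if not value:
            return [pad + "(empty list)"]
        # Describe only the first element; list items usually share one shape.
        lines.append(f"{pad}[0]: {type(value[0]).__name__}")
        if isinstance(value[0], (dict, list)):
            lines.extend(describe(value[0], indent + 1, max_depth))
    else:
        lines.append(pad + type(value).__name__)
    return lines


# Hypothetical sample, NOT the real response format.
sample = '{"total": 2, "items": [{"id": 1, "text": "hi"}]}'
data = json.loads(sample)  # with requests, response.json() does the same decoding
outline = describe(data)
```

Printing "\n".join(outline) for the real response shows which keys hold the comment list, after which the fields can be read with ordinary dict and list indexing.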