如何在抓取时在HTML页面中获取评论?

时间:2018-02-18 06:50:51

标签: python html selenium web-scraping scrapy

这是问题所在。我试图抓住这个Facebook关于生日日期的页面,当我在浏览器中看到页面源时,它会在div类名class="hidden_elem"内的html中显示生日日期作为评论。< / p>

这可能就是这个问题,当我在使用(selenium,scrapy,requests)获取请求中看到此页面的源代码时,我只得div class="hidden_elem"并发表评论无处可见,更不用说解析信息了。

那么如何获取此文本,如果可能,请说明如何获取生日日期。

可能有一些javascript的东西在Facebook页面上通过设计巧妙地导致了这一点。怎么绕过这个?

这是我试图获取生日日期的网址。 https://www.facebook.com/profile.php?id=100004456147835&sk=about

从浏览器的源页面看起来像这样: -

<div class="hidden_elem"><code id="u_0_2g"><!-- <ul class="uiList _54nz _4kg _4kt" data-pnref="about"><li><div class="_5aj7"><div class="_4bl9"><div class="_54n- _2pi3"><div id="u_0_2e"></div></div></div><div class="_4bl7"><div class="_4ms4" id="u_0_2a"><div class="clearfix _ikh _5c0g" data-pnref="overview" id="u_0_2f"><div class="_4bl7"><ul class="uiList _1pi3 _4kg _6-h _703 _4ks"><li class="_3pw9 _2pi4"><div class="clearfix _4bbo" role="button" tabindex="0"><div class="_5rsw _3-91 _8o lfloat _ohe"><i class="_5rsx img sp_yw06AF9sktb sx_344683"></i></div><div class="_42ef"><div class="_6a"><div class="_6a _6b" style="height:36px"></div><div class="_6a _6b"><span class="_50f8 _2iem">No workplaces to show</span></div></div></div></div></li><li id="u_0_2b"><div class="clearfix _5y02" data-overviewsection="education" role="button" tabindex="0"><a class="_5uat _3-91 _8o lfloat _ohe" tabindex="-1" aria-hidden="true" href="https://www.facebook.com/pages/Cambridge-Institute-of-technolagy/133870693705509" data-hovercard="/ajax/hovercard/page.php?id=133870693705509" data-hovercard-prefer-more-content-show="1"><img class="_s0 _4ooo _54ru img" src="https://scontent.fblr6-1.fna.fbcdn.net/v/t1.0-1/c9.0.32.32/p32x32/580846_10149999285985791_1565762244_n.png?oh=d4ccc6a667e53f20db9cf60c0742f989&amp;oe=5B1420C5" alt="" aria-label="Cambridge Institute of technolagy" role="img" /></a><div class="_42ef"><div class="_6a _5u5j _6b"><div class="_c24 _50f4">Studies at <a class="profileLink" href="https://www.facebook.com/pages/Cambridge-Institute-of-technolagy/133870693705509" data-hovercard="/ajax/hovercard/page.php?id=133870693705509" data-hovercard-prefer-more-content-show="1">Cambridge Institute of technolagy</a></div><div><div><div class="_50f8 _2ieq"><div class="fsm fwn fcg">Past: <a class="profileLink" href="https://www.facebook.com/deekshaintegrated/" data-hovercard="/ajax/hovercard/page.php?id=176180289071224" data-hovercard-prefer-more-content-show="1">Deeksha Integrated</a> and <a class="profileLink" href="https://www.facebook.com/pages/chethana-vidya-mandiratumkur/378826618888908" data-hovercard="/ajax/hovercard/page.php?id=378826618888908" data-hovercard-prefer-more-content-show="1">chethana vidya mandira,tumkur</a></div></div></div></div></div></div></div></li><li id="u_0_2c"><div class="clearfix _5y02" data-overviewsection="places" role="button" tabindex="0"><a class="_5uat _3-91 _8o lfloat _ohe" tabindex="-1" aria-hidden="true" href="https://www.facebook.com/pages/Bangalore-India/106377336067638" data-hovercard="/ajax/hovercard/page.php?id=106377336067638" data-hovercard-prefer-more-content-show="1"><img class="_s0 _4ooo _54ru img" src="https://external.fblr6-1.fna.fbcdn.net/safe_image.php?d=AQCKH3kcP1-A2NPe&amp;w=32&amp;h=32&amp;url=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2F8%2F80%2FBangaloreMontage.png&amp;cfs=1&amp;fallback=hub_city&amp;f&amp;_nc_hash=AQDbJ1ytdhSz3E8E" alt="" aria-label="Bangalore, India" role="img" /></a><div class="_42ef"><div class="_6a _5u5j _6b"><div class="_c24 _50f4">Lives in <a class="profileLink" href="https://www.facebook.com/pages/Bangalore-India/106377336067638" data-hovercard="/ajax/hovercard/page.php?id=106377336067638" data-hovercard-prefer-more-content-show="1">Bangalore, India</a></div><div><div><div class="_50f8 _2ieq"><div class="fsm fwn fcg"><span id="u_0_2d">From <span class="fwb"><a class="profileLink" href="https://www.facebook.com/pages/Tumkur/106525352717093" data-hovercard="/ajax/hovercard/page.php?id=106525352717093" data-hovercard-prefer-more-content-show="1">Tumkur</a></span></span></div></div></div></div></div></div></div></li><li class="_3pw9 _2pi4"><div class="clearfix _4bbo" role="button" tabindex="0"><div class="_5rsw _3-91 _8o lfloat _ohe"><i class="_5rsx img sp_yw06AF9sktb sx_585866"></i></div><div class="_42ef"><div class="_6a"><div class="_6a _6b" style="height:36px"></div><div class="_6a _6b"><span class="_50f8 _2iem">No relationship info to show</span></div></div></div></div></li></ul></div><div class="_4bl9 _zu9"><ul class="uiList _5yql _4kg" data-overviewsection="contact_basic" role="button" tabindex="0"><li class="_4tnv _2pif"><div class="clearfix _ikh"><div class="_4bl7"><div class="_pvf _5pmc"><i class="img sp_yw06AF9sktb sx_e0cf75"></i></div></div><div class="_4bl9 _2pis _2dbl"><span class="_c24 _2ieq"><div><span class="accessible_elem">Birthday</span></div><div>April 28, 1998</div></span></div></div></li></ul></div></div></div></div></div></li></ul> --></code></div>

当我从我的脚本中获取页面源时,只有<div class="hidden_elem"> </div>这个来了。

2 个答案:

答案 0 :(得分:2)

使用BeautifulSoup,你可以做到这一点

试试这个: -

from bs4 import BeautifulSoup, Comment
soup = BeautifulSoup(html, 'lxml')
for comment in soup.findAll(text=lambda text:isinstance(text,Comment)):
    print (comment)

答案 1 :(得分:1)

您需要使用以下命令向下滚动页面

String s = "window.scrollBy(0,document.body.scrollHeight || document.documentElement.scrollHeight)";
            ScriptResult sr = page.executeJavaScript(s);
            LOG.info("Result= " + sr.getJavaScriptResult());

之后,您将能够获取对象的“ hidden_​​elem”列表:

String xpathHiddenElem = "//div[contains(@class, 'hidden_elem')]";
List<Object> responseHiddenElem = page.getByXPath(xpathHiddenElem);
LOG.info("responseHiddenElem: {}", responseHiddenElem);
if (responseHiddenElem != null && responseHiddenElem.size() > 0) {
    for (Object element : responseHiddenElem) {
        HtmlDivision elementCasted = (HtmlDivision) element;
        LOG.info("elementContent: {}", elementCasted.getTextContent());
        LOG.info("elementContent: {}", elementCasted.asText());
        LOG.info("elementContent: {}", elementCasted.getTagName());
        LOG.info("elementContent: {}", elementCasted.getIndex());
    }
}