Web scraping Google Images in R

Time: 2016-12-21 07:51:49

Tags: html r image xpath web-scraping

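I have used the "rvest" package for web scraping for several different purposes. Now I need to use it to get the source of an image object (png) from Google Images. I tried the solution at this link: Web scraping of image. It matches exactly what I need, so I came up with the code below, but my html_nodes call returns an empty object.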

library("rvest")
page <- read_html("https://www.google.com.tr/search?q=manitou&espv=2&biw=1366&bih=662&source=lnms&tbm=isch&sa=X&ved=0ahUKEwjCnJ6H2ITRAhWCQBoKHfQ5DUAQ_AUIBigB#tbm=isch&q=apple+logo+png")
node <- html_nodes(page,xpath='//*[@id="rg_s"]/div[1]/a/img')
src <-  html_attr(node,"src")

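I also tried a CSS selector and the image name, as was done in the link I gave above. My node object is empty either way. I should also point out that I want to scrape the source of the first image on the page, which is the one the XPath above points to. Thanks in advance.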

1 Answer:

Answer 0 (score: 4)

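I think it works fine; you just have not fully understood how the document is structured, i.e. there may be no node that matches the XPath selector you wrote.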
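One way to check this is to count how many nodes each selector matches. The following is a minimal sketch, assuming a small hand-written HTML fragment (hypothetical, not Google's real markup) in place of the downloaded page, so that it runs without a network connection:

```r
library(rvest)

# Hypothetical stand-in for the HTML that read_html() downloads;
# this is NOT Google's actual markup, only an illustration.
html <- '<html><body>
  <div id="search">
    <img src="/textinputassistant/tia.png" alt="">
    <img src="https://example.com/thumb1.jpg" alt="result 1">
  </div>
</body></html>'
page <- read_html(html)

# The XPath from the question refers to an id that does not occur
# in this document, so the resulting node set is empty.
missing <- html_nodes(page, xpath = '//*[@id="rg_s"]/div[1]/a/img')
n_missing <- length(missing)

# A broader selector shows which <img> nodes are actually present.
present <- html_nodes(page, xpath = '//img')
n_present <- length(present)

print(c(missing = n_missing, present = n_present))
```

An empty node set is not an error in rvest; html_attr() on it simply returns an empty character vector, which is why the code in the question fails silently.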

For example, I select all <img> nodes and print them:

library("rvest")
page <- read_html("https://www.google.com.tr/search?q=manitou&espv=2&biw=1366&bih=662&source=lnms&tbm=isch&sa=X&ved=0ahUKEwjCnJ6H2ITRAhWCQBoKHfQ5DUAQ_AUIBigB#tbm=isch&q=apple+logo+png")
node <- html_nodes(page,xpath = '//img')
node

This gives the following output:

{xml_nodeset (21)}
 [1] <img style="padding-top:2px" src="/textinputassistant/tia.png" onclick="(function(){var text_input_assistant_js='/textinputassistant/11/tr_tia.js';var s = document.createElement('s ...
 [2] <img height="113" src="https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcRg92_01ZbpYpV_agaHP4M3GoRoaCsZW5Sym8eqcXG8M1iJ8Nag1SXufq8" width="150" alt="manitou ile ilgili görsel s ...
 [3] <img height="98" src="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcSbJOecoEPbrJjZ-TjJMgMwlulXRMPLBWZX45vwUJNVXZk5MeY1chaZ07Y" width="143" alt="manitou ile ilgili görsel so ...
 [4] <img height="79" src="https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcStpgymO--9B7R3O3OZJFrDsuOUuP94HwwNw-av9tUyjziG3sCl6M9s7G4" width="141" alt="manitou ile ilgili görsel so ...
 [5] <img height="95" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTkibMqBWEifcyw_d-vrNob6UqYP-hDFPoQG2pkzVsP5bgmbReFWqyHjWA" width="143" alt="manitou ile ilgili görsel so ...
 [6] <img height="91" src="https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcRhqrV1f--7QrQwovNBUHIpDFHe8Zwwad3UIvnwppv74GRIrsI1XYNPkFOg" width="150" alt="manitou ile ilgili görsel s ...
 [7] <img height="112" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS1gpUEBucliP4WK2_22K4wElI2lIrDs2PZT7sRCLXK1Yxjg7DoQ2BtyLat" width="142" alt="manitou ile ilgili görsel  ...
 [8] <img height="69" src="https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSssiUhuZe_1YmQ9dwmYHdKoFXyQBj9IQPGX_LU8msjekOvRRHDG9FmoaD_" width="140" alt="manitou ile ilgili görsel s ...
 [9] <img height="113" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTCM9Mu6K63QpzNk20HFrHkybi--dw3JPu5JDd4LSEqz3UT5TBU5I0owLU" width="150" alt="manitou ile ilgili görsel s ...
[10] <img height="95" src="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcS8sI3fBSJmjftqC9Rx2bhXh_xgP3-nS2WuD2as9U_87SLxggQvmo2awDk" width="143" alt="manitou ile ilgili görsel so ...
[11] <img height="83" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT-gf45JbC4Q4lD3hioj_CP6imrO5RUWBeW6IuygNaN8LM1qydX56l5gFx4" width="148" alt="manitou ile ilgili görsel s ...
[12] <img height="84" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS6tnxPJYeS48IoNAlN0D52U5TNjmq7Ta-GcPNifM4_k40Y2D8LDj5-e-Wz" width="150" alt="manitou ile ilgili görsel s ...
[13] <img height="140" src="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcTwmI9PxfLBT2dCPnR04I9pXmK8V9whAI2yEv4dX5qQq8G_JxHUAOwQB1mSTg" width="140" alt="manitou ile ilgili görse ...
[14] <img height="71" src="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQNx2Pe1AZtT-0XQ44HSurWO6O2syXrXG6YPfggtZsTHaf6YXuQlcmMOu0" width="150" alt="manitou ile ilgili görsel so ...
[15] <img height="130" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRGACLfeRm6U0xwSeYncSUDQtcd4noTewVF4aGnQcgz6TWYwwr917mjEtB6" width="113" alt="manitou ile ilgili görsel  ...
[16] <img height="107" src="https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQ1RwAscQpzVXfquuAoPaLE9hFMuZSOpo6ckOzdpkTmg3KiswOIZIDTqrU" width="143" alt="manitou ile ilgili görsel s ...
[17] <img height="98" src="https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcTE5sLf71TxAYla6nlfLRgXwL1IC-gXzXQRq1ZcnB21c5NXmQklJyNeqEs" width="148" alt="manitou ile ilgili görsel so ...
[18] <img height="91" src="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRHQjJ-Hc0Muy6Vjw5OlQZocflSCqR3oz0GBRu3Bs7_JCoNyjr5vjNP7KZ4" width="137" alt="manitou ile ilgili görsel s ...
[19] <img height="68" src="https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcR8R_39V3bxWJUDdNhrsAS6YOYEg6U-QpaLEV0MQ5GBnVkeZa9lSB5MaGU" width="149" alt="manitou ile ilgili görsel so ...
[20] <img height="99" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTIrnwcUbo9WYT-gyvrLb5g4JFEc27odkzzU6SwzxrxvrsajRMD1OroUaY" width="116" alt="manitou ile ilgili görsel so ...
...

Here is the first node:

>node[[1]]
{xml_node} <img style="padding-top:2px"
src="/textinputassistant/tia.png" onclick="(function(){var
   text_input_assistant_js='/textinputassistant/11/tr_tia.js';var s =
  document.createElement('script');s.src =
  text_input_assistant_js;(document.getElementById('xjsc')||
  document.body).appendChild(s);})();" 
   alt="" height="23" width="27">
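The first node is therefore a page interface icon with a relative src, not a search result. To get the source of the first result image, one option is to extract every src and keep only the absolute URLs. The sketch below assumes a hand-written fragment that mimics the shape of the output above; the `tbn:EXAMPLE…` values are placeholders, and `thumbs` and `first_thumb` are illustrative names only:

```r
library(rvest)

# Hypothetical fragment shaped like the printed node set above;
# the tbn:EXAMPLE values are placeholders, not real image ids.
html <- '<html><body>
  <img src="/textinputassistant/tia.png" alt="">
  <img src="https://encrypted-tbn2.gstatic.com/images?q=tbn:EXAMPLE1" alt="result 1">
  <img src="https://encrypted-tbn1.gstatic.com/images?q=tbn:EXAMPLE2" alt="result 2">
</body></html>'
page <- read_html(html)

# Collect the src attribute of every <img> node.
src <- html_attr(html_nodes(page, xpath = '//img'), "src")

# Keep only absolute URLs, which drops the relative-path UI icon.
thumbs <- src[grepl("^https?://", src)]

# The first remaining entry is the first result image.
first_thumb <- thumbs[1]
print(first_thumb)
```

Judging by the gstatic.com hosts and the small width and height values in the output above, these src values appear to be thumbnails rather than the original png files.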