Web scraping - data extraction - Web Scraper Google Chrome extension

Time: 2019-11-28 17:51:17

Tags: web-scraping google-chrome-extension screen-scraping data-extraction

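Good afternoon,

I am trying to extract all products (name, price, image) from a grocery store.

I am using Web Scraper (the Google Chrome extension). When I start scraping I can see that it is running, but it does not return any data. When I click Data preview I can see the data, yet I keep getting the message that no data has been scraped.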

Here is the sitemap I created: {"_id":"collectandgo","startUrl":["https://colruyt.collectandgo.be/cogo/nl/home"],"selectors":[{"id":"categories","type":"SelectorLink","parentSelectors":["_root"],"selector":"div#arbo.nav__branch.branch","multiple":true,"delay":0},{"id":"items","type":"SelectorElement","parentSelectors":["categories"],"selector":"div.product__inner","multiple":true,"delay":0},{"id":"productbody","type":"SelectorElement","parentSelectors":["items"],"selector":"div.product__body","multiple":true,"delay":0},{"id":"image","type":"SelectorImage","parentSelectors":["productbody"],"selector":"a.product__image","multiple":false,"delay":0},{"id":"productname","type":"SelectorText","parentSelectors":["productbody"],"selector":"div.product__name","multiple":false,"regex":"","delay":0},{"id":"productdescription","type":"SelectorText","parentSelectors":["productbody"],"selector":"div.product__description","multiple":false,"regex":"","delay":0},{"id":"productweight","type":"SelectorText","parentSelectors":["productbody"],"selector":"div.product__weight","multiple":false,"regex":"","delay":0},{"id":"prijs","type":"SelectorText","parentSelectors":["productbody"],"selector":"div.product__price-piece","multiple":false,"regex":"","delay":0},{"id":"eenheidsprijs","type":"SelectorText","parentSelectors":["productbody"],"selector":"div.product__price-unit","multiple":false,"regex":"","delay":0},{"id":"korting-aankoop-hoeveelheid","type":"SelectorText","parentSelectors":["productbody"],"selector":"a.promotion__min-amount","multiple":false,"regex":"","delay":0}]}
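
A sitemap like this is a tree: each selector names its parent in parentSelectors, starting from _root. One way to sanity-check that structure outside the extension is to parse the JSON and confirm that every parent refers to a selector that actually exists. The following is a minimal sketch in Python, not part of Web Scraper itself. The helper names missing_parents and selector_paths are hypothetical, and the embedded sitemap is a trimmed copy (four of the ten selectors) used only for illustration. It checks the structure only and cannot tell whether the CSS selectors match anything on the live page.

```python
import json

# Trimmed copy of the sitemap (four of the ten selectors), for illustration only.
SITEMAP = json.loads("""
{"_id": "collectandgo",
 "startUrl": ["https://colruyt.collectandgo.be/cogo/nl/home"],
 "selectors": [
  {"id": "categories", "type": "SelectorLink", "parentSelectors": ["_root"],
   "selector": "div#arbo.nav__branch.branch", "multiple": true, "delay": 0},
  {"id": "items", "type": "SelectorElement", "parentSelectors": ["categories"],
   "selector": "div.product__inner", "multiple": true, "delay": 0},
  {"id": "productbody", "type": "SelectorElement", "parentSelectors": ["items"],
   "selector": "div.product__body", "multiple": true, "delay": 0},
  {"id": "productname", "type": "SelectorText", "parentSelectors": ["productbody"],
   "selector": "div.product__name", "multiple": false, "regex": "", "delay": 0}
 ]}
""")


def missing_parents(sitemap):
    """Return (selector id, parent id) pairs whose parent is not defined."""
    known = {"_root"} | {s["id"] for s in sitemap["selectors"]}
    return [
        (s["id"], parent)
        for s in sitemap["selectors"]
        for parent in s["parentSelectors"]
        if parent not in known
    ]


def selector_paths(sitemap):
    """Map each selector id to its ancestor chain, following the first parent."""
    by_id = {s["id"]: s for s in sitemap["selectors"]}
    paths = {}
    for sid in by_id:
        chain, current, seen = [sid], sid, set()
        # Walk upwards until we leave the defined selectors or detect a cycle.
        while current in by_id and current not in seen:
            seen.add(current)
            current = by_id[current]["parentSelectors"][0]
            chain.append(current)
        paths[sid] = list(reversed(chain))
    return paths


if __name__ == "__main__":
    print("missing parents:", missing_parents(SITEMAP))
    for sid, path in selector_paths(SITEMAP).items():
        print(sid, "<-", " > ".join(path))
```

If missing_parents returns an empty list, the parent references are at least internally consistent, and the remaining suspects are the CSS selectors themselves or the way the page loads its content.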

1 Answer:

Answer 0 (score: 0)

I copied your JSON, validated it, saved it to the file stack.json, and then, after setting the parser to JSON, loaded it into a BaseX database named foo, as follows:

thufir@dur:~/json$ 
thufir@dur:~/json$ basex
BaseX 9.0.1 [Standalone]
Try 'help' to get more information.
> 
> list
Name                 Resources  Size    Input Path                               
-------------------------------------------------------------------------------
com.w3schools.books  1          6290    https://www.w3schools.com/xml/books.xml  
twitter              75         457900                                           
w3school_data        1          5209    https://www.w3schools.com/xml/note.xml   

3 database(s).
> 
> create database foo
Database 'foo' created in 138.51 ms.
> 
> set parser json
PARSER: json
> 
> add stack.json
Resource(s) added in 74.72 ms.
> 
> list
Name                 Resources  Size    Input Path                               
-------------------------------------------------------------------------------
com.w3schools.books  1          6290    https://www.w3schools.com/xml/books.xml  
foo                  1          5600                                             
twitter              75         457900                                           
w3school_data        1          5209    https://www.w3schools.com/xml/note.xml   

4 database(s).
> 
> open foo
Database 'foo' was opened in 0.04 ms.
> 
> xquery /
<json type="object">
  <__id>collectandgo</__id>
  <startUrl type="array">
    <_>https://colruyt.collectandgo.be/cogo/nl/home</_>
  </startUrl>
  <selectors type="array">
    <_ type="object">
      <id>categories</id>
      <type>SelectorLink</type>
      <parentSelectors type="array">
        <_>_root</_>
      </parentSelectors>
      <selector>div#arbo.nav__branch.branch</selector>
      <multiple type="boolean">true</multiple>
      <delay type="number">0</delay>
    </_>
    <_ type="object">
      <id>items</id>
      <type>SelectorElement</type>
      <parentSelectors type="array">
        <_>categories</_>
      </parentSelectors>
      <selector>div.product__inner</selector>
      <multiple type="boolean">true</multiple>
      <delay type="number">0</delay>
    </_>
    <_ type="object">
      <id>productbody</id>
      <type>SelectorElement</type>
      <parentSelectors type="array">
        <_>items</_>
      </parentSelectors>
      <selector>div.product__body</selector>
      <multiple type="boolean">true</multiple>
      <delay type="number">0</delay>
    </_>
    <_ type="object">
      <id>image</id>
      <type>SelectorImage</type>
      <parentSelectors type="array">
        <_>productbody</_>
      </parentSelectors>
      <selector>a.product__image</selector>
      <multiple type="boolean">false</multiple>
      <delay type="number">0</delay>
    </_>
    <_ type="object">
      <id>productname</id>
      <type>SelectorText</type>
      <parentSelectors type="array">
        <_>productbody</_>
      </parentSelectors>
      <selector>div.product__name</selector>
      <multiple type="boolean">false</multiple>
      <regex/>
      <delay type="number">0</delay>
    </_>
    <_ type="object">
      <id>productdescription</id>
      <type>SelectorText</type>
      <parentSelectors type="array">
        <_>productbody</_>
      </parentSelectors>
      <selector>div.product__description</selector>
      <multiple type="boolean">false</multiple>
      <regex/>
      <delay type="number">0</delay>
    </_>
    <_ type="object">
      <id>productweight</id>
      <type>SelectorText</type>
      <parentSelectors type="array">
        <_>productbody</_>
      </parentSelectors>
      <selector>div.product__weight</selector>
      <multiple type="boolean">false</multiple>
      <regex/>
      <delay type="number">0</delay>
    </_>
    <_ type="object">
      <id>prijs</id>
      <type>SelectorText</type>
      <parentSelectors type="array">
        <_>productbody</_>
      </parentSelectors>
      <selector>div.product__price-piece</selector>
      <multiple type="boolean">false</multiple>
      <regex/>
      <delay type="number">0</delay>
    </_>
    <_ type="object">
      <id>eenheidsprijs</id>
      <type>SelectorText</type>
      <parentSelectors type="array">
        <_>productbody</_>
      </parentSelectors>
      <selector>div.product__price-unit</selector>
      <multiple type="boolean">false</multiple>
      <regex/>
      <delay type="number">0</delay>
    </_>
    <_ type="object">
      <id>korting-aankoop-hoeveelheid</id>
      <type>SelectorText</type>
      <parentSelectors type="array">
        <_>productbody</_>
      </parentSelectors>
      <selector>a.promotion__min-amount</selector>
      <multiple type="boolean">false</multiple>
      <regex/>
      <delay type="number">0</delay>
    </_>
  </selectors>
</json>
Query executed in 270.99 ms.
> 

What query do you want to run against the data?
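
As an illustration of the kind of query that is possible on this structure, the sketch below lists each selector's id together with its CSS selector. It deliberately does not use BaseX: it runs Python's standard xml.etree.ElementTree over a trimmed copy of the XML output above (two of the ten selectors), and the function name css_by_id is hypothetical. Inside BaseX, a path expression along the lines of /json/selectors/_/selector should address the same elements.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML that BaseX produced from the sitemap (illustration only).
XML = """
<json type="object">
  <__id>collectandgo</__id>
  <selectors type="array">
    <_ type="object">
      <id>categories</id>
      <type>SelectorLink</type>
      <selector>div#arbo.nav__branch.branch</selector>
    </_>
    <_ type="object">
      <id>productname</id>
      <type>SelectorText</type>
      <selector>div.product__name</selector>
    </_>
  </selectors>
</json>
"""


def css_by_id(xml_text):
    """Return a dict mapping each selector id to its CSS selector string."""
    root = ET.fromstring(xml_text.strip())
    # Each array entry is an element named "_" under <selectors>.
    return {
        item.findtext("id"): item.findtext("selector")
        for item in root.findall("./selectors/_")
    }


if __name__ == "__main__":
    for sid, css in css_by_id(XML).items():
        print(f"{sid}: {css}")
```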

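You may also want to look into Selenium or other tools for scraping the data. Selenium can locate elements with XPath, BaseX runs XQuery (which builds on XPath), and both provide a Java API.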