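Good afternoon,
I am trying to extract all products (name, price, image) from a grocery store's website.
I am using Web Scraper (a Google Chrome extension). When I start scraping, I can see that it is running, but it returns no data. When I click Data preview, I can see the data. However, I keep getting a message saying that no data was scraped.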
Here is the sitemap I created:
{"_id":"collectandgo","startUrl":["https://colruyt.collectandgo.be/cogo/nl/home"],"selectors":[
{"id":"categories","type":"SelectorLink","parentSelectors":["_root"],"selector":"div#arbo.nav__branch.branch","multiple":true,"delay":0},
{"id":"items","type":"SelectorElement","parentSelectors":["categories"],"selector":"div.product__inner","multiple":true,"delay":0},
{"id":"productbody","type":"SelectorElement","parentSelectors":["items"],"selector":"div.product__body","multiple":true,"delay":0},
{"id":"image","type":"SelectorImage","parentSelectors":["productbody"],"selector":"a.product__image","multiple":false,"delay":0},
{"id":"productname","type":"SelectorText","parentSelectors":["productbody"],"selector":"div.product__name","multiple":false,"regex":"","delay":0},
{"id":"productdescription","type":"SelectorText","parentSelectors":["productbody"],"selector":"div.product__description","multiple":false,"regex":"","delay":0},
{"id":"productweight","type":"SelectorText","parentSelectors":["productbody"],"selector":"div.product__weight","multiple":false,"regex":"","delay":0},
{"id":"prijs","type":"SelectorText","parentSelectors":["productbody"],"selector":"div.product__price-piece","multiple":false,"regex":"","delay":0},
{"id":"eenheidsprijs","type":"SelectorText","parentSelectors":["productbody"],"selector":"div.product__price-unit","multiple":false,"regex":"","delay":0},
{"id":"korting-aankoop-hoeveelheid","type":"SelectorText","parentSelectors":["productbody"],"selector":"a.promotion__min-amount","multiple":false,"regex":"","delay":0}
]}
Answer 0 (score: 0)
I copied your JSON and validated it, then saved it to the file stack.json. After setting the parser to JSON, I loaded that file into the BaseX database foo, as follows:
thufir@dur:~/json$
thufir@dur:~/json$ basex
BaseX 9.0.1 [Standalone]
Try 'help' to get more information.
>
> list
Name Resources Size Input Path
-------------------------------------------------------------------------------
com.w3schools.books 1 6290 https://www.w3schools.com/xml/books.xml
twitter 75 457900
w3school_data 1 5209 https://www.w3schools.com/xml/note.xml
3 database(s).
>
> create database foo
Database 'foo' created in 138.51 ms.
>
> set parser json
PARSER: json
>
> add stack.json
Resource(s) added in 74.72 ms.
>
> list
Name Resources Size Input Path
-------------------------------------------------------------------------------
com.w3schools.books 1 6290 https://www.w3schools.com/xml/books.xml
foo 1 5600
twitter 75 457900
w3school_data 1 5209 https://www.w3schools.com/xml/note.xml
4 database(s).
>
> open foo
Database 'foo' was opened in 0.04 ms.
>
> xquery /
<json type="object">
<__id>collectandgo</__id>
<startUrl type="array">
<_>https://colruyt.collectandgo.be/cogo/nl/home</_>
</startUrl>
<selectors type="array">
<_ type="object">
<id>categories</id>
<type>SelectorLink</type>
<parentSelectors type="array">
<_>_root</_>
</parentSelectors>
<selector>div#arbo.nav__branch.branch</selector>
<multiple type="boolean">true</multiple>
<delay type="number">0</delay>
</_>
<_ type="object">
<id>items</id>
<type>SelectorElement</type>
<parentSelectors type="array">
<_>categories</_>
</parentSelectors>
<selector>div.product__inner</selector>
<multiple type="boolean">true</multiple>
<delay type="number">0</delay>
</_>
<_ type="object">
<id>productbody</id>
<type>SelectorElement</type>
<parentSelectors type="array">
<_>items</_>
</parentSelectors>
<selector>div.product__body</selector>
<multiple type="boolean">true</multiple>
<delay type="number">0</delay>
</_>
<_ type="object">
<id>image</id>
<type>SelectorImage</type>
<parentSelectors type="array">
<_>productbody</_>
</parentSelectors>
<selector>a.product__image</selector>
<multiple type="boolean">false</multiple>
<delay type="number">0</delay>
</_>
<_ type="object">
<id>productname</id>
<type>SelectorText</type>
<parentSelectors type="array">
<_>productbody</_>
</parentSelectors>
<selector>div.product__name</selector>
<multiple type="boolean">false</multiple>
<regex/>
<delay type="number">0</delay>
</_>
<_ type="object">
<id>productdescription</id>
<type>SelectorText</type>
<parentSelectors type="array">
<_>productbody</_>
</parentSelectors>
<selector>div.product__description</selector>
<multiple type="boolean">false</multiple>
<regex/>
<delay type="number">0</delay>
</_>
<_ type="object">
<id>productweight</id>
<type>SelectorText</type>
<parentSelectors type="array">
<_>productbody</_>
</parentSelectors>
<selector>div.product__weight</selector>
<multiple type="boolean">false</multiple>
<regex/>
<delay type="number">0</delay>
</_>
<_ type="object">
<id>prijs</id>
<type>SelectorText</type>
<parentSelectors type="array">
<_>productbody</_>
</parentSelectors>
<selector>div.product__price-piece</selector>
<multiple type="boolean">false</multiple>
<regex/>
<delay type="number">0</delay>
</_>
<_ type="object">
<id>eenheidsprijs</id>
<type>SelectorText</type>
<parentSelectors type="array">
<_>productbody</_>
</parentSelectors>
<selector>div.product__price-unit</selector>
<multiple type="boolean">false</multiple>
<regex/>
<delay type="number">0</delay>
</_>
<_ type="object">
<id>korting-aankoop-hoeveelheid</id>
<type>SelectorText</type>
<parentSelectors type="array">
<_>productbody</_>
</parentSelectors>
<selector>a.promotion__min-amount</selector>
<multiple type="boolean">false</multiple>
<regex/>
<delay type="number">0</delay>
</_>
</selectors>
</json>
Query executed in 270.99 ms.
>
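The validation step mentioned earlier can also be approximated without BaseX. The following is a minimal Python sketch, not part of the original answer: it parses a sitemap with the standard json module and checks that every entry in parentSelectors is either _root or the id of another selector. The function name check_sitemap and the abbreviated SITEMAP string are illustrative assumptions, and the parent check is only a structural sanity test, not something that explains why the scraper returns no data.

```python
import json

# Abbreviated copy of the sitemap from the question (only three selectors kept).
SITEMAP = """
{"_id": "collectandgo",
 "startUrl": ["https://colruyt.collectandgo.be/cogo/nl/home"],
 "selectors": [
  {"id": "categories", "type": "SelectorLink", "parentSelectors": ["_root"],
   "selector": "div#arbo.nav__branch.branch", "multiple": true, "delay": 0},
  {"id": "items", "type": "SelectorElement", "parentSelectors": ["categories"],
   "selector": "div.product__inner", "multiple": true, "delay": 0},
  {"id": "productbody", "type": "SelectorElement", "parentSelectors": ["items"],
   "selector": "div.product__body", "multiple": true, "delay": 0}
 ]}
"""


def check_sitemap(text):
    """Parse a sitemap and return a list of problems (empty if none were found)."""
    try:
        sitemap = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]

    selectors = sitemap.get("selectors", [])
    # "_root" is the implicit top-level parent; every other parent must be a known id.
    known_ids = {"_root"} | {s.get("id") for s in selectors}

    problems = []
    for s in selectors:
        for parent in s.get("parentSelectors", []):
            if parent not in known_ids:
                problems.append(f"{s.get('id')}: unknown parent {parent!r}")
    return problems


if __name__ == "__main__":
    print(check_sitemap(SITEMAP) or "sitemap looks structurally consistent")
```

A sitemap pasted with curly quotes or translated key names would fail at the json.loads step, which is a quick way to catch copy-and-paste damage before importing it into the extension.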
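What query do you want to run against the data?
You may want to look into Selenium or other tools for scraping the data. Selenium supports XPath locators, BaseX uses XQuery, and both provide a Java API.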
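As an illustration of the kind of query that could be run against the converted document, here is a small Python sketch that uses the limited XPath support in xml.etree.ElementTree. It is an assumption-laden example rather than the answerer's method: the XML string is an abbreviated copy of the BaseX output, and selectors_of_type is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the XML that BaseX printed for the sitemap (two selectors kept).
XML = """
<json type="object">
  <__id>collectandgo</__id>
  <selectors type="array">
    <_ type="object">
      <id>categories</id>
      <type>SelectorLink</type>
      <selector>div#arbo.nav__branch.branch</selector>
    </_>
    <_ type="object">
      <id>productname</id>
      <type>SelectorText</type>
      <selector>div.product__name</selector>
    </_>
  </selectors>
</json>
"""


def selectors_of_type(xml_text, selector_type):
    """Return (id, CSS selector) pairs for all selectors of the given type."""
    root = ET.fromstring(xml_text)
    # ElementTree implements an XPath subset; [type='...'] filters on a child's text.
    matches = root.findall(f"./selectors/_[type='{selector_type}']")
    return [(m.findtext("id"), m.findtext("selector")) for m in matches]


if __name__ == "__main__":
    for selector_id, css in selectors_of_type(XML, "SelectorText"):
        print(selector_id, css)
```

In BaseX itself, a comparable lookup would be written as an XQuery path expression after the xquery command; the Python version is shown only because it runs without a database.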