Question

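I am new to Nutch. I have had Nutch 1.12 installed on my Windows 10 64-bit machine for a week. I want to extract the image URLs from http://www.myntra.com/men-tshirts that match the XPath //a[@class="product-link"]/img/@src. I have provided the seed URL in the seed.txt file and edited my regex-urlfilter.txt as follows: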

-\.(gif|GIF|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+^http://([a-z0-9]*\.)*www.myntra.com/men-tshirts
+^http://([a-z0-9]*\.)*assets.myntassets.com/

I removed the .jpg extension from the exclusion rule so that image URLs that look like "http://assets.myntassets.com/h_240,q_95,w_180/v1/image/style/properties/560433/ELABORADO-Men-Black-T-shirt_1_2010376f80fc4a65f4baac2f8c39082a_mini.jpg" are not ignored.
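
To sanity-check rules like the ones above outside of Nutch, a small shell helper can mimic how the regex URL filter decides: rules are tried top to bottom, the first matching rule wins, and a URL that matches no rule is rejected. This is only a sketch. `check_url` is a hypothetical name, not a Nutch command, and it assumes the patterns are also valid extended regexes for `grep -E` (Nutch itself uses Java regexes, so results may differ for more exotic patterns).

```shell
# Hypothetical helper (not part of Nutch): print the decision that a list of
# +/- regex rules would give for one URL, using first-match-wins semantics.
# Assumes the patterns are also valid POSIX extended regexes for grep -E.
# $1 = URL to test, $2 = path to a rules file such as regex-urlfilter.txt
check_url() {
  local url rules line pattern
  url=$1
  rules=$2
  while IFS= read -r line; do
    case $line in
      '+'*|'-'*) ;;          # a rule line, handled below
      *) continue ;;         # blank line, comment, or anything else
    esac
    pattern=${line#?}        # strip the leading + or -
    if printf '%s\n' "$url" | grep -Eq -e "$pattern"; then
      case $line in
        '+'*) echo accept ;;
        *)    echo reject ;;
      esac
      return 0
    fi
  done < "$rules"
  echo reject                # no rule matched
}
```

For example, `check_url 'http://www.myntra.com/men-tshirts' conf/regex-urlfilter.txt` would print the decision for the seed URL, assuming the filter file lives under `conf/`.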

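I also modified the suffix-urlfilter.txt file and removed the .jpg extension from the list of forbidden extensions.

But in the end I am still unable to extract any image URLs.

Here are the steps I followed: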

1: bin/nutch inject crawl/crawldb urls

2: bin/nutch generate crawl/crawldb crawl/segments

3: s1=`ls -d crawl/segments/2* | tail -1`

4: bin/nutch fetch $s1

5: bin/nutch parse $s1

6: bin/nutch updatedb crawl/crawldb $s1

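When I ran the steps up to this point and indexed the data into Solr, I saw that only one document was indexed, and nothing related to image URLs. I then tried the next steps: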

7: bin/nutch generate crawl/crawldb crawl/segments -topN 1000

8: s2=`ls -d crawl/segments/2* | tail -1`

9: bin/nutch fetch $s2
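
The generate/fetch/parse/updatedb round in the steps above could be wrapped in a small script so that each round picks up the newest segment and includes the parse and updatedb steps. This is a sketch under assumptions: `NUTCH`, `latest_segment`, and `run_round` are hypothetical names introduced here, and the paths are the ones used in the steps above.

```shell
# NUTCH is an assumed variable naming the launcher; override it if needed.
NUTCH=${NUTCH:-bin/nutch}

# Print the newest segment directory under $1. Segment names are timestamps,
# so the last entry in sorted order is the most recent one.
latest_segment() {
  ls -d "$1"/2* 2>/dev/null | tail -1
}

# Hypothetical wrapper for one generate/fetch/parse/updatedb round.
# $1 = crawldb path, $2 = segments path, remaining args are passed to generate.
run_round() {
  local crawldb segments segment
  crawldb=$1
  segments=$2
  shift 2
  # Stop the round early if generate fails or produces nothing.
  "$NUTCH" generate "$crawldb" "$segments" "$@" || return 1
  segment=$(latest_segment "$segments")
  if [ -z "$segment" ]; then
    echo "no segment found under $segments" >&2
    return 1
  fi
  # Each step runs only if the previous one succeeded.
  "$NUTCH" fetch "$segment" &&
    "$NUTCH" parse "$segment" &&
    "$NUTCH" updatedb "$crawldb" "$segment"
}
```

After `bin/nutch inject crawl/crawldb urls`, a first round would be `run_round crawl/crawldb crawl/segments`, and a second one `run_round crawl/crawldb crawl/segments -topN 1000`.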

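After step 9, I do not see any URLs being selected in the console.

Can someone help me understand the problem and guide me on how to extract the image URLs from the given seed URL?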
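
One way to check whether any image URLs were seen at all is to search a text dump of a segment for them. The sketch below assumes such a dump already exists as a plain text file (for example one produced with `bin/nutch readseg -dump`). `list_jpg_urls` is a hypothetical helper, not a Nutch command, and the host name is passed in unescaped, so dots in it match any character.

```shell
# Hypothetical helper: print the distinct .jpg URLs on one host found in a text file.
# $1 = text file (for example a segment dump), $2 = host name
list_jpg_urls() {
  # -o prints only the matching part of each line; sort -u removes duplicates.
  grep -oE "https?://$2/[^\"' <>]*\.jpg" "$1" | sort -u
}
```

A possible use, assuming the dump was written to a file named `dump` inside a `dump_out` directory: `list_jpg_urls dump_out/dump assets.myntassets.com`.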

Nutch: Unable to extract image URLs using Nutch 1.12

0 answers: