Question

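I want to build a small image-based search engine: given an image file, it should search Solr for similar images. I am using Nutch for the crawling part and indexing the data into Solr. I have already changed the conf files, for example:

Added image/* to mimetype-filter.txt
Removed the image extensions from suffix-urlfilter.txt so that they are not skipped

I also added these fields to the Solr schema.xml: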

<field name="name" type="string" indexed="true" stored="true" />
<field name="iso" type="string" indexed="true" stored="true" multiValued="true" />
<field name="iso_string" type="string" indexed="true" stored="true" multiValued="true" />
<field name="aperture" type="double" indexed="true" stored="true" />
<field name="exposure" type="string" indexed="true" stored="true" />
<field name="exposure_time" type="double" indexed="true" stored="true" />
<field name="focal" type="string" indexed="true" stored="true" />
<field name="focal_35" type="string" indexed="true" stored="true" />
<dynamicField name="ignored_*" type="string" indexed="false" stored="false" multiValued="true" />

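But when I crawl, no data gets indexed into Solr. I could not find any documentation or tutorial on this. I have also read some Stack Overflow posts about crawling images with Nutch, but none of them helped.

Can someone point me in the right direction? Thanks in advance.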

Answer 1

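There is no simple or short answer to this question; even leaving the crawling part aside, parsing images is tricky. In addition to what you have already done, you first need to enable the parse-tika plugin (parse-html only handles HTML documents). Apache Tika is able to extract some metadata about images.

You also need to enable the mimetype-filter plugin (this means not only editing its configuration file, but also enabling the plugin in nutch-site.xml). Once that is configured, use the bin/nutch parsechecker <URL> tool to test the URL of a page that contains some images, and see whether the image URLs show up in the Outlinks section. Also run parsechecker against an image URL itself to see which metadata it extracts. After that, run the bin/nutch indexchecker tool on both URLs, check which fields it would index into Solr, and create those fields in the schema accordingly. Keep in mind that Tika may extract different metadata for each format.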
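As a rough illustration of "enabling it in nutch-site.xml", the sketch below overrides the plugin.includes property so that both parse-tika and mimetype-filter are active. It assumes a Nutch 1.x setup that indexes with indexer-solr; every plugin in the list other than parse-tika and mimetype-filter is an assumption modeled on a typical default, so merge these two entries into the value your installation already uses rather than copying the line as-is.

```xml
<!-- nutch-site.xml: minimal sketch, not a complete configuration -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- parse-(html|tika) turns on Tika for non-HTML content such as images;
         mimetype-filter turns on the filter that reads mimetype-filter.txt.
         The remaining plugins are assumed defaults: keep your own list. -->
    <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|mimetype-filter|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

After changing this file, rerunning bin/nutch parsechecker <URL> and bin/nutch indexchecker <URL> on an image URL should show whether Tika is now being used and which fields would be sent to Solr.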

Crawling images and their metadata with Nutch and indexing them into Solr
