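I am using StormCrawler 1.10 and ES 6.4.2. After the crawl completes and I check the records, I see that the crawler has fetched both the https and the http version of a URL, with identical title and description. How can I tell the crawler to fetch only one of them?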
Title: About Apache storm
Description:A Storm application is designed as a "topology" in the shape of a directed acyclic graph (DAG) with spouts and bolts acting as the graph vertices. Edges on the graph are named streams and direct data from one node to another. Together, the topology acts as a data transformation pipeline. At a superficial level the general topology structure is similar to a MapReduce job, with the main difference being that data is processed in real time as opposed to in individual batches. Additionally, Storm topologies run indefinitely until killed, while a MapReduce job DAG must eventually end.
url: https://www.someurl.com
Title: About Apache storm
Description:A Storm application is designed as a "topology" in the shape of a directed acyclic graph (DAG) with spouts and bolts acting as the graph vertices. Edges on the graph are named streams and direct data from one node to another. Together, the topology acts as a data transformation pipeline. At a superficial level the general topology structure is similar to a MapReduce job, with the main difference being that data is processed in real time as opposed to in individual batches. Additionally, Storm topologies run indefinitely until killed, while a MapReduce job DAG must eventually end.
url: http://www.someurl.com
Answer 0 (score: 0)
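Sites usually handle these variants with redirects, so you would get only one document. Alternatively, a site can provide a canonical tag, which StormCrawler uses instead of the normalised URL as the URL value.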
StormCrawler looks at each document individually and has no knowledge of other URLs. You could implement this kind of deduplication outside SC.
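One option within SC for handling any remaining duplicates is to generate a custom metadata value, e.g. a hash of the content, and then modify the ES indexer bolt so that it uses that value, if present, as the document ID. You would then get a single document, but with no way to choose which URL (http or https) is kept.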
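The content-hash idea can be sketched in plain Java, independent of StormCrawler. This is only an illustration: the class `ContentHashDocId` and its methods are hypothetical names, not part of StormCrawler or Elasticsearch, and the choice of SHA-256 is an assumption. A modified indexer bolt would call something like `docId` with the digest it stored in the metadata.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not a StormCrawler class: picks the document ID
// that a customised indexer bolt could pass to Elasticsearch.
final class ContentHashDocId {

    // Hex-encoded SHA-256 of the given bytes (e.g. the fetched page content).
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(data);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to an unsigned value so each byte prints as two hex digits.
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Use the content digest as the ID when one is available, so that pages with
    // identical content share one ID; otherwise fall back to a hash of the URL.
    static String docId(String url, String contentDigest) {
        if (contentDigest != null && !contentDigest.isEmpty()) {
            return contentDigest;
        }
        return sha256Hex(url.getBytes(StandardCharsets.UTF_8));
    }
}
```

With this scheme, http://www.someurl.com and https://www.someurl.com get the same ID whenever their content is byte-for-byte identical, so the second write overwrites the first in the index. Which URL survives depends on indexing order, and pages that differ by even one byte are still indexed separately.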