How to restrict duplicate crawling of similar URLs

Date: 2018-12-03 16:59:32

Tags: web-crawler stormcrawler

Using StormCrawler 1.10 and ES 6.4.2. After the crawl process finishes, when I check the records I see that the crawler has fetched both the https and the http version of URLs with the same title and description. How can I tell the crawler to fetch only one of them?

Title: About Apache storm
Description: A Storm application is designed as a "topology" in the shape of a directed acyclic graph (DAG) with spouts and bolts acting as the graph vertices. Edges on the graph are named streams and direct data from one node to another. Together, the topology acts as a data transformation pipeline. At a superficial level the general topology structure is similar to a MapReduce job, with the main difference being that data is processed in real time as opposed to in individual batches. Additionally, Storm topologies run indefinitely until killed, while a MapReduce job DAG must eventually end.
url: https://www.someurl.com


Title: About Apache storm
Description: A Storm application is designed as a "topology" in the shape of a directed acyclic graph (DAG) with spouts and bolts acting as the graph vertices. Edges on the graph are named streams and direct data from one node to another. Together, the topology acts as a data transformation pipeline. At a superficial level the general topology structure is similar to a MapReduce job, with the main difference being that data is processed in real time as opposed to in individual batches. Additionally, Storm topologies run indefinitely until killed, while a MapReduce job DAG must eventually end.
url: http://www.someurl.com
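
For background, StormCrawler decides which discovered URLs get fetched through a chain of configurable URL filters. One crawl-time way to make the two variants collapse is a custom filter that forces a single scheme. Below is a minimal sketch against the `URLFilter` interface of StormCrawler 1.x; the class name is made up for illustration, and the rewrite is only safe if the sites being crawled serve identical content over http and https:

```java
import java.net.URL;
import java.util.Map;

import com.digitalpebble.stormcrawler.Metadata;
import com.digitalpebble.stormcrawler.filtering.URLFilter;
import com.fasterxml.jackson.databind.JsonNode;

/**
 * Hypothetical filter: rewrites http:// URLs to https:// so that both
 * scheme variants of a page normalise to a single URL before fetching.
 */
public class HttpsSchemeFilter implements URLFilter {

    @Override
    public void configure(Map stormConf, JsonNode filterParams) {
        // nothing to configure in this sketch
    }

    @Override
    public String filter(URL sourceUrl, Metadata sourceMetadata, String urlToFilter) {
        if (urlToFilter != null && urlToFilter.startsWith("http://")) {
            return "https://" + urlToFilter.substring("http://".length());
        }
        // returning the URL unchanged keeps it; returning null would discard it
        return urlToFilter;
    }
}
```

The filter would then be declared in `urlfilters.json` alongside the default filters so that it runs on every discovered URL.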

1 Answer:

Answer 0 (score: 0):

These variations are often managed by the site itself as redirections, in which case you would get a single document. Alternatively, the site can provide a canonical tag (e.g. `<link rel="canonical" href="https://www.someurl.com/">`), which StormCrawler will use as the URL value instead of the normalised URL.

StormCrawler looks at documents individually and has no knowledge of the other URLs. You could achieve this outside SC by:

  1. collapsing the results when querying the index (see the sketch after this list)
  2. deduplicating the contents of the index, e.g. with MapReduce
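
For option 1, Elasticsearch can collapse hits on a field at query time. The sketch below uses the 6.4 high-level REST client and assumes the documents carry an indexed keyword field holding a hash of the content; the index name (`content`) and field name (`digest`) are illustrative:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.collapse.CollapseBuilder;

public class CollapsedSearch {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            SearchSourceBuilder source = new SearchSourceBuilder()
                    .query(QueryBuilders.matchQuery("description", "storm topology"))
                    // collapse on the content-hash field so that the http and
                    // https twins of a page come back as a single hit
                    .collapse(new CollapseBuilder("digest"));

            SearchRequest request = new SearchRequest("content").source(source);
            SearchResponse response = client.search(request, RequestOptions.DEFAULT);
            System.out.println(response.getHits().getTotalHits() + " hits");
        }
    }
}
```

Collapsing only changes what the query returns; the duplicates stay in the index, which is why option 2 (an offline deduplication job) may still be worthwhile.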

One option within SC for dealing with any remaining duplicates would be to generate a custom metadata value, e.g. a hash of the content, and modify the ES indexer bolt so that it uses that value (if present) as the document ID. You would then get a single document, but would not be able to choose which of the URLs (http or https) gets used.
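
As a sketch of that idea: a parse filter along the following lines could compute a hash of the extracted text and store it in the metadata, for a modified indexer bolt to pick up as the document ID. The class name and the metadata key are made up here, and the `ParseFilter` signature is the StormCrawler 1.x one; StormCrawler also ships an `MD5SignatureParseFilter` which does something very similar out of the box:

```java
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.w3c.dom.DocumentFragment;

import com.digitalpebble.stormcrawler.parse.ParseFilter;
import com.digitalpebble.stormcrawler.parse.ParseResult;

/**
 * Hypothetical parse filter: stores an MD5 hash of the extracted text under
 * the metadata key "digest". An indexer bolt modified to use that value as
 * the document ID would keep a single copy of identical http/https pages.
 */
public class ContentDigestFilter extends ParseFilter {

    @Override
    public void filter(URL url, byte[] content, DocumentFragment doc, ParseResult parse) {
        String text = parse.get(url.toExternalForm()).getText();
        if (text == null || text.isEmpty()) {
            return;
        }
        try {
            byte[] hash = MessageDigest.getInstance("MD5")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            // hex-encode the 16-byte digest
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            parse.get(url.toExternalForm()).getMetadata()
                    .setValue("digest", hex.toString());
        } catch (NoSuchAlgorithmException e) {
            // MD5 is available in every standard JRE
        }
    }
}
```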