Question

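I am crawling news websites with StormCrawler (v1.16) and storing the data in Elasticsearch (v7.5.0). My crawler-conf file is available under stormcrawler files. I am using Kibana for visualization. My questions are:

1. While crawling a news website, I only need the URLs of article content, but I am also getting URLs for ads and other tabs on the website. What do I have to change, and where? (Kibana link)
2. If I only need specific things from a URL (for example only the title or only the content), how can I do that?

Edit: I was thinking of adding a field to the content index, so I made changes in src/main/resources/parsefilter.json, ES_IndexInit.sh and crawler-conf.yaml. The XPath I added is correct. I added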

"parse.pubDate":"//META[@itemprop=\"datePublished\"]/@content"

in parsefilter.json,

parse.pubDate =PublishDate

in crawler-conf.yaml, and added

"PublishDate": { "type": "text", "index": false, "store": true}

to the properties in ES_IndexInit.sh. But I am still not getting any field named PublishDate in Kibana or Elasticsearch. The mapping from ES_IndexInit.sh is as follows:

{
  "mapping": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "PublishDate": {
        "type": "text",
        "index": false,
        "store": true
      },
      "content": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "description": {
        "type": "text",
        "store": true
      },
      "domain": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "host": {
        "type": "keyword",
        "store": true
      },
      "keywords": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "title": {
        "type": "text",
        "store": true
      },
      "url": {
        "type": "keyword",
        "store": true
      }
    }
  }
}

Answer 1

One approach to indexing only the news pages of a site is to rely on sitemaps, but not all sites provide them.

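Alternatively, you would need a mechanism as part of the parsing, maybe in a ParseFilter, to determine that a page is a news item, and then filter during indexing based on the presence of a key/value in the metadata.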
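To make the key/value idea concrete, here is a minimal sketch in plain Java. It does not use the StormCrawler API: the `Map` stands in for the page metadata, and the class name `NewsItemFilter` and the key `parse.isArticle` are hypothetical names chosen for illustration. The assumption is that a ParseFilter has already written that key for pages it recognises as articles.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of a metadata-based indexing decision.
 * The Map is a stand-in for the crawler's metadata object (assumption),
 * and "parse.isArticle" is a hypothetical key set earlier by a ParseFilter.
 */
public class NewsItemFilter {

    private final String key;
    private final String value;

    public NewsItemFilter(String key, String value) {
        this.key = key;
        this.value = value;
    }

    /** Returns true only if the metadata holds the expected value under the key. */
    public boolean shouldIndex(Map<String, String[]> metadata) {
        String[] values = metadata.get(key);
        if (values == null) {
            return false; // key absent: not flagged as a news item
        }
        return Arrays.asList(values).contains(value);
    }

    public static void main(String[] args) {
        NewsItemFilter filter = new NewsItemFilter("parse.isArticle", "true");

        // A page that the (hypothetical) ParseFilter flagged as an article.
        Map<String, String[]> article = new HashMap<>();
        article.put("parse.isArticle", new String[] {"true"});

        // A page without the flag, e.g. an ad or a section tab.
        Map<String, String[]> advert = new HashMap<>();
        advert.put("parse.title", new String[] {"Buy now"});

        System.out.println(filter.shouldIndex(article)); // true
        System.out.println(filter.shouldIndex(advert));  // false
    }
}
```

If I recall correctly, the StormCrawler indexer bolts can apply this kind of check for you through the `indexer.md.filter` setting (a `key=value` pair), so it is worth looking at the configuration documentation before writing custom code.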

The way it is done in the news crawl dataset from CommonCrawl is that the seed URLs are sitemaps or RSS feeds.

To not index the content, simply comment out

  indexer.text.fieldname: "content"

in the configuration.
