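I am using Storm Crawler 1.13 with Elastic Search 6.5.2 and working with the TextExtractor. I already exclude script and style tags, and I want to exclude the header tag in the same way. I applied the following configuration, but it does not work for all results. I want to keep h1, h2 and h3 and remove only the tag named header. Any suggestions?
Webpage: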
<header id="section-header" class="section section-header">
</header>
<h1 class="title" id="page-title">Good Morning..</h1>
crawlerconf.yaml
textextractor.include.pattern:
- DIV[id="maincontent"]
- DIV[itemprop="articleBody"]
- ARTICLE
textextractor.exclude.tags:
- STYLE
- SCRIPT
- HEADER
- FOOTER
Answer 0 (score: 2)
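I could not reproduce your issue on my local machine. It may be a configuration error on your side, or the website you are referring to is special in some way.
Did you verify that your custom crawler-conf.yaml is loaded correctly and that textextractor.exclude.tags is included in the loaded configuration?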
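One way to check this is to print what the configuration map actually holds under that key. The following is a minimal sketch, assuming you can get at the configuration as a plain `Map` (for example the one Storm hands to a component when it is prepared). The class `ConfCheck` and the method `listValues` are hypothetical helpers for illustration, not part of StormCrawler.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a StormCrawler class): reads a list-valued key
// from a configuration map so it can be logged or inspected.
class ConfCheck {

    /** Returns the values stored under key, or an empty list if the key is absent. */
    static List<String> listValues(Map<String, Object> conf, String key) {
        List<String> values = new ArrayList<>();
        Object raw = conf.get(key);
        if (raw instanceof Collection) {
            // YAML lists are typically loaded as a collection of entries
            for (Object o : (Collection<?>) raw) {
                values.add(String.valueOf(o));
            }
        } else if (raw != null) {
            // a single scalar value instead of a list
            values.add(raw.toString());
        }
        return values;
    }

    public static void main(String[] args) {
        // Stand-in for the loaded configuration; in a real topology this map
        // would come from the framework, not be built by hand.
        Map<String, Object> conf = new HashMap<>();
        conf.put("textextractor.exclude.tags",
                Arrays.asList("STYLE", "SCRIPT", "HEADER", "FOOTER"));

        System.out.println(listValues(conf, "textextractor.exclude.tags"));
        System.out.println(listValues(conf, "textextractor.include.pattern"));
    }
}
```

If the first list comes back empty in your running topology, the custom file is probably not the one being loaded, or the key sits at a different nesting level than expected.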
I tried to reproduce your issue with the following steps:
1. Took the 1.13 release sources.
2. Added the following unit test to TextExtractorTest.java:
@Test
public void testRemoveHeaderElements() throws IOException {
    Config conf = new Config();

    HashSet<String> excluded = new HashSet<>();
    excluded.add("HEADER");
    excluded.add("FOOTER");
    excluded.add("SCRIPT");
    excluded.add("STYLE");
    conf.put(TextExtractor.EXCLUDE_PARAM_NAME, PersistentVector.create(excluded));

    HashSet<String> included = new HashSet<>();
    included.add("DIV[id=\"maincontent\"]");
    included.add("DIV[itemprop=\"articleBody\"]");
    included.add("ARTICLE");
    conf.put(TextExtractor.INCLUDE_PARAM_NAME, PersistentVector.create(included));

    TextExtractor extractor = new TextExtractor(conf);
    String content = "<header id=\"section-header\" class=\"section section-header\"></header><h1 class=\"title\" id=\"page-title\">Good Morning..</h1>";
    Document jsoupDoc = Parser.htmlParser().parseInput(content,
            "http://stormcrawler.net");
    String text = extractor.text(jsoupDoc.body());

    assertEquals("Good Morning..", text);
}
This unit test for the TextExtractor component passed. Next, I uploaded a page with the following HTML code to a locally deployed web server:
<header id="section-header" class="section section-header">
</header>
<h1 class="title" id="page-title">Good Morning..</h1>
The extracted text content is:
Good Morning..
which should be fine according to your requirements.
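As for the worry that excluding HEADER might also drop h1, h2 and h3: the result above suggests it does not. Here is a minimal sketch of why, assuming the exclusion compares whole element names, ignoring case, rather than prefixes. The class `TagExclusion` is a hypothetical model of that rule, not StormCrawler code.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical model of a tag exclusion rule (not the StormCrawler implementation).
class TagExclusion {
    private final Set<String> excluded = new HashSet<>();

    TagExclusion(String... tags) {
        // normalise case so "HEADER" in the config matches "header" in the DOM
        for (String tag : tags) {
            excluded.add(tag.toLowerCase(Locale.ROOT));
        }
    }

    /** True only when the whole element name equals an excluded tag. */
    boolean isExcluded(String elementName) {
        return excluded.contains(elementName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        TagExclusion rule = new TagExclusion("STYLE", "SCRIPT", "HEADER", "FOOTER");
        System.out.println(rule.isExcluded("header")); // true
        System.out.println(rule.isExcluded("h1"));     // false
        System.out.println(rule.isExcluded("h2"));     // false
    }
}
```

Under that assumption, a heading such as h1 is only lost when it sits inside an excluded element like header, or outside every element matched by textextractor.include.pattern.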