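Solr Tika: removing newline characters

Time: 2016-12-10 00:27:05

Tags: solr apache-tika

I am using Solr 5.3.1 with Tika to extract PDFs for indexing. The process works, but the extracted text contains a lot of newline characters. Is there any way to remove those newlines with an analyzer?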

Here is my analyzer code:

<analyzer type="query">
  <!--<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>-->
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\\n])" replacement="" />
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.PatternReplaceFilterFactory" pattern="\u000A" replacement="," />
  <!--<Filter class="solr.PatternReplaceCharFilterFactory" pattern="([\\n])" replacement="" replace="all"/>-->
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EnglishPossessiveFilterFactory"/>
  <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
  <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:-->
  <!--<filter class="solr.EnglishMinimalStemFilterFactory"/>-->
  <!--<filter class="solr.PorterStemFilterFactory"/>-->
</analyzer>

Besides the CharFilter shown above, I tried putting the newline character (\n) into stopwords_en.txt. It did not work. I also tried solr.MappingCharFilterFactory with mapping rules such as "\n" => "<br>". That did not work either.

Can anyone help me remove the newline characters?

Thanks

1 answer:

Answer 0: (score: 1)

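This is your query-time analyzer, which is the analyzer that runs when a user submits a query. Your post-Tika processing happens in the index-time analyzer, so try defining the filter there. I think PatternReplaceCharFilterFactory should be enough. Alternatively, you could take a look at TrimFilterFactory.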
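
As a rough sketch of that suggestion, the char filter could be placed in an index-time analyzer along these lines. The field type name text_tika and the reduced filter chain are assumptions for illustration, not taken from the question. The pattern is written as [\r\n]+ because the question's pattern ([\\n]) is a character class that matches a literal backslash or the letter n, not a line feed.

```xml
<!-- Hypothetical field type; the name "text_tika" and the short filter chain are illustrative only -->
<fieldType name="text_tika" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Replace each run of CR/LF characters with a single space before tokenizing -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\r\n]+" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Replacing with a space rather than an empty string keeps words on adjacent lines from being glued together. Documents need to be reindexed after the schema change. Note also that analyzers only affect the indexed tokens; the stored field value returned in search results is left as Tika produced it, so if the newlines are showing up there, the cleanup would probably have to happen before analysis instead.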