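I am using Solr 5.3.1 and Tika to extract PDFs for indexing. The process works, but the extracted text contains a lot of line breaks. Is there any way to remove those line breaks using an analyzer?
Here is my analyzer code: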
I tried using a CharFilter, for instance, and I tried putting the line break (\n) into stopwords_en.txt. It didn't work. I also tried solr.MappingCharFilterFactory. I tried putting either
<analyzer type="query">
  <!--<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>-->
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\\n])" replacement="" />
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.PatternReplaceFilterFactory" pattern="\u000A" replacement="," />
  <!--<Filter class="solr.PatternReplaceCharFilterFactory" pattern="([\\n])" replacement="" replace="all"/>-->
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <filter class="solr.StopFilterFactory"
          ignoreCase="true"
          words="lang/stopwords_en.txt"
  />
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EnglishPossessiveFilterFactory"/>
  <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
  <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:-->
  <!--<filter class="solr.EnglishMinimalStemFilterFactory"/>-->
  <!--<filter class="solr.PorterStemFilterFactory"/>-->
</analyzer>
or "\n" => "<br>"
in the mapping file. That didn't work either.
Can someone help me remove the line breaks?
Thanks
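Answer 0 (score: 1)
This is your query-time analyzer, which is the analyzer that runs when a user submits a query. The post-processing of your Tika output happens in the index-time analyzer, so try defining it there. I think PatternReplaceCharFilterFactory should be enough. Alternatively, you could look at TrimFilterFactory.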
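As a rough illustration of the suggestion above, an index-time analyzer with the char filter might look like the sketch below. The field type name `text_pdf` is hypothetical, and the tokenizer and filters after the char filter are placeholders, not the asker's actual index chain. The pattern is written with single backslashes because the attribute value is handed to the Java regex engine as-is; the `[\\n]` in the question would instead match a literal backslash or the letter `n`.

```xml
<!-- Sketch only: "text_pdf" is a made-up field type name for illustration. -->
<fieldType name="text_pdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Runs before the tokenizer: collapse each run of CR/LF characters into one space -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[\r\n]+"
                replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- The query-time analyzer from the question would stay in its own
       <analyzer type="query"> element alongside this one. -->
</fieldType>
```

Note that analyzers only change the terms written to the index. The stored value of the field, which is what Solr returns in search results, keeps its original line breaks, so if the goal is cleaner displayed text this sketch alone may not be enough.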