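I have a Solr field called verbatim that contains sentences and mobile numbers. I use the text_general data type for verbatim.
The requirement is that the verbatim field should not be searchable by mobile number (format XXX-XXX-XXXX). These are the ideas I have so far.
1. Before sending the data to Solr, use pattern matching to find the phone number, replace it with "", and then index as usual. But this means we are modifying the content. Also, since there are millions of records, doing this in Java for every record may take additional time.
2. Send the data to Solr unchanged and use a pattern filter in the field definition (text_general_vision) in schema.xml to identify the phone number, as shown below. But I can still search with XXX or XXX-XXX-XXXX. Any help identifying the issue is appreciated. Thanks in advance.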
<fieldType name="text_general_vision" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.PatternReplaceFilterFactory" pattern="\\d{3}-\\d{3}-\\d{4}" replacement="" replace="all" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\\d{3}-\\d{3}-\\d{4}" replacement="" replace="all" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Answer 0 (score: 2)
The problem is that the filter you've provided runs after tokenization. This means it never sees the complete phone number: because the number is separated by hyphens, the StandardTokenizer has already split it into separate tokens.
You can apply a PatternReplaceCharFilter before tokenization happens, which lets you remove anything that matches a regular expression.
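As a rough sketch of what that could look like, here is the field type from the question with the token filter swapped for a `solr.PatternReplaceCharFilterFactory` placed ahead of the tokenizer. Some assumptions on my part: the same char filter is added to both the index and query analyzers, the regex is written with single backslashes (the value is read from the XML attribute as-is, so `\\d` would match a literal backslash followed by `d`), and only the `pattern` and `replacement` attributes are used.

```xml
<fieldType name="text_general_vision" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Char filters see the raw text, before the tokenizer splits on "-" -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\d{3}-\d{3}-\d{4}" replacement=""/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Assumed: strip full phone numbers from queries as well -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\d{3}-\d{3}-\d{4}" replacement=""/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Existing documents keep their old tokens until they are reindexed, so the change only takes effect after a reindex. The Analysis screen in the Solr admin UI is a quick way to check that a sample value no longer produces the number tokens.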
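Keep in mind that you'll still be doing this for every record (you have to do it either per record or per query; there are usually fewer records than queries, but YMMV). The difference is that the logic now lives on the Solr side, so you don't have to update every indexing method.
Also keep in mind that the phone numbers will still be available if the field is stored, but that doesn't seem to be an issue here.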