Disable searching on mobile phone numbers

Time: 2015-07-17 18:59:33

Tags: solr

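I have a Solr field named verbatim that contains sentences and mobile phone numbers. I use the text_general field type for verbatim.

The requirement is that the verbatim field must not be searchable by mobile phone number (format XXX-XXX-XXXX). Here are the approaches I have considered.

  1. Before sending the data to Solr, match phone numbers with a pattern, replace them with "", and then index as usual. But that means modifying the content. Also, with millions of records, doing this in Java for every record could take additional time.

  2. Send the data to Solr unchanged and use a pattern filter in the field definition (text_general_vision) in schema.xml to recognize phone numbers, as shown below. But I can still search with XXX or XXX-XXX-XXXX. Any help identifying the problem is appreciated. Thanks in advance.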

    <fieldType name="text_general_vision" class="solr.TextField" positionIncrementGap="100">
     <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.PatternReplaceFilterFactory" pattern="\\d{3}-\\d{3}-\\d{4}" replacement="" replace="all" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="\\d{3}-\\d{3}-\\d{4}" replacement="" replace="all" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    

1 Answer:

Answer 0: (score: 2)

The problem is that the filters you have provided run after tokenization. That means they never see the complete phone number, because the StandardTokenizer has already split it into separate tokens at each `-`.

You can apply a PatternReplaceCharFilter before tokenization happens, which lets you remove any text that matches a regular expression.
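
A minimal sketch of what that could look like for the field type in the question, assuming the rest of the analysis chain stays as it was. The char filter is placed in both analyzers here so that a phone number is stripped from the indexed text and from the query string alike; that placement is an assumption about the desired behavior, not something the question states. Note that the regex uses single backslashes, since XML attribute values do not need Java-style escaping. This has not been tested against a specific Solr version.

```xml
<fieldType name="text_general_vision" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Char filters see the raw text before the tokenizer splits it on "-" -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\d{3}-\d{3}-\d{4}" replacement=""/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Same char filter on the query side, so a phone number typed as a query is dropped too -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\d{3}-\d{3}-\d{4}" replacement=""/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Existing documents would need to be reindexed for the change to take effect, and the Analysis screen in the Solr admin UI is a convenient way to check which tokens a sample value produces.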

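Remember that you will still be doing this work for every record (you have to do it either per record or per query; there are usually fewer records than queries, but YMMV). The difference is that the logic lives on the Solr side, so you do not have to update every indexing method.

Also keep in mind that if the field is stored, the phone number will still be available in the stored value, but that does not appear to be a problem here.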