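SOLR 4.0: trouble sorting alphabetically

Time: 2012-11-13 12:23:52

Tags: solr

I am having a hard time with a problem in my SOLR address database.

I built it from the example files. Basically, I am running the example configuration with a modified schema.

In schema.xml: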

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="_version_" type="long" indexed="true" stored="true" required="false" multiValued="false" />

<field name="givenname_s" type="text_de" indexed="true" stored="true" required="true" multiValued="false" />
<field name="middleinitial_s" type="text_de" indexed="false" stored="true" required="false" multiValued="false" />
<field name="surname_s" type="text_de" indexed="true" stored="true" required="true" multiValued="false" />
<field name="gender_s" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="pictureuri_s" type="string" indexed="false" stored="true" required="false" multiValued="false" />
<field name="function_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="organizationalunit_s" type="text_general" indexed="true" stored="true" required="false" multiValued="false" />
<field name="organizationalunitdescription_s" type="text_de" indexed="false" stored="true" required="false" multiValued="false" />
<field name="company_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="street_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="streetnumber_s" type="int" indexed="true" stored="true" required="false" multiValued="false" />
<field name="postcode_s" type="int" indexed="true" stored="true" required="false" multiValued="false" />
<field name="city_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="building_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="roomnumber_s" type="int" indexed="true" stored="true" required="false" multiValued="false" />
<field name="country_s" type="text_en" indexed="true" stored="true" required="true" multiValued="false" />
<field name="countrycode_s" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="emailaddress_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="phone1_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="phone2_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="mobile_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="fax_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />

I populated the database by pushing about 20,000 random test records through post.jar:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<add>
    <doc>
        <field name="id">1352498443_1</field>
        <field name="givenname_s">Aynur</field>
        <field name="middleinitial_s"/>
        <field name="surname_s">Lehnen</field>
        <field name="gender_s">F</field>
        <field name="pictureuri_s">dummy_assets/female.jpg</field>
        <field name="function_s">Zugschaffner/in</field>
        <field name="organizationalunit_s">P 07</field>
        <field name="organizationalunitdescription_s">Lorem Ipsum sadipscing voluptua ipsum invidunt dolor et dolore invidunt sed consetetur accusam dolore Lorem tempor.</field>
        <field name="company_s">Lorem Lagna Epsum Emet</field>
        <field name="street_s">Erlenweg</field>
        <field name="streetnumber_s">82</field>
        <field name="postcode_s">76297</field>
        <field name="city_s">Lübeck</field>
        <field name="building_s"/>
        <field name="roomnumber_s">242</field>
        <field name="country_s">GERMANY</field>
        <field name="countrycode_s">DE</field>
        <field name="emailaddress_s">aynur.lehnen@lorem-lagna-epsum-emet.de</field>
        <field name="phone1_s">0392984823</field>
        <field name="phone2_s">0124111417</field>
        <field name="mobile_s">0325117132</field>
        <field name="fax_s">0171459177</field>
    </doc>
</add>

However, when retrieving data I seem to run into problems with alphabetical sorting. Consider the following query:

{
    "responseHeader": {
        "status": 0,
        "QTime": 5,
        "params": {
            "sort": "surname_s asc",
            "fl": "surname_s",
            "indent": "true",
            "wt": "json",
            "q": "city_s:berlin"
        }
    },
    "response": {
        "numFound": 1094,
        "start": 0,
        "docs": [{
            "surname_s": "Weil"
        }, {
            "surname_s": "Abel"
        }, {
            "surname_s": "Adam"
        }, {
            "surname_s": "Ade"
        }, {
            "surname_s": "Adrian"
        }, {
            "surname_s": "Aigner"
        }, {
            "surname_s": "Aigner"
        }, {
            "surname_s": "Alber"
        }, {
            "surname_s": "Alber"
        }, {
            "surname_s": "Albers"
        }]
    }
}

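Why does "Weil" come first, while the rest of the data appears to be sorted correctly?

3 answers:

Answer 0: (score: 14)

I think some of the additional analyzers applied in the text_de field type are causing this sorting behavior. In my experience, the best results when sorting strings come from the alphaOnlySort fieldType that ships with the example schema.xml, shown below.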

<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- KeywordTokenizer does no actual tokenizing, so the entire
         input string is preserved as a single token
      -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- The LowerCase TokenFilter does what you expect, which can be
         when you want your sorting to be case insensitive
      -->
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- The TrimFilter removes any leading or trailing whitespace -->
    <filter class="solr.TrimFilterFactory" />
    <!-- The PatternReplaceFilter gives you the flexibility to use
         Java Regular expression to replace any sequence of characters
         matching a pattern with an arbitrary replacement string, 
         which may include back references to portions of the original
         string matched by the pattern.

         See the Java Regular Expression documentation for more
         information on pattern and replacement string syntax.

         http://java.sun.com/j2se/1.6.0/docs/api/java/util/regex/package-summary.html
      -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^a-z])" replacement="" replace="all"
    />
  </analyzer>
</fieldType>
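
To make the effect of this analyzer chain more concrete, here is a rough Python emulation of the sort key it would produce. This is only a sketch: the function name `alpha_only_sort_key` is made up for illustration, and Python's `lower()` and `strip()` are assumed to be close enough to Solr's LowerCaseFilter and TrimFilter for these inputs.

```python
import re


def alpha_only_sort_key(value: str) -> str:
    """Approximate the alphaOnlySort chain (illustrative, not Solr code)."""
    token = value                        # KeywordTokenizer: whole input stays one token
    token = token.lower()                # LowerCaseFilter
    token = token.strip()                # TrimFilter
    return re.sub(r"[^a-z]", "", token)  # PatternReplaceFilter with replace="all"


surnames = ["Weil", "Abel", "adam", " Ade "]
print(sorted(surnames, key=alpha_only_sort_key))
print(alpha_only_sort_key("Lübeck"))
```

Note that in this emulation the pattern `[^a-z]` also removes umlauts and digits, so "Lübeck" ends up with the key "lbeck". That may matter for German names.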

I would suggest creating a new field and copying the value over from surname_s with a copyField, like this:

 <field name="surname_s_sort" type="alphaOnlySort" indexed="true" stored="false" required="false" multiValued="false" />

 <copyField source="surname_s" dest="surname_s_sort"/>

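Note: there is no need to store the value in the surname_s_sort field unless you want to display it to the user, hence the stored="false" attribute.

Then you just need to change your query to sort on surname_s_sort.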
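
As a sketch of what the changed request could look like, the snippet below builds the query string with the new sort field. The base URL (host, port, and core name) is an assumption for a local example setup and does not come from the question, and `build_sorted_query` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed base URL: host, port, and core name are placeholders.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_sorted_query(city: str) -> str:
    """Build a select URL that sorts on the copied sort field."""
    params = {
        "q": f"city_s:{city}",
        "sort": "surname_s_sort asc",  # sort on the copyField target
        "fl": "surname_s",             # still return the original field
        "wt": "json",
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


print(build_sorted_query("berlin"))
```

Because copyField is applied at index time, existing documents would need to be re-posted before the new field can be used for sorting.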

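Answer 1: (score: 4)

Sorting does not work on multivalued or tokenized fields.

Documentation -
Sorting can be done on the "score" of the document, or on any multiValued="false" indexed="true" field, provided that field is either non-tokenized (i.e. has no Analyzer) or uses an Analyzer that only produces a single Term (i.e. uses the KeywordTokenizer).

Use string as the field type and copy the field you want to sort on (surname_s here) into a new field.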

<field name="surname_s_sort" type="string" indexed="true" stored="false"/>

<copyField source="surname_s" dest="surname_s_sort" />  

As @Paige answered, you can also use the keyword tokenizer with a lowercase filter, since that combination does not tokenize the field.
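
The single-term rule from the documentation quoted in this answer can be illustrated with a toy Python sketch. The whitespace split here is only a stand-in for a real tokenizing analyzer such as the one in text_de, and all function names are hypothetical.

```python
def keyword_tokenize(value: str) -> list[str]:
    """Stand-in for KeywordTokenizer: the whole value is one term."""
    return [value]


def whitespace_tokenize(value: str) -> list[str]:
    """Stand-in for a tokenizing analyzer: one term per word."""
    return value.split()


def is_sortable(terms: list[str]) -> bool:
    """A field value sorts reliably only if it yields exactly one term."""
    return len(terms) == 1


company = "Lorem Lagna Epsum Emet"
print(is_sortable(keyword_tokenize(company)))     # one term
print(is_sortable(whitespace_tokenize(company)))  # four terms
```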

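Answer 2: (score: 0)

I had a similar problem and tried alphaOnlySort. It partly works, but when the field contains values with characters such as -, /, or spaces, it starts to mess up the sort results.

So the results looked like this:

  1. / abc
  2. AA
  3. / abc2

So I ended up using the lowercase field type. It was already there, so I assume it is a default type. I did use the copyField construct, so my final configuration is: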

    <schema>
        <fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
          <analyzer>
            <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory" />
          </analyzer>
        </fieldType>
        <fields>
           <field name="job_name_sort" type="lowercase" indexed="true" stored="false" required="false"/>
        </fields>
        <copyField source="job_name" dest="job_name_sort"/>
    </schema>
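
To compare the two field types on values like the ones above, here is a small Python sketch that approximates both sort keys. It is an emulation under the assumption that Python's string functions behave like the Solr filters for these inputs; `lowercase_key` and `alpha_only_key` are hypothetical names.

```python
import re


def lowercase_key(value: str) -> str:
    """Approximates the lowercase type: KeywordTokenizer + LowerCaseFilter."""
    return value.lower()


def alpha_only_key(value: str) -> str:
    """Approximates alphaOnlySort: lowercase, trim, drop everything but a-z."""
    return re.sub(r"[^a-z]", "", value.lower().strip())


values = ["/ abc", "AA", "/ abc2"]
print(sorted(values, key=lowercase_key))
print(sorted(values, key=alpha_only_key))
print(alpha_only_key("/ abc") == alpha_only_key("/ abc2"))
```

In this emulation the lowercase key keeps punctuation and digits, so "/ abc" and "/ abc2" stay next to each other, while the alphaOnlySort key reduces both to "abc" and can no longer tell them apart.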