From raw text to analyzer to tokenizer to filters and back to the raw text: how in Solr?

Time: 2018-03-22 10:17:08

Tags: filter solr reference tokenize analyzer

Consider the analysis of the first sentence of the Albert Einstein Wikipedia page:

http://localhost:8983/solr/#/trans/analysis?analysis.fieldvalue=Albert%20Einstein%20(14%20March%201879%20%E2%80%93%2018%20April%201955)%20was%20a%20German-born%20theoretical%20physicist%5B5%5D%20who%20developed%20the%20theory%20of%20relativity,%20one%20of%20the%20two%20pillars%20of%20modern%20physics%20(alongside%20quantum%20mechanics)&analysis.fieldtype=text_en&verbose_output=0

and its output (a screenshot of the Solr admin analysis screen, not reproduced here).

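Question: is there a way to get this from Solr in a semi-programmatic way? Ultimately, I am interested in mapping character sequences in the original text to the exact tokens in the last line of the analysis output.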

2 answers:

Answer 0: (score: 1)

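The web interface in Solr is a thin HTML/JavaScript application that does all of its actual work by calling Solr's REST interface. If you watch the network tab in your browser while asking the web interface to run an analysis, you can see it making a request to a URL like the following: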

http://localhost:8080/solr/corename/analysis/field?wt=json&analysis.showmatch=true&analysis.fieldvalue=foo%20bar&analysis.query=foo%20bar&analysis.fieldtype=text_no

The response is the JSON structure used to build the UI you see:

{
  "responseHeader":{
    "status":0,
    "QTime":108
  },
  "analysis":{
    "field_types":{
      "text_no":{
        "index":[
          "org.apache.lucene.analysis.standard.StandardTokenizer",
          [
            {
              "text":"foo",
              "raw_bytes":"[66 6f 6f]",
              "match":true,
              "start":0,
              "end":3,
              "org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute#positionLength":1,
              "type":"<ALPHANUM>",
              "position":1,
              "positionHistory":[
                1
              ]
            },
            {
              "text":"bar",
              "raw_bytes":"[62 61 72]",
              "match":true,
              "start":4,
              "end":7,
              "org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute#positionLength":1,
              "type":"<ALPHANUM>",
              "position":2,
              "positionHistory":[
                2
              ]
            }
          ],
          // .....
        ],
        "query":[
          "org.apache.lucene.analysis.standard.StandardTokenizer",
          [
             // ....
          ]
        ]
      }
    },
    "field_names":{

    }
  }
}
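
As a minimal sketch of issuing the same request without the browser, the snippet below builds the URL from its parts and parses the JSON reply. The host, port, core name and field type are simply the values from the example URL above; the helper names `analysis_url` and `fetch_analysis` are made up for illustration and are not part of Solr.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def analysis_url(base, core, field_type, text, query=None):
    """Build a field-analysis URL like the one the admin UI requests."""
    params = {
        "wt": "json",
        "analysis.fieldtype": field_type,
        "analysis.fieldvalue": text,
    }
    if query is not None:
        # Optional: also analyze a query string and flag matching tokens.
        params["analysis.query"] = query
        params["analysis.showmatch"] = "true"
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return f"{base}/{core}/analysis/field?{urlencode(params, quote_via=quote)}"


def fetch_analysis(url):
    """Fetch the URL and return the parsed JSON (needs a running Solr)."""
    with urlopen(url) as resp:
        return json.load(resp)


url = analysis_url("http://localhost:8080/solr", "corename", "text_no", "foo bar")
print(url)
```

Calling `fetch_analysis(url)` against a running instance should return a dictionary with the same shape as the JSON shown above.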

You can then walk through the index and query keys and pick the entries you want (last, first, etc.).

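The URL and response format may have changed between Solr versions, but I am fairly sure they have stayed stable across the last major versions.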
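
The traversal can be sketched as follows. The `index` list alternates between an analysis stage's class name and that stage's token list, so pairing the two and taking the last pair gives the final tokens, and each token's `start`/`end` offsets can be used to slice the original text. The sample response is a trimmed, made-up one in the shape shown above; the `LowerCaseFilter` stage and the function names are assumptions added for illustration.

```python
# Trimmed sample in the shape of the response above. The second stage is an
# illustrative assumption, not taken from a real response.
SAMPLE_TEXT = "Foo Bar"
SAMPLE_RESPONSE = {
    "analysis": {"field_types": {"text_no": {"index": [
        "org.apache.lucene.analysis.standard.StandardTokenizer",
        [{"text": "Foo", "start": 0, "end": 3, "position": 1},
         {"text": "Bar", "start": 4, "end": 7, "position": 2}],
        "org.apache.lucene.analysis.core.LowerCaseFilter",
        [{"text": "foo", "start": 0, "end": 3, "position": 1},
         {"text": "bar", "start": 4, "end": 7, "position": 2}],
    ]}}}
}


def analysis_stages(response, field_type, phase="index"):
    """Return [(stage_class_name, stage_output), ...] in pipeline order."""
    flat = response["analysis"]["field_types"][field_type][phase]
    # Even indices hold class names, odd indices hold that stage's output.
    return list(zip(flat[0::2], flat[1::2]))


def final_tokens_with_source(response, field_type, text, phase="index"):
    """Map each token of the last stage back to its span in the raw text."""
    _, tokens = analysis_stages(response, field_type, phase)[-1]
    return [(t["text"], t["start"], t["end"], text[t["start"]:t["end"]])
            for t in tokens]


for row in final_tokens_with_source(SAMPLE_RESPONSE, "text_no", SAMPLE_TEXT):
    print(row)
```

Each row pairs the final token with the character span it came from, which is the mapping the question asks for. If the field type uses char filters, the reported offsets may need checking against the unfiltered text.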

Answer 1: (score: 1)

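You can also use the term vector component to retrieve what you are looking for. This assumes the component is enabled in solrconfig.xml, which must contain the following lines: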

  <searchComponent name="tvComponent" class="solr.TermVectorComponent"/>

  <requestHandler name="/tvrh" class="solr.SearchHandler" startup="lazy">
    <lst name="defaults">
      <bool name="tv">true</bool>
    </lst>
    <arr name="last-components">
      <str>tvComponent</str>
    </arr>
  </requestHandler>

The schema must also configure the field appropriately (the type here is for German text):

<field name="trans" 
  type="text_de" 
  indexed="true" 
  termOffsets="true" 
  stored="true" 
  termPositions="true" 
  termVectors="true" 
  multiValued="true"/>

You can then retrieve the corresponding values with:

http://localhost:8983/solr/trans/tvrh?q=trans:tag&rows=1&indent=true&tv.all=true&wt=xml

Typical output:

<lst name="zweck">
  <int name="tf">1</int>         <!-- term frequency                 -->
  <lst name="positions">
    <int name="position">7</int> <!-- 7th word                       -->
  </lst>
  <lst name="offsets">
    <int name="start">45</int>   <!-- start character offset in the original text -->
    <int name="end">52</int>     <!-- end character offset (exclusive)            -->
  </lst>
  <int name="df">7</int>         <!-- 7 documents have the term      -->
  <double name="tf-idf">0.14285714285714285</double> <!-- tf/df = 1/7 -->
</lst>
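
A minimal sketch of reading such a term entry in Python, assuming the response was requested with `wt=xml` as in the URL above. The sample string repeats the output shown, without the comments; the function name `term_info` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# One term entry, copied from the typical output above.
SAMPLE = """
<lst name="zweck">
  <int name="tf">1</int>
  <lst name="positions">
    <int name="position">7</int>
  </lst>
  <lst name="offsets">
    <int name="start">45</int>
    <int name="end">52</int>
  </lst>
  <int name="df">7</int>
  <double name="tf-idf">0.14285714285714285</double>
</lst>
"""


def term_info(term_el):
    """Collect tf, df, tf-idf, positions and (start, end) offsets of a term."""
    info = {"term": term_el.get("name")}
    for child in term_el:
        name = child.get("name")
        if name == "positions":
            info["positions"] = [int(p.text) for p in child]
        elif name == "offsets":
            # start/end entries appear in pairs, one pair per occurrence.
            starts = [int(e.text) for e in child if e.get("name") == "start"]
            ends = [int(e.text) for e in child if e.get("name") == "end"]
            info["offsets"] = list(zip(starts, ends))
        elif child.tag == "int":
            info[name] = int(child.text)
        elif child.tag == "double":
            info[name] = float(child.text)
    return info


info = term_info(ET.fromstring(SAMPLE))
print(info)
```

With the stored field value at hand, slicing it as `value[start:end]` for each offset pair should give the original surface form that produced the term.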