How to create a query in SOLR using distinct + subselect (select ... in ...)

Asked: 2013-10-25 11:16:08

Tags: select solr distinct subquery

The problem is as follows.

Data structure in SOLR:

<field name="id" type="string" required="true"/> 
<field name="session_id" type="string" required="true"/> 
<field name="action_type" required="true"/> 
<field name="error_msg" required="false"/>

(All fields have: indexed="true" stored="true" multiValued="false".) Only the error_msg field is optional (it can be empty).

There is an equivalent table in Oracle:

TABLE SOLR_TEST
  (
    ID          NUMBER NOT NULL ,
    SESSION_ID  VARCHAR2(20 BYTE) NOT NULL ,
    ACTION_TYPE VARCHAR2(20 BYTE) NOT NULL ,
    ERROR_MSG   VARCHAR2(20 BYTE)
  );

Sample data (the same in SOLR and in Oracle):

ID SESSION_ID           ACTION_TYPE          ERROR_MSG          
-- -------------------- -------------------- --------------------
 1 00001                SELECTED_ACTION                           
 2 00001                SELECTED_ACTION                           
 3 00001                OTHER                                     
 4 00002                A2                   ERROR_001            
 5 00002                OTHER                                     
 6 00003                SELECTED_ACTION      ERROR_002            
 7 00004                A1                   ERROR_001            
 8 00005                A2                                        
 9 00005                SELECTED_ACTION                           
10 00005                SELECTED_ACTION      ERROR_003            
11 00006                SELECTED_ACTION                           
12 00006                OTHER                ERROR_004            

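Question:

How do I create a SOLR query that returns all session_id values for which the specified action_type occurred, but for which that action_type never occurred with a non-empty error_msg?

In other words, the SOLR equivalent of this Oracle query: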

select distinct session_id 
    from SOLR_TEST 
    where action_type='SELECTED_ACTION' 
    and not session_id in 
      ( select session_id 
        from SOLR_TEST 
        where action_type='SELECTED_ACTION' 
              and error_msg is not null
      );

The result of this query is:

SESSION_ID         
--------------------
00001                
00006                
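
To double-check the expected result without an Oracle instance, the query can be replayed against the sample rows in an in-memory SQLite database. This is only a sketch: SQLite stands in for Oracle, the column types are simplified, the condition is written as `NOT IN`, and `sessions_without_errors` is a helper name made up for this example.

```python
import sqlite3

# Sample rows from the question: (id, session_id, action_type, error_msg).
ROWS = [
    (1, "00001", "SELECTED_ACTION", None),
    (2, "00001", "SELECTED_ACTION", None),
    (3, "00001", "OTHER", None),
    (4, "00002", "A2", "ERROR_001"),
    (5, "00002", "OTHER", None),
    (6, "00003", "SELECTED_ACTION", "ERROR_002"),
    (7, "00004", "A1", "ERROR_001"),
    (8, "00005", "A2", None),
    (9, "00005", "SELECTED_ACTION", None),
    (10, "00005", "SELECTED_ACTION", "ERROR_003"),
    (11, "00006", "SELECTED_ACTION", None),
    (12, "00006", "OTHER", "ERROR_004"),
]


def sessions_without_errors(rows, action_type="SELECTED_ACTION"):
    """Run the question's query on an in-memory SQLite copy of SOLR_TEST."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE SOLR_TEST ("
        " ID INTEGER NOT NULL,"
        " SESSION_ID TEXT NOT NULL,"
        " ACTION_TYPE TEXT NOT NULL,"
        " ERROR_MSG TEXT)"
    )
    conn.executemany("INSERT INTO SOLR_TEST VALUES (?, ?, ?, ?)", rows)
    cursor = conn.execute(
        "SELECT DISTINCT session_id FROM SOLR_TEST"
        " WHERE action_type = ?"
        " AND session_id NOT IN ("
        "   SELECT session_id FROM SOLR_TEST"
        "   WHERE action_type = ? AND error_msg IS NOT NULL)"
        " ORDER BY session_id",
        (action_type, action_type),
    )
    result = [row[0] for row in cursor]
    conn.close()
    return result


print(sessions_without_errors(ROWS))  # ['00001', '00006']
```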

For example, a SOLR query like this one works:

http://solrhost/solr/collection/select?rows=1&q=-(error_msg:[*+TO+*]+AND+action_type:SELECTED_ACTION)&wt=xml&indent=true&facet=true&facet.field=session_id&facet.zeros=false&fq=action_type:SELECTED_ACTION
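
Hand-writing the escaped URL is error-prone, so it may help to build it from a parameter list. A minimal sketch using only the Python standard library; the host `solrhost` and the core `collection` are taken from the URL above, while `build_select_url` is a helper name invented for this illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(base_url, params):
    """Return a Solr select URL with every parameter value URL-encoded."""
    return base_url + "?" + urlencode(params)


# Same parameters as in the URL above, written unescaped.
params = [
    ("rows", "1"),
    ("q", "-(error_msg:[* TO *] AND action_type:SELECTED_ACTION)"),
    ("fq", "action_type:SELECTED_ACTION"),
    ("wt", "xml"),
    ("indent", "true"),
    ("facet", "true"),
    ("facet.field", "session_id"),
    ("facet.zeros", "false"),
]

url = build_select_url("http://solrhost/solr/collection/select", params)
print(url)

# Decoding the query string again should give back the unescaped q parameter.
print(parse_qs(urlsplit(url).query)["q"][0])
```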

//EDIT/////////////////////////////////////

The real schema looks like this:

<schema name="elogging" version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="action_type" type="string" indexed="true" stored="true" required="false" multiValued="false"/>
    <field name="session_id" type="string" indexed="true" stored="true" required="false" multiValued="false"/>
    <field name="error_msg" type="string" indexed="true" stored="true" required="false" multiValued="false"/>
    <field name="_version_" type="long" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <types>
    <fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="uuid" class="solr.UUIDField" indexed="true"/>
  </types>
  <updateRequestProcessorChain name="uniq-fields">
    <processor class="org.apache.solr.update.processor.UniqFieldsUpdateProcessorFactory">
      <lst name="fields">
        <str>id</str>
      </lst>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>
</schema>

//EDIT 2 //////////////////////

The SOLR query does not work correctly. It returns something equivalent to:

select distinct session_id 
from SOLR_TEST 
where action_type='SELECTED_ACTION' 
and error_msg is null;

SESSION_ID         
--------------------
00001                
00005                
00006

The value '00005' is wrong, because this row exists:

10 00005                SELECTED_ACTION      ERROR_003            
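
The likely reason is that q and fq are evaluated per document, not per session: document 9 has no error_msg, so it passes the negated clause on its own, regardless of what document 10 contains. A rough Python emulation of that per-document matching on the sample rows (plain Python, no Solr involved; `matches` is a helper invented for this illustration):

```python
from collections import Counter

# Sample rows: (id, session_id, action_type, error_msg); None = no error_msg.
DOCS = [
    (1, "00001", "SELECTED_ACTION", None),
    (2, "00001", "SELECTED_ACTION", None),
    (3, "00001", "OTHER", None),
    (4, "00002", "A2", "ERROR_001"),
    (5, "00002", "OTHER", None),
    (6, "00003", "SELECTED_ACTION", "ERROR_002"),
    (7, "00004", "A1", "ERROR_001"),
    (8, "00005", "A2", None),
    (9, "00005", "SELECTED_ACTION", None),
    (10, "00005", "SELECTED_ACTION", "ERROR_003"),
    (11, "00006", "SELECTED_ACTION", None),
    (12, "00006", "OTHER", "ERROR_004"),
]


def matches(doc):
    """Emulate fq=action_type:SELECTED_ACTION combined with
    q=-(error_msg:[* TO *] AND action_type:SELECTED_ACTION) for one document."""
    _id, _session_id, action_type, error_msg = doc
    passes_fq = action_type == "SELECTED_ACTION"
    excluded_by_q = error_msg is not None and action_type == "SELECTED_ACTION"
    return passes_fq and not excluded_by_q


# Facet on session_id over the matching documents only.
facet = Counter(doc[1] for doc in DOCS if matches(doc))
print(sorted(facet.items()))  # [('00001', 2), ('00005', 1), ('00006', 1)]
```

Session 00005 stays in the facet because document 9 matches; nothing in the query looks at the other documents of the same session.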

//EDIT 3 ////////////

This SOLR query does not work either (same problem as before):

http://solrhost/solr/collection/select?rows=1&q=action_type:SELECTED_ACTION+AND+-{!join+from=session_id+to=session_id}error_msg:*+AND+action_type:SELECTED_ACTION&wt=xml&indent=true&facet=true&facet.field=session_id&facet.zeros=false

//EDIT 4 ///////

*Fixed the schema: 'error_msg' is indexed*

//EDIT 5 /////

Here is the sample data for SOLR:

id,session_id,action_type,error_msg
1,00001,SELECTED_ACTION,
2,00001,SELECTED_ACTION,
3,00001,OTHER,
4,00002,A2,ERROR_001
5,00002,OTHER,
6,00003,SELECTED_ACTION,ERROR_002
7,00004,A1,ERROR_001
8,00005,A2,
9,00005,SELECTED_ACTION,
10,00005,SELECTED_ACTION,ERROR_003
11,00006,SELECTED_ACTION,
12,00006,OTHER,ERROR_004

SOLR result for this data and the query:

http://localhost:8983/solr/collection3/select?rows=1&q=-(error_msg:[*+TO+*]+AND+action_type:SELECTED_ACTION)&wt=xml&indent=true&facet=true&facet.field=session_id&facet.zeros=false&fq=action_type:SELECTED_ACTION

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">30</int>
<lst name="params">
<str name="facet.zeros">false</str>
<str name="facet">true</str>
<str name="indent">true</str>
<str name="q">
-(error_msg:[* TO *] AND action_type:SELECTED_ACTION)
</str>
<str name="facet.field">session_id</str>
<str name="wt">xml</str>
<str name="fq">action_type:SELECTED_ACTION</str>
<str name="rows">1</str>
</lst>
</lst>
<result name="response" numFound="4" start="0">
<doc>
<str name="id">1</str>
<str name="session_id">00001</str>
<str name="action_type">SELECTED_ACTION</str>
<long name="_version_">1449881246216749056</long>
</doc>
</result>
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="session_id">
<int name="00001">2</int>
<int name="00005">1</int>
<int name="00006">1</int>
</lst>
</lst>
<lst name="facet_dates"/>
<lst name="facet_ranges"/>
</lst>
</response>
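
To post-process such a response on the client, the facet counts can be pulled out of the XML with the Python standard library. A minimal sketch, assuming the wt=xml layout shown above; `parse_facet_counts` is a hypothetical helper, and the embedded XML is a trimmed copy of the response.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the facet section of the response above.
SAMPLE_RESPONSE = """<response>
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="session_id">
<int name="00001">2</int>
<int name="00005">1</int>
<int name="00006">1</int>
</lst>
</lst>
</lst>
</response>"""


def parse_facet_counts(xml_text, field):
    """Return {value: count} for one facet field of a Solr wt=xml response."""
    root = ET.fromstring(xml_text)
    path = (
        "./lst[@name='facet_counts']"
        "/lst[@name='facet_fields']"
        f"/lst[@name='{field}']/int"
    )
    return {node.get("name"): int(node.text) for node in root.findall(path)}


print(parse_facet_counts(SAMPLE_RESPONSE, "session_id"))
# {'00001': 2, '00005': 1, '00006': 1}
```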

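1 Answer:

Answer 0 (score: 0)

This is a bit tricky because, as far as I know (and I would be happy if someone proved me wrong), it is not possible to reuse part of one query's result in another query (for example in a filter query or a nested query).

So, here is the best I have been able to come up with so far:

Query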

http://localhost:8983/solr/stack19588325/select?q=action_type%3A%22SELECTED_ACTION%22&fq=%7B!tag%3Ddt%7Daction_type%3ASELECTED_ACTION+AND+error_msg%3A%5B*+TO+*%5D+AND+_query_%3A%7B!join+from%3Dsession_id+to%3Dsession_id+v%3D%24qq%7D&rows=0&wt=xml&indent=true&facet=true&facet.mincount=1&facet.field={!ex=dt%20key=nonfilter_session_id}session_id&facet.field=session_id&qq=-error_msg:[*%20TO%20*]

Result

<response>    
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
  <lst name="params">
    <str name="qq">-error_msg:[* TO *]</str>
    <str name="q">action_type:"SELECTED_ACTION"</str>
    <arr name="facet.field">
      <str>{!ex=dt key=nonfilter_session_id}session_id</str>
      <str>session_id</str>
    </arr>
    <str name="indent">true</str>
    <str name="fq">{!tag=dt}action_type:SELECTED_ACTION AND error_msg:[* TO *] AND _query_:{!join from=session_id to=session_id v=$qq}</str>
    <str name="facet.mincount">1</str>
    <str name="rows">0</str>
    <str name="wt">xml</str>
    <str name="facet">true</str>
    <str name="_">1382878844535</str>
  </lst>
</lst>
<result name="response" numFound="1" start="0">
</result>
<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields">
    <lst name="nonfilter_session_id">
      <int name="00001">2</int>
      <int name="00005">2</int>
      <int name="00003">1</int>
      <int name="00006">1</int>
    </lst>
    <lst name="session_id">
      <int name="00005">1</int>
    </lst>
  </lst>
  <lst name="facet_dates"/>
  <lst name="facet_ranges"/>
</lst>
</response>

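So, as you can see, we have two different facet results:

  • nonfilter_session_id: this facet ignores the tagged filter ({!ex=dt}), so it lists every session_id that matches q (action_type:"SELECTED_ACTION"), with or without an error_msg. The count is the number of SELECTED_ACTION records for that session_id.
  • session_id: this lists the session_id values that have records both with and without an error_msg (00005 is such a case). The count is the number of SELECTED_ACTION records with an error_msg for that session_id.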

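So, if there is no better option, you can combine these two sets on the client side to obtain the desired session_id values. Note that a plain intersection is not enough: for the data above it yields only 00005, which is exactly the session_id that has to be excluded.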
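
For the sample data, one combination that does produce the expected result is a set difference: take the session_id facet returned by the query from the question (00001, 00005, 00006) and remove the session_id facet returned by the query in this answer (00005). A sketch in Python, with both facet results copied by hand from the responses on this page; `clean_sessions` is a name made up for this example, and the approach has only been checked against this sample data.

```python
# Facet counts copied from the two responses shown on this page.
question_facet = {"00001": 2, "00005": 1, "00006": 1}  # query from the question
answer_facet = {"00005": 1}  # "session_id" facet of the query in this answer


def clean_sessions(candidates, with_errors):
    """Return the candidate session_ids that do not appear in with_errors."""
    return sorted(set(candidates) - set(with_errors))


print(clean_sessions(question_facet, answer_facet))  # ['00001', '00006']
```

This costs two requests plus a little client code, so it is worth verifying on real data before relying on it.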