Question

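I have a CQL table with a 4-field composite key that I want to index in Solr. All 4 composite PK fields are of type 'text' in CQL and of type 'string' in Solr; 2 of them may contain long strings. When I initialize the Solr core, I see many of the following warning messages in system.log: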

pastebin

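The actual messages are much longer than this (more than 200,000 characters on a single line), but I truncated them for readability. This steady stream of warnings floods my log file from the moment the core is initialized until the indexing process terminates prematurely (yes, Solr fails to index my data).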

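Coming from a MySQL background, I know that a PK has a maximum length (700 bytes in MySQL). So, even though no similar limit is mentioned in the Cassandra or Solr documentation, the first thing I did was replace the CQL composite key with a simple text key containing the SHA-1 hash of the 4 fields that were previously part of the composite PK. Voilà: the warnings were gone and Solr was able to index my data. So my question now is: does Solr have a limit on the length of a uniqueKey? Cassandra doesn't seem to have a problem with a long composite PK (since I was able to query some of my data via CQL), but Solr seems to have a limit.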
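
For illustration, here is a minimal sketch of how such a hashed key could be built in Java. The class name `CompositeKeyHasher`, the separator byte, and the hex encoding are all assumptions made for this example and are not taken from my actual loader; the separator is only there so that ("ab", "c") and ("a", "bc") do not hash to the same value.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class CompositeKeyHasher {
    // Assumption: the ASCII unit separator does not occur in the key fields.
    private static final byte SEPARATOR = 0x1F;

    // Returns the SHA-1 digest of the given fields as a 40-character hex string.
    static String sha1Key(String... parts) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            for (String part : parts) {
                digest.update(part.getBytes(StandardCharsets.UTF_8));
                digest.update(SEPARATOR); // keep field boundaries unambiguous
            }
            StringBuilder hex = new StringBuilder(40);
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical values, in the order of the original composite PK.
        System.out.println(sha1Key("some phrase", "example.com", "www.example.com", "/a/b"));
    }
}
```

The resulting key is always 40 characters long no matter how long the original fields are, at the cost of no longer being able to look a row up by its individual key columns unless they are also kept as regular columns.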

Update

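After further testing, I found that it is somehow the combination of a composite PK and CQL maps in my table schema that causes the Solr indexing problem.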

Composite PK + no maps (replaced by multiple columns) = works
Simple PK (SHA-1 hash of the composite PK columns) + maps = works
Composite PK + maps = does not work

I'm still not sure whether the problem has anything to do with the length of my data.

CQL table schema:

CREATE TABLE myks.mycf (
  phrase text,
  host text,
  domain text,
  path text,
  created timestamp,
  modified timestamp,

  attr1 int,
  attr2 bigint,
  attr3 double,
  attr4 int,
  attr5 bigint,
  attr6 bigint,
  attr7 double,
  attr8 double,

  scores map<text,int>,
  estimates map<text,bigint>,
  searches map<text,bigint>,

  PRIMARY KEY (phrase,domain,host,path),
) WITH gc_grace_seconds = 1296000
AND compaction={'class': 'LeveledCompactionStrategy'}
AND compression={'sstable_compression': 'LZ4Compressor'}

Solr schema:

<schema name="myks" version="1.5">
  <types>
    <fieldType name="text" class="solr.TextField">
     <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
     </analyzer>
    </fieldType>
    <fieldType name="string" class="solr.StrField" omitNorms="true"/>
    <fieldType name="boolean" class="solr.BoolField" omitNorms="true"/>
    <fieldtype name="binary" class="solr.BinaryField"/>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="date" class="solr.TrieDateField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
  </types>
  <fields>
    <field name="phrase" type="string" indexed="true" stored="true"/>
    <field name="host" type="string" indexed="true" stored="true"/>
    <field name="domain" type="string" indexed="true" stored="true"/>
    <field name="path" type="string" indexed="true" stored="true"/>
    <field name="created" type="date" indexed="true" stored="true"/>
    <field name="modified" type="date" indexed="true" stored="true"/>

    <field name="attr1" type="int" indexed="true" stored="true"/>
    <field name="attr2" type="long" indexed="true" stored="true"/>
    <field name="attr3" type="double" indexed="true" stored="true"/>
    <field name="attr4" type="int" indexed="true" stored="true"/>
    <field name="attr5" type="long" indexed="true" stored="true"/> 
    <field name="attr6" type="long" indexed="true" stored="true"/>
    <field name="attr7" type="double" indexed="true" stored="true"/>
    <field name="attr8" type="double" indexed="true" stored="true"/>

    <!-- CQL collection maps -->
    <dynamicField name="scores*" type="int" indexed="true" stored="true"/>
    <dynamicField name="estimates*" type="long" indexed="true" stored="true"/>
    <dynamicField name="searches*" type="long" indexed="true" stored="true"/>

    <!-- docValues - facet -->
    <field name="dv__domain" type="string" indexed="true" stored="false" docValues="true" multiValued="true"/>
    <field name="dv__attr4" type="int" indexed="true" stored="false" docValues="true" multiValued="true"/>
    <field name="dv__attr8" type="double" indexed="true" stored="false" docValues="true" multiValued="true"/>

    <!-- docValues - group -->
    <field name="dv__phrase" type="string" indexed="true" stored="false" docValues="true" multiValued="true"/>

    <!-- docValues - sort -->
    <field name="dv__attr2" type="long" indexed="true" stored="false" docValues="true" multiValued="true"/>
    <field name="dv__attr5" type="long" indexed="true" stored="false" docValues="true" multiValued="true"/>
    <field name="dv__attr1" type="int" indexed="true" stored="false" docValues="true" multiValued="true"/>
  </fields>

  <!-- Why we use copyFields for docValues: http://stackoverflow.com/questions/26495208/solr-docvalues-usage -->
  <copyField source="domain" dest="dv__domain"/>
  <copyField source="attr4" dest="dv__attr4"/>
  <copyField source="attr8" dest="dv__attr8"/>
  <copyField source="phrase" dest="dv__phrase"/>
  <copyField source="attr2" dest="dv__attr2"/>
  <copyField source="attr5" dest="dv__attr5"/>
  <copyField source="attr1" dest="dv__attr1"/>

  <defaultSearchField>phrase</defaultSearchField>
  <uniqueKey>(phrase,domain,host,path)</uniqueKey>
</schema>

I use CQLSSTableWriter to generate SSTables from a CSV dumped from MySQL. For the CQL maps, I chose a Java HashMap to represent the values.

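Today I also found that even Cassandra seems to have a problem with the combination of a composite PK and maps. When I look at the file system, the folder for the copy of the table that uses composite PK + maps is much smaller than the folders for the copies that use simple PK + maps or composite PK + no maps.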

Answer 1

Cassandra has a key limit of 64K.
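
As a rough sketch, you could check your key values against that limit before loading them. This assumes the limit applies to the UTF-8 encoded byte length of a key value and reads "64K" as 65535 bytes; the class name `KeySizeCheck` and the constant are made up for this example.

```java
import java.nio.charset.StandardCharsets;

final class KeySizeCheck {
    // Assumption: "64K" taken as the largest unsigned 16-bit value.
    static final int MAX_KEY_BYTES = 65535;

    // Length of the value in bytes once encoded as UTF-8 (not its character count).
    static int utf8Length(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if the encoded value stays within the assumed key limit.
    static boolean fitsKeyLimit(String value) {
        return utf8Length(value) <= MAX_KEY_BYTES;
    }

    public static void main(String[] args) {
        String phrase = "some phrase"; // hypothetical key value
        System.out.println(phrase + " -> " + utf8Length(phrase)
                + " bytes, fits: " + fitsKeyLimit(phrase));
    }
}
```

Note that the check counts bytes rather than characters, so non-ASCII text reaches the limit sooner than its character count suggests.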

In general, "text" should not be used for keys in Solr, because it is tokenized. Use "string" instead.

As the Cassandra FAQ wiki says, hashing is a better choice for keys with long text values: http://wiki.apache.org/cassandra/FAQ#max_key_size

Ultimately, it depends on how you want to query your Solr documents.

The general guidance on "limits" in Solr is simply to "be reasonable"; anything beyond that is likely to cause you problems somewhere.

Maximum length of a Solr uniqueKey
