Question

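I have a content field in my source data called Overview, which is stored in Solr in a text field named tm_overview (I don't know why it is multi-valued; that was done before I arrived). It is a standard text field. During searches, I'm getting matches on numbers and text inside HTML tags. For example, searching for 166 finds the following text and returns a record: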

<img height=\"166\" src=\"[custom:asset-url]/6004064a_laser_dstnc_meter_emph_250x131_0.jpg\" width=\"250\" />

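So, clearly, I need to strip the HTML tags and their contents from the field content, and the tool for that appears to be HTMLStripCharFilterFactory. The field has both indexed and stored set to true, so as I understand it, the content will be indexed using the index analyzer defined in the <fieldType> definition in schema.xml, and when the field is returned, it will return the original stored data (which is what I want).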

Using a test index, I created the following <fieldType> definition in schema_extra_types.xml:

<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <!--<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="#strong" replacement="" />-->
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.CJKBigramFilterFactory"/>
        <filter class="solr.CJKWidthFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.CJKBigramFilterFactory"/>
        <filter class="solr.CJKWidthFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

The values passed to this field look like the following:

<table>
    <tbody>
        <tr>
            <td align="center">
            <figure class="center"><img height="166" src="[custom:asset-url]/F_tix520_05a_250x147_0.jpg" width="250" /></figure>

            <div class="small-font">Navigate, capture, and process images faster</div>
            </td>
            <td>
            ...

Or even just paragraph text:

This infrared camera gives you easier angles with a 240° rotating screen and broader temperature range The blah blah product will help you easily navigate over, under and around hard to reach targets with the full 240° rotating screen. You can capture and process images quickly and analyze images in the field on the 5.7 inch responsive touchscreen LCD with on-camera analytics. Save time by editing emissivity, background temp, transmissivity, palettes, color alarms, adjusting IR-Fusion, and enabling/disabling markers all on the camera.

However, none of the tags are being stripped. Is there something else I need to do to get the tags stripped?

The second question concerns the stored and indexed values. As noted, since the indexed value can differ from the original stored version, how can I see the difference between the two? If I run a query in the Solr admin UI, which version of the field will I see: indexed or stored?

Answer 1

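The stored value is (usually) never changed, as long as you ask for it to be stored. (You can change it through an update chain if necessary, but that doesn't sound like what you want.) What matters for indexing are the tokens produced from the content you put into the field. What gets returned does not change, and it does not depend on which tokens are indexed behind the scenes for searching. The value Solr returns is always the same as what you put into the field, at least as long as the field is set as stored. (docValues with useDocValuesAsStored may behave differently; I can't remember at the moment.)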
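To make the stored-versus-indexed distinction concrete, here is a minimal toy sketch in Python. It is not Solr's implementation: `TagStripper`, `strip_html`, and `analyze` are hypothetical helpers that only imitate the idea of a character filter followed by a tokenizer and a lowercase filter, and the sample markup is shortened from the question.

```python
import re
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text between tags, dropping tags and their attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(raw):
    """Rough stand-in for an HTML-stripping character filter."""
    parser = TagStripper()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.chunks)


def analyze(text, strip_tags):
    """Toy analysis chain: optional tag stripping, then lowercase word tokens."""
    if strip_tags:
        text = strip_html(text)
    return set(re.findall(r"\w+", text.lower()))


raw = (
    '<figure class="center"><img height="166" src="a.jpg" width="250" /></figure>'
    '<div class="small-font">Navigate faster</div>'
)

stored_value = raw                                    # returned to the client unchanged
index_tokens = analyze(raw, strip_tags=True)          # what a search would match against
unstripped_tokens = analyze(raw, strip_tags=False)    # same chain without the char filter
query_tokens = analyze("166", strip_tags=False)

print(stored_value == raw)
print(bool(query_tokens & index_tokens))
print(bool(query_tokens & unstripped_tokens))
```

In this sketch the stored value is untouched, a query for 166 no longer matches once tags are stripped before tokenizing, and it does match when the stripping step is skipped.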

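This also means that since you are sending HTML into the field as content, Solr will store it exactly as you sent it. You also have to reindex (resubmit) your content every time you change the field definition, unless you only change the query part of the analysis chain.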

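To see exactly how a field is processed, go to the collection in the admin interface, select Analysis, and paste your HTML into the "index" (left) box. In the right box, enter 166 or another example of a search string used in your application. Select the field whose processing you want to see, then press the submit button.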

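This shows exactly how the field is processed and the result after each filter in the chain. The resulting tokens are what matter: if the same tokens appear on both sides of the processing chain, they produce a match.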
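The same check can also be scripted. The sketch below only builds a request URL, assuming your Solr instance exposes the field analysis handler at `/analysis/field` with the `analysis.fieldname`, `analysis.fieldvalue`, and `analysis.query` parameters. The host, the collection name `test_index`, and the helper `field_analysis_url` are hypothetical and should be adjusted to your setup.

```python
from urllib.parse import urlencode


def field_analysis_url(base_url, collection, field_name, field_value, query):
    """Build a field-analysis request URL (hypothetical helper, not part of Solr)."""
    params = {
        "analysis.fieldname": field_name,    # field whose analyzers should be applied
        "analysis.fieldvalue": field_value,  # text run through the index analyzer
        "analysis.query": query,             # text run through the query analyzer
        "analysis.showmatch": "true",        # ask for matching tokens to be flagged
        "wt": "json",
    }
    return f"{base_url}/{collection}/analysis/field?{urlencode(params)}"


url = field_analysis_url(
    "http://localhost:8983/solr",  # assumed local Solr instance
    "test_index",                  # hypothetical collection name
    "tm_overview",
    '<img height="166" width="250" />',
    "166",
)
print(url)
```

Fetching that URL (for example with `urllib.request.urlopen`) should return the per-stage token output for both the index and query sides, which is the information the Analysis screen displays.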
