Question

我正在尝试提取HTML文件的元标记，并使用tika集成将它们索引到solr中。我无法用Tika提取这些元标记，也无法在solr中显示。

我的HTML文件看起来像这样。

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="product_id" content="11"/>
<meta name="assetid" content="10001"/>
<meta name="title" content="title of the article"/>
<meta name="type" content="0xyzb"/>
<meta name="category" content="article category"/>
<meta name="first" content="details of the article"/>

<h4>title of the article</h4>
<p class="link"><a href="#link">How cite the Article</a></p>
<p class="list">
  <span class="listterm">Length: </span>13 to 15 feet<br>
  <span class="listterm">Height to Top of Head: </span>up to 18 feet<br>
  <span class="listterm">Weight: </span>1,200 to 4,300 pounds<br>
  <span class="listterm">Diet: </span>leaves and branches of trees<br>
  <span class="listterm">Number of Young: </span>1<br>
  <span class="listterm">Home: </span>Sahara<br>
</p>
</p>

我的data-config.xml文件看起来像这样

<dataConfig>
<dataSource name="bin" type="BinFileDataSource" />
    <document>   
    <entity name="f" dataSource="null" rootEntity="false"
        processor="FileListEntityProcessor"
        baseDir="/path/to/html/files/" 
        fileName=".*html|xml" onError="skip"
        recursive="false">

        <field column="fileAbsolutePath" name="path" />
        <field column="fileSize" name="size"/>
        <field column="file" name="filename"/>

        <entity name="tika-test" dataSource="bin" processor="TikaEntityProcessor" 
        url="${f.fileAbsolutePath}" format="text" onError="skip">

        <field column="product_id" name="product_id" meta="true"/>
        <field column="assetid" name="assetid" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="type" name="type" meta="true"/>
        <field column="first" name="first" meta="true"/>
        <field column="category" name="category" meta="true"/>      
        </entity>
    </entity>
</document>
</dataConfig>

在我的schema.xml文件中，我添加了以下字段。

<field name="product_id" type="string" indexed="true" stored="true"/>
<field name="assetid" type="string" indexed="true" stored="true" />
<field name="title" type="string" indexed="true" stored="true"/>
<field name="type" type="string" indexed="true" stored="true"/>
<field name="category" type="string" indexed="true" stored="true"/>
<field name="first" type="text_general" indexed="true" stored="true"/>

在我的solrconfing.xml文件中，我添加了以下代码。

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler" />
<lst name="defaults">
  <str name="config">/path/to/data-config.xml</str>
</lst>

有谁知道如何从HTML文件中提取这些元标记并在solr和Tika中对它们进行索引？我们将非常感谢您的帮助。

Answer 1

我不认为meta =“true”表示你认为它意味着什么。它通常指的是约文件而不是内容。因此，内容类型等可能也会映射http-equiv。

除此之外，您需要提取实际内容。您可以使用format =“xml”然后将内部实体与XPathEntityProcessor放在一起并映射路径。除此之外，即使这样，你也受到限制，因为AFAIK，DIH使用了DefaultHtmlMapper，它在它允许的内容中具有极大的限制性，并且跳过大多数“class”和“id”属性甚至是“div”之类的东西。您可以在源代码中自己阅读list of allowed elements and attributes。

坦率地说，更简单的方法是拥有一个SolrJ客户端并自己管理Tika。然后你可以将它设置为使用IdentityHtmlMapper，它不会与HTML混淆。

Answer 2

您使用的是哪个版本的Solr？如果您使用的是Solr 4.0或更高版本，则会在其中嵌入tika。 Tika使用solrconfig.xml中配置的'Solr-Cells' 'ExtractingRequestHandler'类与solr进行通信，如下所示：

      <!-- Solr Cell Update Request Handler

       http://wiki.apache.org/solr/ExtractingRequestHandler 

    -->
  <requestHandler name="/update/extract" 
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>

      <!-- capture link hrefs but ignore div attributes -->
      <str name="captureAttr">true</str>
      <str name="fmap.a">links</str>
      <str name="fmap.div">ignored_</str>
    </lst>
  </requestHandler>

现在，在上面的配置中，默认情况下在solr中，从HTML文档中提取的未在schema.xml中声明的任何字段都以'ignored _'为前缀，即它们映射到 schema.xml中的'ignored _ *'动态字段。默认schema.xml，内容如下：

       <!-- some trie-coded dynamic fields for faster range queries -->
   <dynamicField name="*_ti" type="tint"    indexed="true"  stored="true"/>
   <dynamicField name="*_tl" type="tlong"   indexed="true"  stored="true"/>
   <dynamicField name="*_tf" type="tfloat"  indexed="true"  stored="true"/>
   <dynamicField name="*_td" type="tdouble" indexed="true"  stored="true"/>
   <dynamicField name="*_tdt" type="tdate"  indexed="true"  stored="true"/>

   <dynamicField name="*_pi"  type="pint"    indexed="true"  stored="true"/>
   <dynamicField name="*_c"   type="currency" indexed="true"  stored="true"/>

   <dynamicField name="ignored_*" type="ignored" multiValued="true"/>
   <dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>

   <dynamicField name="random_*" type="random" />

   <!-- uncomment the following to ignore any fields that don't already match an existing 
        field name or dynamic field, rather than reporting them as an error. 
        alternately, change the type="ignored" to some other type e.g. "text" if you want 
        unknown fields indexed and/or stored by default --> 
   <!--dynamicField name="*" type="ignored" multiValued="true" /-->

 </fields>

以下是'忽略'类型的处理方式：

<!-- since fields of this type are by default not stored or indexed,
     any data added to them will be ignored outright.  --> 
<fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />

因此，由tika提取的元数据默认情况下由Solr-Cell放入'忽略'字段，这也就是为什么它们会被索引和存储忽略。因此，要更改索引和存储元数据，您可以更改“uprefix = attr _”或“创建特定字段或动态字段< / em>'对于你已知的元数据并按你的意愿对待它们。

所以，这是更正后的solrconfig.xml：

 <requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" > <lst name="defaults"> <str name="lowernames">true</str> <str name="uprefix">attr_</str>  <str name="captureAttr">true</str> <str name="fmap.a">links</str> <str name="fmap.div">ignored_</str> </lst> </requestHandler>

Answer 3

尽管是一个较旧的问题，我的答复是

我最近问了一个类似的问题（几天后没有回复或评论），我对此进行了整理，并且与该问题有关。
Solr多年来已经发生了很大的变化，有关该主题的现有文档（存在的地方）既令人困惑又有时是错误的。
虽然篇幅冗长，但该答复提供了示例和文档，以解决该问题。

简而言之，我现在删除的StackOverflow问题是“使用Apache Solr从HTML提取自定义（例如

简短的答案是，虽然索引“标准” HTML元素（a; div; h1; h2; li; meta; p; title; ... https://www.w3.org/TR/2005/WD-xhtml2-20050527/elements.html）相对容易，但它具有挑战性包括自定义标记集，而无需严格使用格式正确的XML文件和Solr中的更新功能（例如：https://lucene.apache.org/solr/guide/6_6/uploading-data-with-index-handlers.html#uploading-data-with-index-handlers），或者将captureAttr参数与Apache Tika一起使用，该参数通过ExtractingRequestHandler（如下所述）或其他工具，例如Apache Nutch。

诸如<title>Solr HTML Indexing Tests</title>之类的标准HTML元素易于索引；但是，诸如<my_id>bt-ic8eew2u</my_id>之类的非标准元素将被忽略。

虽然您可以应用基于XML的解决方案，例如<field name="my_id">bt-ic8eew2u</field>，但我更喜欢基于HTML的简便解决方案-因此，是HTML元数据方法。

环境：Arch Linux（x86_64）命令行； Apache Solr 8.7.0； FireFox 83.0中的Solr管理员用户界面（http：// localhost：8983 / solr /＃/ gettingstarted / query）

测试文件（solr_test9.html）：

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-us">
<head>
  <meta charset="UTF-8" />
  <title>Solr HTML Indexing Tests</title>
  <meta name="date_created" content="2019-11-01" />
  <meta name="source_url" content="/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html" />
  <!-- <my_id>bt-ic8eew2u</my_id> -->
  <meta name="doc_id" content="bt-ic8eeW2U" />
  <meta name="date_pub" content="2020-11-16" />
</head>

<body>
<h1>Apples</h1>
<p>I like apples.</p>

<h2>Bananas</h2>
<p>I also like bananas.</p>

<p><div id="div1">This text is located in div element 1.</div></p>
<p><div id="div2">This text is located in div element 2.</div></p>

<br/>
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
<br/>

<p>Suspendisse efficitur pulvinar elementum.</p>

<p>My website is <a href="https://buriedtruth.com/">BuriedTruth.com</a>.</p>

<h1>Nova Scotia</h1>
<p>Nova Scotia is a province on the east coast of Canada.</p>

<h2>Capital of Nova Scotia</h2>
<p>Halifax is the capital of N.S.</p>
<p>Halifax is also N.S.'s largest city.</p>

<h1>British Columbia</h1>
<h2>Capital of British Columbia</h2>
<p>Victoria is the capital of B.C.</p>
<p>Vancouver is the largest city in B.C., however.</p>

<p>Non-terminated sentence (missing period)</p>

<meta name="date_current" content="2020-11-17" />
<!-- Comments like these are not indexed. -->
<p>Current date: 2020-11-17</p>

</body>
</html>

solrconfig.xml

以下是我的solrconfig.xml文件中的相关补充内容。

  <!-- SOLR CELL PLUGINS: -->
  <lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />

  <!-- https://lucene.472066.n3.nabble.com/Prons-an-Cons-of-Startup-Lazy-a-Handler-td4059111.html -->
  <requestHandler name="/update/extract"
    class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>
      <str name="capture">div</str>
      <str name="fmap.div">div</str>
      <str name="capture">h1</str>
      <str name="fmap.h1">h1</str>
      <str name="capture">h2</str>
      <str name="fmap.h2">h2_t</str>
      <str name="capture">p</str>
      <!-- <str name="fmap.p">p_t</str> -->
      <str name="fmap.p">p</str>
      <!-- COMMENT: note that the entries above refer to standard -->
      <!-- HTML elements.  As long as you have <meta/> (metadata) -->
      <!-- entries ("doc-id", "date_pub" ...) in your schema then -->
      <!-- Solr will automatically pick them up when indexing ... -->
      <!-- (hence no need to include those, here!).               -->
    </lst>
  </requestHandler>

  <!-- https://doc.lucidworks.com/fusion-server/5.2/reference/solr-reference-guide/7.7.2/update-request-processors.html -->
  <!-- The update.autoCreateFields property can be turned to false to disable schemaless mode -->
  <updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:true}"
           processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.DistributedUpdateProcessorFactory"/>
    <!-- ======================================== -->
    <!-- https://lucene.apache.org/solr/7_4_0/solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html -->
    <processor class="solr.RegexReplaceProcessorFactory">
      <str name="fieldName">content</str>
      <str name="fieldName">title</str>
      <str name="fieldName">p</str>
      <!-- Case-sensitive (and only one pattern:replacement allowed, so use as many copies): -->
      <!-- of this processor as needed: -->
      <str name="pattern">\s+</str>
      <str name="replacement"> </str>
      <bool name="literalReplacement">true</bool>
    </processor>

    <!-- Solr bug? URLs parse as "rect https..."  Managed-schema (Admin UI): defined p as text_general -->
    <!-- but did not parse. Looking at content | title: text_general copied to string, so added  -->
    <!-- copyfield of p (text_general) as p_str ... regex below now works! -->
    <!-- https://stackoverflow.com/questions/22178700/solr-extractingrequesthandler-extracting-rect-in-links/64882751#64882751 -->
      <processor class="solr.RegexReplaceProcessorFactory">
      <str name="fieldName">content</str>
      <str name="fieldName">title</str>
      <str name="fieldName">p</str>
      <!-- Case-sensitive (and only one pattern:replacement allowed, so use as many copies): -->
      <!-- of this processor as needed: -->
      <str name="pattern">rect http</str>
      <str name="replacement">http</str>
      <bool name="literalReplacement">true</bool>
    </processor>
    <!-- ======================================== -->
    <!-- This needs to be last (may need to clear documents and re-index to see changes, e.g. Solr Admin UI): -->
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

托管模式（schema.xml）：

我通过Admin UI编辑了Solr模式。基本上，对于要索引的任何HTML元数据，添加一个名称相似的字段（适当的类型：例如text_general | string | pdate | ...）。

例如，为了捕获“ doc-id”和“ date_pub”元数据，我创建了以下（分别）模式条目：

<field name="doc_id" type="string" uninvertible="true" indexed="true" stored="true"/>
<field name="date_pub" type="pdate" uninvertible="true" indexed="true" stored="true"/>

索引

这是我为该HTML测试文件建立索引的方式，

[victoria@victoria solr-8.7.0]$ date; pwd; ls -l; echo; ls -l server/solr/gettingstarted/conf/

Tue Nov 17 02:18:12 PM PST 2020

/mnt/Vancouver/apps/solr/solr-8.7.0
total 1792
drwxr-xr-x  3 victoria victoria   4096 Nov 17 13:26 bin
-rw-r--r--  1 victoria victoria 946955 Oct 28 02:40 CHANGES.txt
drwxr-xr-x 12 victoria victoria   4096 Oct 29 07:09 contrib
drwxr-xr-x  4 victoria victoria   4096 Nov 15 12:33 dist
drwxr-xr-x  3 victoria victoria   4096 Nov 15 12:33 docs
drwxr-xr-x  6 victoria victoria   4096 Oct 28 02:40 example
drwxr-xr-x  2 victoria victoria  36864 Oct 28 02:40 licenses
-rw-r--r--  1 victoria victoria  12646 Oct 28 02:21 LICENSE.txt
-rw-r--r--  1 victoria victoria 766662 Oct 28 02:40 LUCENE_CHANGES.txt
-rw-r--r--  1 victoria victoria  27540 Oct 28 02:21 NOTICE.txt
-rw-r--r--  1 victoria victoria   7490 Oct 28 02:40 README.txt
drwxr-xr-x 11 victoria victoria   4096 Nov 15 12:40 server

total 208
drwxr-xr-x 2 victoria victoria  4096 Oct 28 02:21 lang
-rw-r--r-- 1 victoria victoria 33888 Nov 17 13:20 managed-schema
-rw-r--r-- 1 victoria victoria   873 Oct 28 02:21 protwords.txt
-rw-r--r-- 1 victoria victoria 33788 Nov 17 11:36 schema.xml.2020-11-17.13:01
-rw-r--r-- 1 victoria victoria 59248 Nov 17 13:16 solrconfig.xml
-rw-r--r-- 1 victoria victoria 59151 Nov 17 12:59 solrconfig.xml.2020-11-17.13:01
-rw-r--r-- 1 victoria victoria   781 Oct 28 02:21 stopwords.txt
-rw-r--r-- 1 victoria victoria  1124 Oct 28 02:21 synonyms.txt

[victoria@victoria solr-8.7.0]$ solr restart; sleep 1; post -c gettingstarted /mnt/Vancouver/programming/datasci/solr/test/solr_test9.html

Sending stop command to Solr running on port 8983 ... waiting up to 180 seconds to allow Jetty process 3511453 to stop gracefully.
Waiting up to 180 seconds to see Solr running on port 8983 [|]  
Started Solr server on port 8983 (pid=3572520). Happy searching!

/usr/lib/jvm/java-8-openjdk/jre//bin/java -classpath /mnt/Vancouver/apps/solr/solr-8.7.0/dist/solr-core-8.7.0.jar -Dauto=yes -Dc=gettingstarted -Ddata=files org.apache.solr.util.SimplePostTool /mnt/Vancouver/programming/datasci/solr/test/solr_test9.html
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file solr_test9.html (text/html) to [base]/extract
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update...
Time spent: 0:00:00.755

[victoria@victoria solr-8.7.0]$

...这是结果（Solr管理员界面：http：// localhost：8983 / solr /＃/ gettingstarted / query）

http://localhost:8983/solr/gettingstarted/select?q=*%3A*

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"*:*",
      "_":"1605651674401"}},
  "response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
      {
        "id":"/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html",
        "stream_size":[1428],
        "x_parsed_by":["org.apache.tika.parser.DefaultParser",
          "org.apache.tika.parser.html.HtmlParser"],
        "stream_content_type":["text/html"],
        "date_created":"2019-11-01T00:00:00Z",
        "date_current":["2020-11-17"],
        "resourcename":["/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html"],
        "title":["Solr HTML Indexing Tests"],
        "date_pub":"2020-11-16T00:00:00Z",
        "doc_id":"bt-ic8eeW2U",
        "source_url":"/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html",
        "dc_title":["Solr HTML Indexing Tests"],
        "content_encoding":["UTF-8"],
        "content_type":["application/xhtml+xml; charset=UTF-8"],
        "content":[" en-us stream_size 1428 X-Parsed-By org.apache.tika.parser.DefaultParser X-Parsed-By org.apache.tika.parser.html.HtmlParser stream_content_type text/html date_created 2019-11-01 resourceName /mnt/Vancouver/programming/datasci/solr/test/solr_test9.html date_pub 2020-11-16 doc_id bt-ic8eeW2U source_url /mnt/Vancouver/programming/datasci/solr/test/solr_test9.html dc:title Solr HTML Indexing Tests Content-Encoding UTF-8 Content-Language en-us Content-Type application/xhtml+xml; charset=UTF-8 Solr HTML Indexing Tests Lorem ipsum dolor sit amet, consectetur adipiscing elit. "],
        "div":[" div1 This text is located in div element 1. div2 This text is located in div element 2."],
        "p":[" I like apples. I also like bananas. Suspendisse efficitur pulvinar elementum. My website is https://buriedtruth.com/ BuriedTruth.com . Nova Scotia is a province on the east coast of Canada. Halifax is the capital of N.S. Halifax is also N.S.'s largest city. Victoria is the capital of B.C. Vancouver is the largest city in B.C., however. Non-terminated sentence (missing period) Current date: 2020-11-17"],
        "h1":[" Apples Nova Scotia British Columbia"],
        "h2_t":" Bananas Capital of Nova Scotia Capital of British Columbia",
        "_version_":1683647678197530624}]
  }}

更新-managed-schema >> schema.xml名：

尽管与原始问题无关，但以下内容与我的答案有关（以上），特别是与从Solr的managed-schema切换到经典的（用户管理的）schema.xml相关的特殊程度。它包含在此处以提供完整的解决方案。

首先，添加

<schemaFactory class="ClassicIndexSchemaFactory"/>

到您的solrconfig.xml文件。

然后编辑此：->

<updateRequestProcessorChain
  name="add-unknown-fields-to-the-schema"
  default="${update.autoCreateFields:true}"
  processor="uuid,remove-blank,field-name-mutating,parse-boolean,
             parse-long,parse-double,parse-date,add-schema-fields">

...对此：

<updateRequestProcessorChain
  processor="uuid,remove-blank,field-name-mutating,parse-boolean,
             parse-long,parse-double,parse-date">

即删除

  name="add-unknown-fields-to-the-schema"
  default="${update.autoCreateFields:true}"
  add-schema-fields

将managed-schema重命名为schema.xml，然后重新启动Solr或重新加载内核以完成更改。

为进一步扩展示例（上述），以下是示例<updateRequestProcessorChain />和输出（基于我也提供的HTML代码）。

solrconfig.xml（一部分）：

<updateRequestProcessorChain
  processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date">
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content</str>
    <str name="fieldName">title</str>
    <str name="fieldName">p</str>
    <!-- Case-sensitive (and only one pattern:replacement allowed, so use as many copies): -->
    <!-- of this processor as needed: -->
    <str name="pattern">\s+</str>
    <str name="replacement"> </str>
    <bool name="literalReplacement">true</bool>
  </processor>

  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content</str>
    <str name="fieldName">title</str>
    <str name="fieldName">p</str>
    <!-- Case-sensitive (and only one pattern:replacement allowed, so use as many copies): -->
    <!-- of this processor as needed: -->
    <str name="pattern">rect http</str>
    <str name="replacement">http</str>
    <bool name="literalReplacement">true</bool>
  </processor>

  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content</str>
    <str name="fieldName">title</str>
    <str name="pattern">[sS]olr</str>
    <str name="replacement">APPLE</str>
    <bool name="literalReplacement">true</bool>
  </processor>

  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content</str>
    <str name="fieldName">title</str>
    <str name="pattern">HTML</str>
    <str name="replacement">BANANA</str>
    <bool name="literalReplacement">true</bool>
  </processor>

  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

输出

注意：
- 更改为“标题”（Solr >> APPLE; HTML >> BANANA）
- 从“ p”中的网址中删除“ rect”（在此处讨论：Solr ExtractingRequestHandler extracting "rect" in links）

{
  "responseHeader":{
    "status":0,
    "QTime":32,
    "params":{
      "q":"*:*",
      "_":"1605767164812"}},
  "response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
      {
        "id":"/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html",
        "stream_size":[1628],
        "x_parsed_by":["org.apache.tika.parser.DefaultParser",
          "org.apache.tika.parser.html.HtmlParser"],
        "stream_content_type":["text/html"],
        "date_created":"2020-11-11T21:36:38Z",
        "date_current":["2020-11-17"],
        "resourcename":["/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html"],
        "title":["APPLE BANANA Indexing Tests"],
        "date_pub":"2020-11-16T21:37:18Z",
        "doc_id":"bt-ic8eeW2U",
        "source_url":"/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html",
        "dc_title":["Solr HTML Indexing Tests"],
        "content_encoding":["UTF-8"],
        "content_type":["application/xhtml+xml; charset=UTF-8"],
        "content":[" en-us stream_size 1628 X-Parsed-By org.apache.tika.parser.DefaultParser X-Parsed-By org.apache.tika.parser.html.HtmlParser stream_content_type text/html date_created 2020-11-11T21:36:38Z resourceName /mnt/Vancouver/programming/datasci/APPLE/test/APPLE_test9.html date_pub 2020-11-16T21:37:18Z doc_id bt-ic8eeW2U source_url /mnt/Vancouver/programming/datasci/APPLE/test/APPLE_test9.html dc:title APPLE BANANA Indexing Tests Content-Encoding UTF-8 Content-Language en-us Content-Type application/xhtml+xml; charset=UTF-8 APPLE BANANA Indexing Tests Lorem ipsum dolor sit amet, consectetur adipiscing elit. "],
        "div":[" div1 This text is located in div element 1. div2 This text is located in div element 2. apple This text is located in the \"apple\" (class) div element. banana This text is located in the \"banana\" (class) div element."],
        "p":[" I like apples. I also like bananas. Suspendisse efficitur pulvinar elementum. My website is https://buriedtruth.com/ BuriedTruth.com . Nova Scotia is a province on the east coast of Canada. Halifax is the capital of N.S. Halifax is also N.S.'s largest city. Victoria is the capital of B.C. Vancouver is the largest city in B.C., however. Non-terminated sentence (missing period) Current date: 2020-11-17"],
        "h1":[" Apples Nova Scotia British Columbia"],
        "h2_t":" Bananas Capital of Nova Scotia Capital of British Columbia",
        "_version_":1683814668971278336}]
  }}

如何从HTML文件中提取元标记并在SOLR和TIKA中对它们进行索引

3 个答案: