How do I import a specific part of an HTML file into WDS?

Time: 2018-10-09 13:31:33

Tags: ibm-watson watson-discovery

I want to import the following part (a) of the HTML file (b) below into WDS (Watson Discovery Service).

(a)<meta content="https://qiita.com/xxx/yyy/zzz" property="og:url" />

Using the following reference, I created the WDS configuration file (c) below, applied it to the environment, and imported the HTML file (b).

https://console.bluemix.net/docs/services/discovery/custom-config.html#keep_content

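However, I could not find part (a) in the "View data schema" result, nor in the result of the "Test your configuration on a document" API (see below).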

https://www.ibm.com/watson/developercloud/discovery/api/v1/curl.html?curl#test-your-configuration-on-a-document-api

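So I have the following three questions (1), (2), and (3).

  1. If the configuration file (c) below is wrong, could you tell me the correct way to write it?

  2. If the configuration file (c) below is correct, where does part (a) appear in the "View data schema" result? (Is it added as part of extract_metadata? Is the parameter name og:url?)

  3. Is the attached configuration file (c) correct if part (a) should be imported into every segmented document?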

(b) HTML file:

(c) WDS configuration file:

{
  "configuration_id": "cbcec10a-f241-4fb5-a86d-15e1e732495d",
  "name": "HTML_conf_0914_2",
  "description": null,
  "created": "2018-08-03T00:08:52.320Z",
  "updated": "2018-08-13T01:42:20.763Z",
  "conversions": {
    "pdf": {
      "heading": {
        "fonts": [
          {
            "level": 1,
            "min_size": 24,
            "max_size": 80
          },
          {
            "level": 2,
            "min_size": 18,
            "max_size": 24,
            "bold": false,
            "italic": false
          },
          {
            "level": 2,
            "min_size": 18,
            "max_size": 24,
            "bold": true
          },
          {
            "level": 3,
            "min_size": 13,
            "max_size": 18,
            "bold": false,
            "italic": false
          },
          {
            "level": 3,
            "min_size": 13,
            "max_size": 18,
            "bold": true
          },
          {
            "level": 4,
            "min_size": 11,
            "max_size": 13,
            "bold": false,
            "italic": false
          }
        ]
      }
    },
    "word": {
      "heading": {
        "fonts": [
          {
            "level": 1,
            "min_size": 24,
            "bold": false,
            "italic": false
          },
          {
            "level": 2,
            "min_size": 18,
            "max_size": 23,
            "bold": true,
            "italic": false
          },
          {
            "level": 3,
            "min_size": 14,
            "max_size": 17,
            "bold": false,
            "italic": false
          },
          {
            "level": 4,
            "min_size": 13,
            "max_size": 13,
            "bold": true,
            "italic": false
          }
        ],
        "styles": [
          {
            "level": 1,
            "names": [
              "pullout heading",
              "pulloutheading",
              "header"
            ]
          },
          {
            "level": 2,
            "names": [
              "subtitle"
            ]
          }
        ]
      }
    },
    "html": {
      "exclude_tags_completely": [
        "script",
        "sup"
      ],
      "exclude_tags_keep_content": [
        "font",
        "em",
        "span"
      ],
      "exclude_content": {
        "xpaths": [
          "//meta[@name]",
          "//meta[@property!='og:url']"
        ]
      },
      "keep_content": {
        "xpaths": [
        ]
      },
      "exclude_tag_attributes": [
        "EVENT_ACTIONS"
      ]
    },
    "json_normalizations": [],
    "segment": {
      "enabled": true,
      "selector_tags": [
        "h1",
        "h2",
        "h3"
      ]
    }
  },
  "enrichments": [
    {
      "enrichment": "natural_language_understanding",
      "source_field": "text",
      "destination_field": "enriched_text",
      "options": {
        "features": {
          "keywords": {},
          "entities": {
            "sentiment": true,
            "emotion": false,
            "limit": 50
          },
          "sentiment": {
            "document": true
          },
          "categories": {},
          "relations": {},
          "concepts": {
            "limit": 8
          },
          "semantic_roles": {}
        }
      }
    }
  ],
  "normalizations": []
}

1 answer:

Answer 0: (score: 0)

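So far, Watson Discovery Service extracts only the following three metadata fields from the HTML <head> section: publicationdate, author, and title.

They should appear in your HTML file as in the following example: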

<html>
  <head>
   <meta name="author" content="Lulu">
   <meta name="publicationdate" content="2015-12-04">
   <title>Title of the document</title>
  </head>
 <body>
  content of the document
 </body>
</html>

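Once these three fields have been extracted during ingestion, you can query them under the extracted_metadata section. The following example shows where these fields appear in a query result: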

{
    "extracted_metadata": {
        "publicationdate": "2015-12-04",
        "title": "Title of the document",
        "author": "Lulu",
        "filename": "example.html",
        "file_type": "html",
        "sha1": "256f2c4161a1b13528513a3d4abdf00b6ac80054"
    },
    "html": "<?xml version='1.0' encoding='UTF-8' standalone='yes'?><html> ...",
    "text": "content of the document"
}
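As a rough illustration of reading those fields back, the following Python sketch pulls the three supported fields out of a single query result shaped like the example in this answer. The helper name `head_metadata` and the sample dictionary are assumptions made for illustration; they are not part of any Watson SDK.

```python
# Hypothetical helper: picks the three <head> fields out of one query result.
SUPPORTED_FIELDS = ("publicationdate", "author", "title")


def head_metadata(result):
    """Return the supported <head> fields found under extracted_metadata."""
    extracted = result.get("extracted_metadata", {})
    # Fields missing from the result are reported as None.
    return {name: extracted.get(name) for name in SUPPORTED_FIELDS}


# Sample result mirroring the structure shown in this answer.
sample_result = {
    "extracted_metadata": {
        "publicationdate": "2015-12-04",
        "title": "Title of the document",
        "author": "Lulu",
        "filename": "example.html",
        "file_type": "html",
    },
    "text": "content of the document",
}

print(head_metadata(sample_result))
```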

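Unfortunately, extracting other kinds of metadata fields from the HTML <head> section is not currently supported.

Alternatively, you can ingest custom fields by passing a metadata part in the POST request. With curl, you can do this by running a command of the following form: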

curl -u "${WDS_USERNAME}:${WDS_PASSWORD}" \
-F "file=@YOUR_FILE.html" \
-F "metadata=@YOUR_METADATA.json" \
-X POST "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/documents?version=2018-03-05"

See the metadata parameter at https://www.ibm.com/watson/developercloud/discovery/api/v1/curl.html?curl#add-document
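One possible way to produce that metadata file for the og:url case in the question is to read the value out of the HTML yourself before uploading. The sketch below uses only the Python standard library. The function names and the `og_url` field name are assumptions chosen for illustration; the service does not prescribe them.

```python
import json
from html.parser import HTMLParser


class OgUrlParser(HTMLParser):
    """Records the content of the first <meta property="og:url"> tag."""

    def __init__(self):
        super().__init__()
        self.og_url = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are also routed here.
        if tag != "meta" or self.og_url is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("property") == "og:url":
            self.og_url = attr_map.get("content")


def extract_og_url(html_text):
    """Return the og:url value, or None if the tag is absent."""
    parser = OgUrlParser()
    parser.feed(html_text)
    return parser.og_url


def build_metadata(html_text, field_name="og_url"):
    """Build the JSON text for the metadata form part (field name is hypothetical)."""
    url = extract_og_url(html_text)
    return json.dumps({field_name: url} if url else {})


sample_html = (
    '<html><head>'
    '<meta content="https://qiita.com/xxx/yyy/zzz" property="og:url" />'
    '</head><body>content of the document</body></html>'
)
print(build_metadata(sample_html))
```

The printed JSON could be saved as YOUR_METADATA.json and passed with the `-F "metadata=@YOUR_METADATA.json"` part of the curl command above.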