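I want to import the following part (a) of the HTML file (b) below into WDS.
(a)<meta content="https://qiita.com/xxx/yyy/zzz" property="og:url" />
Using the reference below, I created the WDS configuration file (c), applied it to the environment, and imported the HTML file (b).
https://console.bluemix.net/docs/services/discovery/custom-config.html#keep_content
However, I cannot find part (a) in the "View data schema" results, nor in the results of the "Test your configuration on a document" API (see below).
So I have the following three questions (1)(2)(3).
If the configuration file (c) below is wrong, could you tell me the correct way to write it?
If the configuration file (c) below is correct, where does part (a) appear in the "View data schema" results? (Is it added as part of extracted_metadata? Is the parameter name og:url?)
(b) HTML file:
(c) WDS configuration file: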
{
  "configuration_id": "cbcec10a-f241-4fb5-a86d-15e1e732495d",
  "name": "HTML_conf_0914_2",
  "description": null,
  "created": "2018-08-03T00:08:52.320Z",
  "updated": "2018-08-13T01:42:20.763Z",
  "conversions": {
    "pdf": {
      "heading": {
        "fonts": [
          {
            "level": 1,
            "min_size": 24,
            "max_size": 80
          },
          {
            "level": 2,
            "min_size": 18,
            "max_size": 24,
            "bold": false,
            "italic": false
          },
          {
            "level": 2,
            "min_size": 18,
            "max_size": 24,
            "bold": true
          },
          {
            "level": 3,
            "min_size": 13,
            "max_size": 18,
            "bold": false,
            "italic": false
          },
          {
            "level": 3,
            "min_size": 13,
            "max_size": 18,
            "bold": true
          },
          {
            "level": 4,
            "min_size": 11,
            "max_size": 13,
            "bold": false,
            "italic": false
          }
        ]
      }
    },
    "word": {
      "heading": {
        "fonts": [
          {
            "level": 1,
            "min_size": 24,
            "bold": false,
            "italic": false
          },
          {
            "level": 2,
            "min_size": 18,
            "max_size": 23,
            "bold": true,
            "italic": false
          },
          {
            "level": 3,
            "min_size": 14,
            "max_size": 17,
            "bold": false,
            "italic": false
          },
          {
            "level": 4,
            "min_size": 13,
            "max_size": 13,
            "bold": true,
            "italic": false
          }
        ],
        "styles": [
          {
            "level": 1,
            "names": [
              "pullout heading",
              "pulloutheading",
              "header"
            ]
          },
          {
            "level": 2,
            "names": [
              "subtitle"
            ]
          }
        ]
      }
    },
    "html": {
      "exclude_tags_completely": [
        "script",
        "sup"
      ],
      "exclude_tags_keep_content": [
        "font",
        "em",
        "span"
      ],
      "exclude_content": {
        "xpaths": [
          "//meta[@name]",
          "//meta[@property!='og:url']"
        ]
      },
      "keep_content": {
        "xpaths": []
      },
      "exclude_tag_attributes": [
        "EVENT_ACTIONS"
      ]
    },
    "json_normalizations": [],
    "segment": {
      "enabled": true,
      "selector_tags": [
        "h1",
        "h2",
        "h3"
      ]
    }
  },
  "enrichments": [
    {
      "enrichment": "natural_language_understanding",
      "source_field": "text",
      "destination_field": "enriched_text",
      "options": {
        "features": {
          "keywords": {},
          "entities": {
            "sentiment": true,
            "emotion": false,
            "limit": 50
          },
          "sentiment": {
            "document": true
          },
          "categories": {},
          "relations": {},
          "concepts": {
            "limit": 8
          },
          "semantic_roles": {}
        }
      }
    }
  ],
  "normalizations": []
}
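Answer 0 (score: 0)
As of now, Watson Discovery Service extracts only the following three metadata fields from the HTML <head> section: publicationdate, author, and title.
They should appear in your HTML file as in the following example: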
<html>
  <head>
    <meta name="author" content="Lulu">
    <meta name="publicationdate" content="2015-12-04">
    <title>Title of the document</title>
  </head>
  <body>
    content of the document
  </body>
</html>
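Once these three fields have been extracted during ingestion, you can query them under the extracted_metadata section. The following example shows where these fields appear in a query result: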
{
  "extracted_metadata": {
    "publicationdate": "2015-12-04",
    "title": "Title of the document",
    "author": "Lulu",
    "filename": "example.html",
    "file_type": "html",
    "sha1": "256f2c4161a1b13528513a3d4abdf00b6ac80054"
  },
  "html": "<?xml version='1.0' encoding='UTF-8' standalone='yes'?><html> ...",
  "text": "content of the document"
}
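If you post-process query results in code, the lookup can be sketched as follows. This is only an illustration in Python: the sample document copies the values from the example above (leaving out the elided "html" value), and head_metadata is a hypothetical helper name, not part of any Watson SDK.

```python
import json

# Sample result shaped like the example above; values are copied from it
# and the elided "html" value is left out.
result_json = """
{
  "extracted_metadata": {
    "publicationdate": "2015-12-04",
    "title": "Title of the document",
    "author": "Lulu",
    "filename": "example.html",
    "file_type": "html",
    "sha1": "256f2c4161a1b13528513a3d4abdf00b6ac80054"
  },
  "text": "content of the document"
}
"""


def head_metadata(result):
    """Return the three <head>-derived fields; a missing field maps to None."""
    extracted = result.get("extracted_metadata", {})
    return {name: extracted.get(name)
            for name in ("publicationdate", "author", "title")}


result = json.loads(result_json)
print(head_metadata(result))
```

Using .get keeps the helper from raising when a document was ingested without one of the three fields.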
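Unfortunately, extracting other kinds of metadata fields from the HTML <head> section is not currently supported.
There is another way to ingest custom fields: pass a metadata part in the POST request. With curl, you can do this by running a command of the following form: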
curl -u "${WDS_USERNAME}:${WDS_PASSWORD}" \
  -F "file=@YOUR_FILE.html" \
  -F "metadata=@YOUR_METADATA.json" \
  -X POST "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/documents?version=2018-03-05"
See the metadata parameter at https://www.ibm.com/watson/developercloud/discovery/api/v1/curl.html?curl#add-document
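For the og:url case in the question, one possible way to prepare the metadata file is to pull the value out of the HTML yourself before uploading. The sketch below uses only the Python standard library. The field name og_url and the helper names are assumptions made for illustration, and I have not verified which field names the service accepts.

```python
import json
from html.parser import HTMLParser


class OgUrlParser(HTMLParser):
    """Collects the content of the first <meta property="og:url" ...> tag."""

    def __init__(self):
        super().__init__()
        self.og_url = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if (tag == "meta"
                and attr_map.get("property") == "og:url"
                and self.og_url is None):
            self.og_url = attr_map.get("content")


def build_metadata(html_text, field_name="og_url"):
    """Return a JSON string for the metadata part.

    field_name is a hypothetical custom field name chosen for this sketch.
    """
    parser = OgUrlParser()
    parser.feed(html_text)
    if parser.og_url is None:
        return "{}"
    return json.dumps({field_name: parser.og_url})


# Sample input built from part (a) of the question.
sample = ('<html><head>'
          '<meta content="https://qiita.com/xxx/yyy/zzz" property="og:url" />'
          '</head><body>content</body></html>')
print(build_metadata(sample))
```

Writing the returned string to YOUR_METADATA.json would give you the file that the curl command above passes as metadata=@YOUR_METADATA.json.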