Question

如何将查询结果提供给具有保留层次结构的列的数据框？像这样的列：

type|postDate|discussionTitle|courses|subjectKeywords|SentiStrength|SentiWordNet|universities|universityKeywords|

我有一个带有大约1,000,000个JSOn文档的弹性搜索。我想将这个数据集用于Python的自然语言处理（NLP）。有人可以帮助我如何将弹性搜索中的数据导入Python并将数据写回到弹性搜索中。非常感谢它，因为我无法对我拥有的数据集执行任何NLP，因为我无法将其与Python连接。这就是elasticsearch的索引结构如下：
我想在层次结构中输入一个新索引，就像＆＃34;大学信息＆＃34;叫＆＃34;处理信息＆＃34; 并且这个新索引将根据我给出的一组关键字索引数据集 - 就像＆＃34; universityKeywords＆＃34;每个jason文件都应该存储标签使用的关键字集。我想将数据集标记为＆＃34;处理信息＆＃34; - 在名为json的json文件中放置4个标签或类别 - 应用程序，优惠，注册，基于json文件标题后的关键字和发布文本的要求

 "educationforumsenriched2": {
          "mappings": {
             "whirlpool": {
                "properties": {
                   "CourseInfo": {
                      "properties": {
                         "courses": {
                            "type": "string",
                            "index": "not_analyzed"
                         },
                         "subjectKeywords": {
                            "type": "string",
                            "index": "not_analyzed"
                         }
                      }
                   },
                   "SentimentInfo": {
                      "properties": {
                         "SentiStrength": {
                            "type": "float"
                         },
                         "SentiWordNet": {
                            "type": "float"
                         }
                      }
                   },
                   "UniversityInfo": {
                      "properties": {
                         "universities": {
                            "type": "string",
                            "index": "not_analyzed"
                         },
                         "universityKeywords": {
                            "type": "string",
                            "index": "not_analyzed"
                         }
                      }
                   },
                   "postDate": {
                      "type": "date",
                      "format": "strict_date_optional_time||epoch_millis"
                   },
                   "postID": {
                      "type": "integer"
                   },
                   "postText": {
                      "type": "string"
                   },
                   "references": {
                      "type": "string"
                   },
                   "threadID": {
                      "type": "integer"
                   },
                   "threadTitle": {
                      "type": "string"
                   }
                }
             },
             "atarnotes": {
                "properties": {
                   "CourseInfo": {
                      "properties": {
                         "courses": {
                            "type": "string",
                            "index": "not_analyzed"
                         },
                         "subjectKeywords": {
                            "type": "string",
                            "index": "not_analyzed"
                         }
                      }
                   },
                   "SentimentInfo": {
                      "properties": {
                         "SentiStrength": {
                            "type": "float"
                         },
                         "SentiWordNet": {
                            "type": "float"
                         }
                      }
                   },
                   "UniversityInfo": {
                      "properties": {
                         "universities": {
                            "type": "string",
                            "index": "not_analyzed"
                         },
                         "universityKeywords": {
                            "type": "string",
                            "index": "not_analyzed"
                         }
                      }
                   },
                   "discussionTitle": {
                      "type": "string"
                   },
                   "postDate": {
                      "type": "date",
                      "format": "strict_date_optional_time||epoch_millis"
                   },
                   "postID": {
                      "type": "integer"
                   },
                   "postText": {
                      "type": "string"
                   },
                   "query": {
                      "properties": {
                         "match_all": {
                            "type": "object"
                         }
                      }
                   },
                   "threadID": {
                      "type": "integer"
                   },
                   "threadTitle": {
                      "type": "string"
                   }
                }
             }
          }
       }
    }

这是我用来在java中创建进程信息标记的代码 - 我想在Python中做同样的事情

 processMap.put("Applications", new ArrayList<>(Arrays.asList("apply", "applied", "applicant", "applying", "application", "applications")));
        processMap.put("Offers", new ArrayList<>(Arrays.asList("offers", "offer", "offered", "offering")));
        processMap.put("Enrollment", new ArrayList<>(Arrays.asList("enrolling","enroled","enroll", "enrolment", "enrollment","enrol","enrolled")));
        processMap.put("Requirements", new ArrayList<>(Arrays.asList("requirement","requirements", "require")));

Answer 1

使用elasticsearch python client，一旦建立了成功的连接，您只需提供DSL查询和要搜索的索引以检索所需信息，例如，如果您有查询：

GET educationforumsenriched2/_search
{
    "query": {
        "match" : {
            "CourseInfo.subjectKeywords" : "foo"
        }
    }
}

Python中的等价物是：

from elasticsearch import Elasticsearch

es = Elasticsearch({"host": "localhost", "port": 9200}) #many other settings are available if using https and so on

query = {
        "query": {
            "match" : {
                "CourseInfo.subjectKeywords" : "foo"
            }
        }
    }
res = es.search(index="educationforumsenriched2", body=query)

#do some processing

#create new document in ES
es.create(index="educationforumsenriched2", body=new_doc_after_processing)

修改：只是考虑一下，但如果您的处理过于复杂，您还可以考虑构建ingest pipeline

从ElasticSearch-JSON文件中将数据导入Python

1 个答案: