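How can I load query results into a data frame whose columns preserve the hierarchy? Columns like these:
type|postDate|discussionTitle|courses|subjectKeywords|SentiStrength|SentiWordNet|universities|universityKeywords|
I have an Elasticsearch instance holding about 1,000,000 JSON documents.
I want to use this dataset for natural language processing (NLP) in Python.
Can someone help me with how to import data from Elasticsearch into Python and write data back to Elasticsearch?
I would really appreciate it, because I cannot perform any NLP on the dataset until I can connect it to Python.
I want to add a new entry to the hierarchy, alongside "UniversityInfo", called "ProcessInfo".
This new entry would index the dataset according to a set of keywords I supply, just like "universityKeywords": each JSON document should store the set of keywords used for its tags.
I want to tag the dataset with "ProcessInfo", that is, assign each JSON document one or more of 4 tags or categories (Applications, Offers, Enrollment, Requirements) based on keywords found in the document's title and post text.
The Elasticsearch index structure is as follows: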
{
"educationforumsenriched2": {
"mappings": {
"whirlpool": {
"properties": {
"CourseInfo": {
"properties": {
"courses": {
"type": "string",
"index": "not_analyzed"
},
"subjectKeywords": {
"type": "string",
"index": "not_analyzed"
}
}
},
"SentimentInfo": {
"properties": {
"SentiStrength": {
"type": "float"
},
"SentiWordNet": {
"type": "float"
}
}
},
"UniversityInfo": {
"properties": {
"universities": {
"type": "string",
"index": "not_analyzed"
},
"universityKeywords": {
"type": "string",
"index": "not_analyzed"
}
}
},
"postDate": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
},
"postID": {
"type": "integer"
},
"postText": {
"type": "string"
},
"references": {
"type": "string"
},
"threadID": {
"type": "integer"
},
"threadTitle": {
"type": "string"
}
}
},
"atarnotes": {
"properties": {
"CourseInfo": {
"properties": {
"courses": {
"type": "string",
"index": "not_analyzed"
},
"subjectKeywords": {
"type": "string",
"index": "not_analyzed"
}
}
},
"SentimentInfo": {
"properties": {
"SentiStrength": {
"type": "float"
},
"SentiWordNet": {
"type": "float"
}
}
},
"UniversityInfo": {
"properties": {
"universities": {
"type": "string",
"index": "not_analyzed"
},
"universityKeywords": {
"type": "string",
"index": "not_analyzed"
}
}
},
"discussionTitle": {
"type": "string"
},
"postDate": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
},
"postID": {
"type": "integer"
},
"postText": {
"type": "string"
},
"query": {
"properties": {
"match_all": {
"type": "object"
}
}
},
"threadID": {
"type": "integer"
},
"threadTitle": {
"type": "string"
}
}
}
}
}
}
This is the code I used to create the ProcessInfo tags in Java; I would like to do the same thing in Python:
processMap.put("Applications", new ArrayList<>(Arrays.asList("apply", "applied", "applicant", "applying", "application", "applications")));
processMap.put("Offers", new ArrayList<>(Arrays.asList("offers", "offer", "offered", "offering")));
processMap.put("Enrollment", new ArrayList<>(Arrays.asList("enrolling","enroled","enroll", "enrolment", "enrollment","enrol","enrolled")));
processMap.put("Requirements", new ArrayList<>(Arrays.asList("requirement","requirements", "require")));
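Answer 0 (score: 1)
Using the elasticsearch Python client, once a connection has been established, you only need to supply the DSL query and the index to search in order to retrieve the information you want. For example, if you have the query: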
GET educationforumsenriched2/_search
{
"query": {
"match" : {
"CourseInfo.subjectKeywords" : "foo"
}
}
}
The equivalent in Python is:
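from elasticsearch import Elasticsearch

# hosts is a list of nodes; many other settings are available if using https and so on
es = Elasticsearch(["http://localhost:9200"])

query = {
    "query": {
        "match": {
            "CourseInfo.subjectKeywords": "foo"
        }
    }
}
res = es.search(index="educationforumsenriched2", body=query)
hits = res["hits"]["hits"]  # each hit carries the stored document under "_source"

# do some processing on hits to build new_doc_after_processing (a dict)

# create new document in ES; index() lets Elasticsearch generate the id,
# whereas create() expects an explicit id.
# doc_type matches the mapping types shown in the question ("whirlpool",
# "atarnotes"); drop it on clusters that no longer use mapping types.
es.index(index="educationforumsenriched2", doc_type="whirlpool", body=new_doc_after_processing)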
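To get from search hits to the flat columns listed in the question, one option is to lift the children of the nested groups (CourseInfo, SentimentInfo, UniversityInfo) to the top level of each row. The following is only a sketch under some assumptions: the "type" column is taken to be the hit's _type, the helper names flatten_hit and fetch_dataframe are made up for illustration, and pandas is assumed to be installed. With around 1,000,000 documents, elasticsearch.helpers.scan is used instead of a single search call so that results are scrolled rather than truncated.

```python
# Columns requested in the question, in order.
COLUMNS = ["type", "postDate", "discussionTitle", "courses", "subjectKeywords",
           "SentiStrength", "SentiWordNet", "universities", "universityKeywords"]


def flatten_hit(hit):
    """Turn one search hit into a flat row restricted to COLUMNS.

    Assumption: the "type" column is the hit's _type ("whirlpool" or "atarnotes").
    """
    row = {"type": hit.get("_type")}
    for key, value in hit.get("_source", {}).items():
        if isinstance(value, dict):
            # Nested group such as CourseInfo: lift its children to the top level.
            row.update(value)
        else:
            row[key] = value
    # Missing fields (e.g. discussionTitle on whirlpool documents) become None.
    return {col: row.get(col) for col in COLUMNS}


def fetch_dataframe(es, index, query=None):
    """Scroll through every matching document and build a DataFrame (hypothetical helper)."""
    # Imported lazily so flatten_hit can be used without these libraries installed.
    import pandas as pd
    from elasticsearch.helpers import scan

    hits = scan(es, index=index, query=query or {"query": {"match_all": {}}})
    return pd.DataFrame([flatten_hit(h) for h in hits], columns=COLUMNS)
```

Calling fetch_dataframe(es, "educationforumsenriched2") with the client created above should then give one row per post. Multi-valued fields such as courses stay as Python lists inside their cells.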
Edit: just a thought, but depending on how complex your processing is, you could also consider building an ingest pipeline.
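For the ProcessInfo tagging itself, the Java map from the question translates fairly directly into a Python dict. The sketch below tags a document by whole-word matches in its title and post text, then builds partial-update actions in the shape accepted by elasticsearch.helpers.bulk. The field name ProcessInfo.processLabels and the function names are assumptions chosen for illustration, as is the choice of threadTitle and postText as the text to scan.

```python
import re

# Same keyword sets as the Java processMap in the question.
PROCESS_KEYWORDS = {
    "Applications": ["apply", "applied", "applicant", "applying", "application", "applications"],
    "Offers": ["offers", "offer", "offered", "offering"],
    "Enrollment": ["enrolling", "enroled", "enroll", "enrolment", "enrollment", "enrol", "enrolled"],
    "Requirements": ["requirement", "requirements", "require"],
}


def tag_process(text):
    """Return the labels whose keywords occur as whole words in text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [label for label, keywords in PROCESS_KEYWORDS.items()
            if words.intersection(keywords)]


def build_update_actions(hits):
    """Yield one partial-update action per hit for elasticsearch.helpers.bulk."""
    for hit in hits:
        source = hit["_source"]
        # Assumption: tags are derived from the thread title and the post text.
        text = " ".join(str(source.get(field) or "") for field in ("threadTitle", "postText"))
        action = {
            "_op_type": "update",
            "_index": hit["_index"],
            "_id": hit["_id"],
            # Hypothetical field name, modelled on UniversityInfo.universityKeywords.
            "doc": {"ProcessInfo": {"processLabels": tag_process(text)}},
        }
        if "_type" in hit:
            # Only clusters that still use mapping types return (and need) _type.
            action["_type"] = hit["_type"]
        yield action
```

Writing the tags back could then look like helpers.bulk(es, build_update_actions(helpers.scan(es, index="educationforumsenriched2"))), which merges the new ProcessInfo object into each existing document instead of creating new ones.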