使用查询http://abc:9200/abc/_search?q=aid:123
,我可以得到以下结果。我要过滤的第一件事是“帮助”,然后仅捕获所需字段(“ act_timestamp”,“ act_type”,“ mod_path”,“ mod_size”)以及“ procguid”,并将它们放到表中。
在下面提供确切的输入和输出:
输入:
{
"took": 95,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 1,
"hits": [
{
"_index": "xxx",
"_type": "_doc",
"_id": "xxxxxxxxx",
"_score": 1,
"_source": {
"aid": 123,
"activity": [
{
"act_timestamp": 1549437711762,
"act_type": "reg_act",
"reg_action": "writeval",
"reg_action_typeid": 3
},
{
"act_timestamp": 1549437714812,
"act_type": "reg_act",
"reg_action": "writeval",
"reg_action_typeid": 3
},
{
"act_timestamp": 1549437711418,
"act_type": "child_proc",
"child_name": "audiodg.exe"
}
],
"event_timestamp": 1549437712183,
"tid": "demo",
"procguid": "11111",
"doc_id": 1000881655327721
}
},
{
"_index": "xxx",
"_type": "_doc",
"_id": "xxxxxxxxx",
"_score": 1,
"_source": {
"aid": 123,
"activity": [
{
"act_timestamp": 1549437977317,
"act_type": "mod_load",
"mod_path": "path1",
"mod_size": 32768
},
{
"act_timestamp": 1549437977390,
"act_type": "mod_load",
"mod_path": "path2",
"mod_size": 667648
},
{
"act_timestamp": 1549437977399,
"act_type": "reg_act",
"reg_action": "writeval",
"reg_action_typeid": 3
},
{
"act_timestamp": 1549437977407,
"act_type": "child_proc",
"child_name": "conhost.exe"
}
],
"event_timestamp": 1549437977406,
"tid": "demo",
"procguid": "22222",
"doc_id": 1010298962848944
}
},
{
"_index": "xxx",
"_type": "_doc",
"_id": "xxxxxxxxx",
"_score": 1,
"_source": {
"aid": 123,
"activity": [
{
"act_timestamp": 1549437772915,
"act_type": "mod_load",
"mod_path": "path3",
"mod_size": 409600
},
{
"act_timestamp": 1549437772940,
"act_type": "mod_load",
"mod_path": "path4",
"mod_size": 937984
},
{
"act_timestamp": 1549437953405,
"act_type": "child_proc",
"child_name": "conhost.exe"
}
],
"event_timestamp": 1549437953405,
"tid": "demo",
"procguid": "22222",
"doc_id": 1007612603810098
}
},
{
"_index": "xxx",
"_type": "_doc",
"_id": "xxxxxxxxx",
"_score": 1,
"_source": {
"aid": 123,
"activity": [
{
"act_timestamp": 1549437848842,
"act_type": "mod_load",
"mod_path": "path5",
"mod_size": 1679360
},
{
"act_timestamp": 1549437848844,
"act_type": "mod_load",
"mod_path": "path6",
"mod_size": 2121728
},
{
"act_timestamp": 1549437848864,
"act_type": "mod_load",
"mod_path": "path7",
"mod_size": 266240
},
{
"act_timestamp": 1549437849590,
"act_type": "reg_act",
"reg_action": "writeval",
"reg_action_typeid": 3
},
{
"act_timestamp": 1549437953418,
"act_type": "child_proc",
"child_name": "wpscloudsvr.exe"
}
],
"event_timestamp": 1549437953417,
"tid": "demo",
"procguid": "33333",
"doc_id": 1007725853753652
}
}]}}
我希望输出如下:
procguid act_type mod_path mod_size
22222 mod_load path1 32768
22222 mod_load path2 667648
22222 mod_load path3 409600
22222 mod_load path4 937984
33333 mod_load path5 1679360
33333 mod_load path6 2121728
33333 mod_load path7 266240
那么我有什么办法可以将大量的Elasticsearch文档转换成这样的数据框?
有任何建议/解决方案代码吗?我刚接触Elasticsearch / json很久了。
编辑:在某些区域中,具有“ procguid”的同时也缺少“活动”字段
答案 0 :(得分:0)
您可以将任务分解为多个部分。首先,您要专注于
df = pd.io.json.json_normalize(d['hits']['hits'])
这具有列
['_id', '_index', '_score', '_source.activity', '_source.aid', '_source.doc_id', '_source.event_timestamp', '_source.procguid', '_source.tid', '_type']
现在,您可以遍历列_source.activity
,将每个单元格转换为一个DataFrame并收集它们:
dfs = []
for idx, cell in df['_source.activity'].iteritems():
df_activity = pd.io.json.json_normalize(cell)
df_activity['procguid'] = df.loc[idx, '_source.procguid']
dfs.append(df_activity)
将所有内容放在一起(希望)可以达到预期的结果:
pd.concat(dfs, sort=False, ignore_index=True)[['procguid', 'act_type', 'mod_path', 'mod_size']]