Parsing Google Custom Search API results for Elasticsearch documents

Time: 2015-06-22 19:03:04

Tags: json python-2.7 elasticsearch google-search-api

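After retrieving results from the Google Custom Search API and writing them to JSON, I want to parse that JSON into valid Elasticsearch documents. Elasticsearch lets you configure a parent-child relationship for nested results. However, that relationship does not seem to be inferred from the data structure itself. I have tried loading the data automatically, but that did not produce the result I wanted.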

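The example input below leaves out things like id or index, because I am trying to focus on getting the data structure right. I have tried adapting graph algorithms such as depth-first search, but I run into problems with the different data structures (dicts, lists, and strings) in the input.

Here is some example input: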

# mock data structure
google = {"content": "foo", 
          "results": {"result_one": {"persona": "phone",
                                     "personb":  "phone",
                                     "personc":  "phone"
                                    },
                      "result_two": ["thing1",
                                     "thing2",
                                     "thing3"
                                    ],
                      "result_three": "none"
                     },
          "query": ["Taylor Swift", "Bob Dole", "Rocketman"]
}

# correctly formatted documents for _source of elasticsearch entry
correct_documents = [
    {"content":"foo"},
    {"results": ["result_one", "result_two", "result_three"]},
    {"result_one": ["persona", "personb", "personc"]},
    {"persona": "phone"},
    {"personb": "phone"},
    {"personc": "phone"},
    {"result_two":["thing1","thing2","thing3"]},
    {"result_three": "none"},
    {"query": ["Taylor Swift", "Bob Dole", "Rocketman"]}
]

Here is my current approach, which is still a work in progress:

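def recursive_dfs(graph, start, path=None):
    """Recursive depth-first search from start; graph maps node -> neighbours."""
    # Build a new list on each call instead of relying on a mutable default.
    path = (path or []) + [start]
    for node in graph[start]:
        if node not in path:
            path = recursive_dfs(graph, node, path)
    return path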

def branching(google):
    """Get branches as a starting point for dfs."""
    for branch, (key, value) in enumerate(google.items()):
        # `value is dict` compares against the type object itself and is
        # never true for a dict instance; isinstance() is the intended check.
        if isinstance(value, dict):
            # recursive_dfs(google, value)
            pass
        else:
            print("branch {}: result {}\n".format(branch, value))

branching(google)
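
For reference, here is a small sketch of the kind of input recursive_dfs() is written for: an adjacency-list graph in which every neighbour is itself a key. The `graph` and `nested` dicts are made up for illustration. The second half shows why the function fails on nested JSON: a string value is iterated character by character, and the first character is then looked up as a key.

```python
def recursive_dfs(graph, start, path=None):
    """Recursive depth-first search from start; graph maps node -> neighbours."""
    path = (path or []) + [start]
    for node in graph[start]:
        if node not in path:
            path = recursive_dfs(graph, node, path)
    return path

# Hypothetical adjacency list: every neighbour also appears as a key.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
order = recursive_dfs(graph, "A")
print(order)  # ['A', 'B', 'D', 'C']

# Nested data is not an adjacency list: "foo" is iterated as 'f', 'o', 'o',
# and graph['f'] does not exist, so a KeyError is raised.
nested = {"content": "foo"}
try:
    recursive_dfs(nested, "content")
    failed_on = None
except KeyError as error:
    failed_on = error.args[0]
print(failed_on)  # f
```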

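As you can see, recursive_dfs() still needs to be modified to handle string and list data structures.

I will keep working on this, but if you have any thoughts, suggestions, or solutions, I would really appreciate them. Thanks for your time.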
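
One possible direction, sketched under assumptions rather than offered as a finished solution: a depth-first walk with an explicit stack that descends only into dicts and treats lists and strings as leaves. The name `flatten_iterative` is hypothetical. The output order relies on dicts preserving insertion order (Python 3.7+); on Python 2.7 the input would need to be a collections.OrderedDict to get the same order.

```python
def flatten_iterative(root):
    """Pre-order DFS over nested dicts using an explicit stack."""
    documents = []
    # Reverse the items so that pop() returns them in their original order.
    stack = list(reversed(list(root.items())))
    while stack:
        key, value = stack.pop()
        if isinstance(value, dict):
            # A dict becomes a document listing its keys, then its children
            # are pushed so they are visited next.
            documents.append({key: list(value)})
            stack.extend(reversed(list(value.items())))
        else:
            # Lists and strings are leaves and are stored unchanged.
            documents.append({key: value})
    return documents

google = {
    "content": "foo",
    "results": {
        "result_one": {"persona": "phone", "personb": "phone", "personc": "phone"},
        "result_two": ["thing1", "thing2", "thing3"],
        "result_three": "none",
    },
    "query": ["Taylor Swift", "Bob Dole", "Rocketman"],
}

documents = flatten_iterative(google)
for document in documents:
    print(document)
```

On an insertion-ordered dict this yields the documents in the same order as correct_documents above.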

1 Answer:

Answer 0 (score: 1)

Here is a possible answer to your question.

def myfunk(inHole, outHole):
    """Append one single-key document per key of inHole to outHole."""
    for key, value in inHole.items():
        if isinstance(value, dict):
            # A dict becomes a document listing its keys, then is descended into.
            outHole.append({key: list(value)})
            myfunk(value, outHole)
        else:
            # Lists and scalar values are stored as they are.
            outHole.append({key: value})
    # list.sort() sorts in place and returns None, so sort first, then return.
    # Sorting by key name also works on Python 3, where dicts cannot be compared.
    outHole.sort(key=lambda doc: list(doc)[0])
    return outHole
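
Since the question is ultimately about parent-child relationships, a variation on the function above could also record which key each document was found under. The sketch below is an assumption about what might be useful, not part of the original answer. `flatten_with_parent` and `sample` are hypothetical names, and how the parent key is then passed to Elasticsearch depends on the version and client in use, so that step is not shown.

```python
def flatten_with_parent(node, parent=None):
    """Yield (parent_key, document) pairs in pre-order.

    parent_key is None for top-level keys.
    """
    for key, value in node.items():
        if isinstance(value, dict):
            yield parent, {key: list(value)}
            # Children of this dict have `key` as their parent.
            for pair in flatten_with_parent(value, parent=key):
                yield pair
        else:
            yield parent, {key: value}

# Trimmed-down, hypothetical version of the sample input.
sample = {
    "content": "foo",
    "results": {
        "result_one": {"persona": "phone"},
        "result_three": "none",
    },
}

pairs = list(flatten_with_parent(sample))
for parent_key, document in pairs:
    print(parent_key, document)
```

Each pair keeps the flat single-key document from the answer and adds the name of the enclosing key, which is the information a parent-child mapping would need.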