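Problems while transforming a nested json / dict into tuple format?

Time: 2016-10-21 20:51:08

Tags: python json python-3.x parsing pandas

Update

Consider the following dict. How can I extract 4-tuples made up of `lemma`, `original_form`, `tag` and, if and only if it is present, `id`? This is what I have tried so far: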

def gettuples(data, level = 0):
    if isinstance(data, dict):
        if 'semtheme_list' in data:
            print(data['semtheme_list'][0])
            yield data['semtheme_list'][0]

        elif 'analysis_list' in data:
            print(data['analysis_list'][0])
            yield data['analysis_list'][0]

        for val in data.values():
            yield from gettuples(val)

    elif isinstance(data, list):
        for val in data:
            yield from gettuples(val)

With the function above, I get the following (*):

{'lemma': '*', 'tag': 'Z-----------', 'original_form': "Robert Downey Jr has topped Forbes magazine's annual list John Deere"}
{'lemma': 'Robert Downey Jr', 'tag': 'GNUS3S--', 'original_form': 'Robert Downey Jr'}
{'sense_id_list': [{'sense_id': '__12123288058840445720'}], 'lemma': 'Robert Downey Jr', 'tag': 'NPUU-N-', 'original_form': 'Robert Downey Jr'}
{'lemma': 'top', 'tag': 'VI-S3PPA-N-N9', 'original_form': 'has topped'}
{'lemma': 'John Deere', 'tag': 'GN-S3D--', 'original_form': "Forbes magazine's annual list John Deere"}
{'lemma': 'magazine', 'tag': 'GN-S3---', 'original_form': 'Forbes magazine'}
{'sense_id_list': [{'sense_id': 'db0f9829ff'}], 'lemma': 'Forbes', 'tag': 'NP-S-N-', 'original_form': 'Forbes'}
{'type': 'Top>SocialSciences>Economy', 'id': 'ODTHEME_ECONOMY'}

This is very similar to the 4-tuples I am looking for (**):

 [[['Z-----------', "Robert Downey Jr has topped Forbes magazine's annual list John Deere", '*'], ['GNUS3S--', 'Robert Downey Jr', 'Robert Downey Jr'], ['NPUU-N-', 'Robert Downey Jr', 'Robert Downey Jr'], ['VI-S3PPA-N-N9', 'has topped', 'top'], ['GN-S3D--', "Forbes magazine's annual list John Deere", 'John Deere'], ['GN-S3---', 'Forbes magazine', 'magazine'], ['NP-S-N-', 'Forbes', 'Forbes'], ['NC-S-N5', 'magazine', 'magazine'], ['WN-', "'s", "'s"], ['GN-S3---', 'annual list John Deere', 'John Deere'], ['GN-S3---', 'annual list', 'list'], ['AP-N5', 'annual', 'annual'], ['NC-S-N5', 'list', 'list'], ['GN-S3Y--', 'John Deere', 'John Deere'], ['NP-S-N-', 'John Deere', 'John Deere']]]

In addition, each tuple should carry the `id` from `entity_list` when there is one:

 entity_list: [{ form: "John Deere", official_form: "Deere & Company", id: "d5250a54a8", sementity: { class: "instance", fiction: "nonfiction", id: "ODENTITY_INDUSTRIAL_COMPANY", type: "Top>Organization>Company>IndustrialCompany" 
}

Then, when I print:

result = [['lema:',obj['lemma'], 'original_form', obj['original_form'], 'tag:',obj['tag']] for obj in gettuples(json_data)]

print(result)

I get this error:

  File "/Users/user/PycharmProjects/Tests/test.py", line 51, in pos_tag2
    result = [['lema:',obj['lemma'], 'original_form', obj['original_form'], 'tag:',obj['tag']] for obj in gettuples(json_data)]
  File "/Users/user/PycharmProjects/Tests/test.py", line 51, in <listcomp>
    result = [['lema:',obj['lemma'], 'original_form', obj['original_form'], 'tag:',obj['tag']] for obj in gettuples(json_data)]
KeyError: 'lemma'

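So, my question is: how can I get the 4-tuple format from (*)? Or which other approach should I take to extract 4-tuples made up of `lemma`, `original_form`, `tag` and, when it is present, `id`?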
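A note on the traceback: the last item in (*) comes from `semtheme_list` and has only `type` and `id` keys, so `obj['lemma']` raises `KeyError` on it. One possible workaround, sketched below, is to keep only the yielded dicts that carry all three keys. The `yielded` list is a two-item stand-in for the generator output, copied from (*), and `required` is an illustrative name.

```python
# Two items copied from (*): one analysis entry and the semtheme entry.
yielded = [
    {"lemma": "top", "tag": "VI-S3PPA-N-N9", "original_form": "has topped"},
    {"type": "Top>SocialSciences>Economy", "id": "ODTHEME_ECONOMY"},
]

# Keys every row must have before it is turned into a list.
required = ("lemma", "original_form", "tag")

# Skip any dict that is missing one of the required keys.
result = [
    [obj[key] for key in required]
    for obj in yielded
    if all(key in obj for key in required)
]

print(result)
```

This only avoids the exception; it does not attach the `id` yet.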

Update 2

Alternatively, another thing I tried was using json_normalize

In:

import pandas as pd

# pd.json_normalize is the public name since pandas 1.0; older versions import it from pandas.io.json
df = pd.json_normalize(request, ['token_list', ['token_list']])
df

Out:

    affected_by_negation    analysis_list   endp    form    id  inip    quote_level     separation  style   token_list  type
0   no  [{'lemma': '*', 'tag': 'Z-----------', 'origin...   4   Deere   6   0   0   _   {'isTitle': 'no', 'isItalics': 'no', 'isUnderl...   [{'form': 'Deere', 'analysis_list': [{'lemma':...   phrase

Then:

df_clean =  df.drop(df.columns[[0, 2,4, 5, 6, 7, 8, 10]], axis=1)
df_clean
list(df_clean.itertuples(index=False))

However, I am having trouble accessing specific values of the list. Another possible solution might be pandas. Any ideas on how to do this?

1 answer:

Answer 0: (score: 1)

The following code should do what you need. It is not the most elegant approach, but hopefully it is clear.

import yaml
from pprint import pprint

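# The 'rU' mode was removed in Python 3.11, and yaml.load() needs an explicit
# Loader on PyYAML >= 6, so use plain 'r' and safe_load
with open('json_dict.json', 'r') as f:
    data = yaml.safe_load(f)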

results = []
sementity_map = {}

def extract_analysis(l):
    for d in l:
        out = {
            'lemma': d['lemma'],
            'original_form': d['original_form'],
            'tag': d['tag']
        }

        if 'sense_id_list' in d:
            out['id'] = d['sense_id_list'][0]['sense_id']

        results.append( out )

def extract_entities(l):
    for d in l:
        if 'sementity' in d and 'id' in d['sementity']:
            sementity_map[ d['id'] ] = d['sementity']['id']


def find_analysis_and_entities(d):
    if not isinstance(d, dict):  # Nothing to search in non-dict values
        return

    for k, v in d.items():
        if isinstance(v, list):
            if k == 'analysis_list':
                extract_analysis(v)
            elif k == 'entity_list':
                extract_entities(v)
            else:
                # Any other list may hold nested dicts, so search each element
                for do in v:
                    find_analysis_and_entities(do)
        else:
            find_analysis_and_entities(v)

def apply_entities(e, m):
    # Replace each bare sense id with its semantic entity id from the map m,
    # or drop the id when the map has no entry for it
    for d in e:
        if 'id' in d:
            if d['id'] in m:
                d['id'] = m[ d['id'] ]
            else:
                del d['id']

find_analysis_and_entities(data)
apply_entities(results, sementity_map)                

pprint(results)

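For the semantic ids, we keep a separate mapping dict and apply it after the initial search has run. The first search builds the results with bare ids, along with the semantic entity map.

Part of the problem (I think) stems from the fact that, with dicts being unordered, you cannot be sure whether the matching semantic entity id has already been found by the time you reach the place where it must be applied.

Here we apply the id mapping only where a match is found; otherwise we delete that id field. For example, `a0a1a5401f` and `__12123288058840445720` are both missing from the entity_list block, so they are removed from results.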
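The two passes can be seen in isolation on a tiny input. The sketch below is a condensed variant of the code above, not a replacement for it. The `sample` dict is hypothetical: it assumes the `John Deere` analysis carries the sense id `d5250a54a8` shown in the question's `entity_list` excerpt. The names `collect` and `resolve` are illustrative.

```python
# Hypothetical miniature input; the real file is much larger.
sample = {
    "token_list": [
        {
            "analysis_list": [
                {
                    "lemma": "John Deere",
                    "original_form": "John Deere",
                    "tag": "NP-S-N-",
                    # Assumed sense id, taken from the entity_list excerpt.
                    "sense_id_list": [{"sense_id": "d5250a54a8"}],
                },
                {
                    "lemma": "Robert Downey Jr",
                    "original_form": "Robert Downey Jr",
                    "tag": "NPUU-N-",
                    "sense_id_list": [{"sense_id": "__12123288058840445720"}],
                },
            ]
        }
    ],
    "entity_list": [
        {"id": "d5250a54a8", "sementity": {"id": "ODENTITY_INDUSTRIAL_COMPANY"}}
    ],
}


def collect(node, rows, id_map):
    """Pass 1: gather analysis rows (with bare ids) and the entity id map."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "analysis_list" and isinstance(value, list):
                for analysis in value:
                    row = {k: analysis[k] for k in ("lemma", "original_form", "tag")}
                    if "sense_id_list" in analysis:
                        row["id"] = analysis["sense_id_list"][0]["sense_id"]
                    rows.append(row)
            elif key == "entity_list" and isinstance(value, list):
                for entity in value:
                    if "id" in entity and "id" in entity.get("sementity", {}):
                        id_map[entity["id"]] = entity["sementity"]["id"]
            else:
                collect(value, rows, id_map)
    elif isinstance(node, list):
        for item in node:
            collect(item, rows, id_map)


def resolve(rows, id_map):
    """Pass 2: swap bare ids for semantic ids; drop ids with no mapping."""
    for row in rows:
        if "id" in row:
            if row["id"] in id_map:
                row["id"] = id_map[row["id"]]
            else:
                del row["id"]
    return rows


rows, id_map = [], {}
collect(sample, rows, id_map)   # entity_list is only seen after the analyses
resolve(rows, id_map)           # so the mapping is applied afterwards
print(rows)
```

Because the map is applied only after the whole structure has been walked, it does not matter whether `entity_list` is reached before or after the analyses that refer to it.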

The output of the above for the example input file is:

[{'lemma': 'Robert Downey Jr',
  'original_form': 'Robert Downey Jr',
  'tag': 'NPUU-N-'},
 {'lemma': 'Robert Downey Jr',
  'original_form': 'Robert Downey Jr',
  'tag': 'GNUS3S--'},
 {'lemma': 'top', 'original_form': 'has topped', 'tag': 'VI-S3PPA-N-N9'},
 {'id': 'ODENTITY_MAGAZINE',
  'lemma': 'Forbes',
  'original_form': 'Forbes',
  'tag': 'NP-S-N-'},
 {'lemma': 'magazine', 'original_form': 'magazine', 'tag': 'NC-S-N5'},
 {'lemma': 'magazine', 'original_form': 'Forbes magazine', 'tag': 'GN-S3---'},
 {'lemma': "'s", 'original_form': "'s", 'tag': 'WN-'},
 {'lemma': 'annual', 'original_form': 'annual', 'tag': 'AP-N5'},
 {'lemma': 'list', 'original_form': 'list', 'tag': 'NC-S-N5'},
 {'lemma': 'list', 'original_form': 'annual list', 'tag': 'GN-S3---'},
 {'id': 'ODENTITY_INDUSTRIAL_COMPANY',
  'lemma': 'John Deere',
  'original_form': 'John Deere',
  'tag': 'NP-S-N-'},
 {'lemma': 'John Deere', 'original_form': 'John Deere', 'tag': 'GN-S3Y--'},
 {'lemma': 'John Deere',
  'original_form': 'annual list John Deere',
  'tag': 'GN-S3---'},
 {'lemma': 'John Deere',
  'original_form': "Forbes magazine's annual list John Deere",
  'tag': 'GN-S3D--'},
 {'lemma': '*',
  'original_form': "Robert Downey Jr has topped Forbes magazine's annual list "
                   'John Deere',
  'tag': 'Z-----------'}]
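Since the question asks for tuples rather than dicts, the result dicts can be converted in one more step. This is a minimal sketch, assuming a missing `id` should become `None` in the fourth position; `to_tuples` is an illustrative name, and the two sample rows are copied from the output above.

```python
def to_tuples(results):
    """Turn result dicts into (lemma, original_form, tag, id) tuples.

    The id slot is None when the dict has no 'id' key.
    """
    return [
        (r["lemma"], r["original_form"], r["tag"], r.get("id"))
        for r in results
    ]


# Two rows copied from the output above.
results = [
    {"lemma": "top", "original_form": "has topped", "tag": "VI-S3PPA-N-N9"},
    {"id": "ODENTITY_MAGAZINE", "lemma": "Forbes",
     "original_form": "Forbes", "tag": "NP-S-N-"},
]

tuples = to_tuples(results)
print(tuples)
```

If only rows that have an `id` are wanted, filtering with `if "id" in r` inside the comprehension would be one way to do it.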