Flattening a dictionary

Time: 2018-12-31 00:33:40

Tags: python recursion

I have the following list of dicts (it contains only one dict):

[{
    'RuntimeInMinutes': '21',
    'EpisodeNumber': '21',
    'Genres': ['Animation'],
    'ReleaseDate': '2005-02-05',
    'LanguageOfMetadata': 'EN',
    'Languages': [{
        '_Key': 'CC',
        'Value': ['en']
    }, {
        '_Key': 'Primary',
        'Value': ['EN']
    }],
    'Products': [{
        'URL': 'http://www.hulu.com/watch/217566',
        'Rating': 'TV-Y',
        'Currency': 'USD',
        'SUBSCRIPTION': '0.00',
        '_Key': 'US'
    }, {
        'URL': 'http://www.hulu.com/d/217566',
        'Rating': 'TV-Y',
        'Currency': 'USD',
        'SUBSCRIPTION': '0.00',
        '_Key': 'DE'
    }],
    'ReleaseYear': '2005',
    'TVSeriesID': '5638#TVSeries',
    'Type': 'TVEpisode',
    'Studio': '4K Media'
}]

I would like to flatten the dict as follows:

[{
    'RuntimeInMinutes': '21',
    'EpisodeNumber': '21',
    'Genres': ['Animation'],
    'ReleaseDate': '2005-02-05',
    'LanguageOfMetadata': 'EN',
    'Languages._Key': ['CC', 'Primary'],
    'Languages.Value': ['en', 'EN'],
    'Products.URL': ['http://www.hulu.com/watch/217566', 'http://www.hulu.com/d/217566'],
    'Products.Rating': ['TV-Y', 'TV-Y'],
    'Products.Currency': ['USD', 'USD'],
    'Products.SUBSCRIPTION': ['0.00', '0.00'],
    'Products._Key': ['US', 'DE'],
    'ReleaseYear': '2005',
    'TVSeriesID': '5638#TVSeries',
    'Type': 'TVEpisode',
    'Studio': '4K Media'
}]

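In other words, whenever a dict is encountered, it needs to be converted into a string, a number, or a list.

What I currently have is something like the following, which uses a while loop to iterate over all the sub-paths of the json.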

    while True:

        for key in copy(keys):

            val = get_sub_object_from_path(obj, key)

            if isinstance(val, dict):
                FLAT_OBJ[key.replace('/', '.')] = val
            else:
                keys.extend(os.path.join(key, _nextkey) for _nextkey in val.keys())
            keys.remove(key)

        if (not keys) or (n > 5):
            break
        else:
            n += 1
            continue

3 answers:

Answer 0: (score: 6)

You can use recursion with a generator:

from collections import defaultdict
_d = [{'RuntimeInMinutes': '21', 'EpisodeNumber': '21', 'Genres': ['Animation'], 'ReleaseDate': '2005-02-05', 'LanguageOfMetadata': 'EN', 'Languages': [{'_Key': 'CC', 'Value': ['en']}, {'_Key': 'Primary', 'Value': ['EN']}], 'Products': [{'URL': 'http://www.hulu.com/watch/217566', 'Rating': 'TV-Y', 'Currency': 'USD', 'SUBSCRIPTION': '0.00', '_Key': 'US'}, {'URL': 'http://www.hulu.com/d/217566', 'Rating': 'TV-Y', 'Currency': 'USD', 'SUBSCRIPTION': '0.00', '_Key': 'DE'}], 'ReleaseYear': '2005', 'TVSeriesID': '5638#TVSeries', 'Type': 'TVEpisode', 'Studio': '4K Media'}]

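def get_vals(d, _path=[]):
  # Yield [dotted_path, value] pairs; anything without .items() yields nothing.
  for a, b in getattr(d, 'items', lambda: {})():
    if isinstance(b, list) and all(isinstance(i, (dict, list)) for i in b):
      # A list of containers: descend into each element under the same key.
      for c in b:
        yield from get_vals(c, _path + [a])
    elif isinstance(b, dict):
      yield from get_vals(b, _path + [a])
    else:
      # Scalars and lists of scalars are treated as leaves.
      yield ['.'.join(_path + [a]), b]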

results = [i for b in _d for i in get_vals(b)]
_c = defaultdict(list)
for a, b in results:
  _c[a].append(b)

result = [{a:list(b) if len(b) > 1 else b[0] for a, b in _c.items()}]
import json
print(json.dumps(result, indent=4))

Output:

[
  {
    "RuntimeInMinutes": "21",
    "EpisodeNumber": "21",
    "Genres": [
        "Animation"
    ],
    "ReleaseDate": "2005-02-05",
    "LanguageOfMetadata": "EN",
    "Languages._Key": [
        "CC",
        "Primary"
    ],
    "Languages.Value": [
        [
            "en"
        ],
        [
            "EN"
        ]
    ],
    "Products.URL": [
        "http://www.hulu.com/watch/217566",
        "http://www.hulu.com/d/217566"
    ],
    "Products.Rating": [
        "TV-Y",
        "TV-Y"
    ],
    "Products.Currency": [
        "USD",
        "USD"
    ],
    "Products.SUBSCRIPTION": [
        "0.00",
        "0.00"
    ],
    "Products._Key": [
        "US",
        "DE"
    ],
    "ReleaseYear": "2005",
    "TVSeriesID": "5638#TVSeries",
    "Type": "TVEpisode",
    "Studio": "4K Media"
  }
]

Edit: wrapping the solution in an outer function:

from collections import defaultdict

def flatten_obj(data):
  def get_vals(d, _path=[]):
    # Yield [dotted_path, value] pairs for every leaf under d.
    for a, b in getattr(d, 'items', lambda: {})():
      if isinstance(b, list) and all(isinstance(i, (dict, list)) for i in b):
        for c in b:
          yield from get_vals(c, _path + [a])
      elif isinstance(b, dict):
        yield from get_vals(b, _path + [a])
      else:
        yield ['.'.join(_path + [a]), b]
  results = [i for b in data for i in get_vals(b)]
  # Group the values by path across all records in data.
  _c = defaultdict(list)
  for a, b in results:
    _c[a].append(b)
  # A path seen once keeps its bare value; repeated paths become lists.
  return [{a: list(b) if len(b) > 1 else b[0] for a, b in _c.items()}]
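
To see what a recursive generator of this kind yields before the values are grouped, here is a minimal sketch on a smaller, made-up input. The `walk` name and the sample data are illustrative assumptions and not part of the answer; unlike the code above, it only descends into lists whose elements are all dicts.

```python
def walk(node, path=()):
    """Yield (dotted_path, leaf) pairs; dicts and non-empty lists of dicts are descended into."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from walk(value, path + (key,))
    elif isinstance(node, list) and node and all(isinstance(i, dict) for i in node):
        # Every dict in the list contributes leaves under the same path.
        for item in node:
            yield from walk(item, path)
    else:
        # Scalars and lists of scalars are leaves.
        yield ".".join(path), node


sample = {
    "Type": "TVEpisode",
    "Genres": ["Animation"],
    "Products": [{"_Key": "US"}, {"_Key": "DE"}],
}
pairs = list(walk(sample))
print(pairs)
# [('Type', 'TVEpisode'), ('Genres', ['Animation']), ('Products._Key', 'US'), ('Products._Key', 'DE')]
```

The repeated 'Products._Key' path is what the grouping step later turns into a list.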

Answer 1: (score: 2)

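EDIT

This now appears to be fixed:

@panda-34 correctly points out (+1) that the currently accepted solution loses data, specifically Genres and Languages.Value, when you run the posted code.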

Unfortunately, @panda-34's code modifies Genres:

'Genres': 'Animation',

instead of leaving it alone as in the OP's example:

'Genres': ['Animation'],

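Below is my solution, which approaches the problem differently. None of the keys in the original data has a dictionary as its value, only non-containers or lists (e.g. lists of dictionaries). So a primary list of dictionaries becomes a dictionary of lists (or just a plain dictionary if the list contains only one dictionary). Once that is done, any value that is now a dictionary is expanded back into the original data structure: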

def flatten(container):
    # A non-empty list of dictionaries becomes a dictionary of lists (unless only one dictionary in list)
    if isinstance(container, list) and container and all(isinstance(element, dict) for element in container):
        new_dictionary = {}

        first, *rest = container

        for key, value in first.items():
            new_dictionary[key] = [flatten(value)] if rest else flatten(value)

        for dictionary in rest:
            for key, value in dictionary.items():
                new_dictionary[key].append(value)

        container = new_dictionary

    # Any dictionary value that's a dictionary is expanded into original dictionary
    if isinstance(container, dict):
        new_dictionary = {}

        for key, value in container.items():
            if isinstance(value, dict):
                for sub_key, sub_value in value.items():
                    new_dictionary[key + "." + sub_key] = sub_value
            else:
                new_dictionary[key] = value

        container = new_dictionary

    return container

Output

{
    "RuntimeInMinutes": "21",
    "EpisodeNumber": "21",
    "Genres": [
        "Animation"
    ],
    "ReleaseDate": "2005-02-05",
    "LanguageOfMetadata": "EN",
    "Languages._Key": [
        "CC",
        "Primary"
    ],
    "Languages.Value": [
        [
            "en"
        ],
        [
            "EN"
        ]
    ],
    "Products.URL": [
        "http://www.hulu.com/watch/217566",
        "http://www.hulu.com/d/217566"
    ],
    "Products.Rating": [
        "TV-Y",
        "TV-Y"
    ],
    "Products.Currency": [
        "USD",
        "USD"
    ],
    "Products.SUBSCRIPTION": [
        "0.00",
        "0.00"
    ],
    "Products._Key": [
        "US",
        "DE"
    ],
    "ReleaseYear": "2005",
    "TVSeriesID": "5638#TVSeries",
    "Type": "TVEpisode",
    "Studio": "4K Media"
}

However, this solution introduces a new apparent inconsistency:

'Languages.Value': ['en', 'EN'],

vs.

"Languages.Value": [["en"], ["EN"]],

But I believe this ties back to the Genres inconsistency mentioned earlier, and the OP needs to define a consistent resolution.
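
The first step described in this answer, where a list of dictionaries becomes a dictionary of lists, can be seen in isolation in the sketch below. The `dict_of_lists` helper is a hypothetical name used only for illustration; unlike `flatten` above, it tolerates keys that are missing from some of the dictionaries.

```python
def dict_of_lists(records):
    """Turn [{'a': 1}, {'a': 2}] into {'a': [1, 2]}."""
    merged = {}
    for record in records:
        for key, value in record.items():
            # Create the list on first sight of a key, then keep appending.
            merged.setdefault(key, []).append(value)
    return merged


languages = [
    {"_Key": "CC", "Value": ["en"]},
    {"_Key": "Primary", "Value": ["EN"]},
]
transposed = dict_of_lists(languages)
print(transposed)
# {'_Key': ['CC', 'Primary'], 'Value': [['en'], ['EN']]}
```

Prefixing each resulting key with the parent key ('Languages.') then gives the dotted names shown in the output.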

Answer 2: (score: 0)

Ajax1234's answer loses the values of 'Genres' and 'Languages.Value'. Here is a more generic version:

from collections import defaultdict

def flatten_obj(data):
    def flatten_item(item, keys):
        if isinstance(item, list):
            for v in item:
                yield from flatten_item(v, keys)
        elif isinstance(item, dict):
            for k, v in item.items():
                yield from flatten_item(v, keys+[k])
        else:
            yield '.'.join(keys), item

    res = []
    for item in data:
        res_item = defaultdict(list)
        for k, v in flatten_item(item, []):
            res_item[k].append(v)
        res.append({k: (v if len(v) > 1 else v[0]) for k, v in res_item.items()})
    return res

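P.S. The 'Genres' value is also flattened. Either the OP's requirements are inconsistent, or this is a separate problem that this answer does not address.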
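
One possible reading of the OP's expected output is this: a top-level list of plain values (such as 'Genres') is kept untouched, while values collected from a list of dicts are concatenated when they are themselves lists. The sketch below implements that rule under this assumption only. The `flatten_record` name is hypothetical, and it handles a single level of nesting, which is all the sample data has.

```python
def flatten_record(record):
    """Flatten one level of lists-of-dicts into dotted keys."""
    flat = {}
    for key, value in record.items():
        is_dict_list = (
            isinstance(value, list)
            and value
            and all(isinstance(v, dict) for v in value)
        )
        if not is_dict_list:
            # Scalars and lists of scalars (e.g. 'Genres') are kept as-is.
            flat[key] = value
            continue
        for child in value:
            for sub_key, sub_value in child.items():
                bucket = flat.setdefault(f"{key}.{sub_key}", [])
                if isinstance(sub_value, list):
                    bucket.extend(sub_value)  # ['en'] then ['EN'] -> ['en', 'EN']
                else:
                    bucket.append(sub_value)
    return flat


record = {
    "Genres": ["Animation"],
    "Languages": [
        {"_Key": "CC", "Value": ["en"]},
        {"_Key": "Primary", "Value": ["EN"]},
    ],
    "Type": "TVEpisode",
}
flat = flatten_record(record)
print(flat)
# {'Genres': ['Animation'], 'Languages._Key': ['CC', 'Primary'], 'Languages.Value': ['en', 'EN'], 'Type': 'TVEpisode'}
```

Applied to each dict in the input list, e.g. `[flatten_record(r) for r in data]`, this keeps one flattened dict per input record.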