Extracting data from a complex data structure in Python

Time: 2017-03-24 13:07:21

Tags: python list dictionary extraction

I have a data structure like this:
[ {'uid': 'test_subject145', 'class':'?',  'data':[  {'chunk':1, 'writing':[ ['this is exciting'],[ 'you are good' ]... ]}  ]  },
  {'uid': 'test_subject166', 'class':'?',  'data':[  {'chunk':2, 'writing':[ ['he died'],[ 'go ahead' ]... ]}  ] }, ...]

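It is a list containing many dictionaries, each with three key-value pairs: 'uid': 'test_subject145', 'class':'?', 'data':[]. In the last pair, 'data', the value is a list that in turn contains a dictionary with two pairs: 'chunk':1, 'writing':[]. The value of 'writing' is a list that again contains several lists. What I want to extract is the content of all those sentences, such as 'this is exciting', 'you are good', etc., and put them into one simple list. The final form should be list_final = ['this is exciting', 'you are good', 'he died',... ]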

3 answers:

Answer 0 (score: 3)

Given that your original list is named input, simply use a list comprehension:

[elem for dic in input
      for dat in dic.get('data',())
      for writing in dat.get('writing',())
      for elem in writing]

Using .get(..,()) means this still works when a key is missing: in that case we get back the empty tuple (), so there is nothing to iterate over.
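To make the point about missing keys concrete, here is a small sketch. The `records` sample is made up for illustration (it is not from the question): one record has no 'data' key and one chunk has no 'writing' key.

```python
# Hypothetical sample: record 'b' lacks 'data', and the first chunk
# of record 'c' lacks 'writing'.
records = [
    {'uid': 'a', 'class': '?',
     'data': [{'chunk': 1, 'writing': [['this is exciting'], ['you are good']]}]},
    {'uid': 'b', 'class': '?'},
    {'uid': 'c', 'class': '?',
     'data': [{'chunk': 2}, {'chunk': 3, 'writing': [['he died']]}]},
]

# With .get(key, ()) a missing key yields an empty tuple, so the
# corresponding inner loops simply do not run.
sentences = [elem for dic in records
                  for dat in dic.get('data', ())
                  for writing in dat.get('writing', ())
                  for elem in writing]
print(sentences)
# ['this is exciting', 'you are good', 'he died']

# Plain indexing on the same sample raises KeyError instead.
try:
    [elem for dic in records
          for dat in dic['data']
          for writing in dat['writing']
          for elem in writing]
except KeyError as exc:
    missing = exc.args[0]
    print('missing key:', missing)
# missing key: data
```

Whether silently skipping incomplete records is the right behaviour depends on the data; if a missing key signals a real problem, plain indexing and the resulting KeyError may be preferable.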

With your sample input, we get:

>>> input = [ {'uid': 'test_subject145', 'class':'?',  'data':[  {'chunk':1, 'writing':[ ['this is exciting'],[ 'you are good' ]]}  ]  },
...       {'uid': 'test_subject166', 'class':'?',  'data':[  {'chunk':2, 'writing':[ ['he died'],[ 'go ahead' ] ]}  ] }]
>>> 
>>> [elem for dic in input
...       for dat in dic.get('data',())
...       for writing in dat.get('writing',())
...       for elem in writing]
['this is exciting', 'you are good', 'he died', 'go ahead']

Answer 1 (score: 2)

TL;DR

[str for dic in data
     for data_dict in dic['data']
     for writing_sub_list in data_dict['writing']
     for str in writing_sub_list]

Take it slowly and work through one layer at a time. Then refactor the code to make it smaller.

data = [{'class': '?',
         'data': [{'chunk': 1,
                   'writing': [['this is exciting'], ['you are good']]}],
         'uid': 'test_subject145'},
        {'class': '?',
         'data': [{'chunk': 2,
         'writing': [['he died'], ['go ahead']]}],
         'uid': 'test_subject166'}]

for d in data:
    print(d)
# {'class': '?', 'uid': 'test_subject145', 'data': [{'writing': [['this is exciting'], ['you are good']], 'chunk': 1}]}
# {'class': '?', 'uid': 'test_subject166', 'data': [{'writing': [['he died'], ['go ahead']], 'chunk': 2}]}

for d in data:
     data_list = d['data']
     print(data_list)
# [{'writing': [['this is exciting'], ['you are good']], 'chunk': 1}]
# [{'writing': [['he died'], ['go ahead']], 'chunk': 2}]

for d in data:
     data_list = d['data']
     for d2 in data_list:
         print(d2)
# {'writing': [['this is exciting'], ['you are good']], 'chunk': 1}
# {'writing': [['he died'], ['go ahead']], 'chunk': 2}

for d in data:
     data_list = d['data']
     for d2 in data_list:
         writing_list = d2['writing']
         print(writing_list)
# [['this is exciting'], ['you are good']]
# [['he died'], ['go ahead']]

for d in data:
     data_list = d['data']
     for d2 in data_list:
         writing_list = d2['writing']
         for writing_sub_list in writing_list:
             print(writing_sub_list)
# ['this is exciting']
# ['you are good']
# ['he died']
# ['go ahead']

for d in data:
     data_list = d['data']
     for d2 in data_list:
         writing_list = d2['writing']
         for writing_sub_list in writing_list:
             for str in writing_sub_list:
                  print(str)
# this is exciting
# you are good
# he died
# go ahead

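Then convert it to a smaller (but harder to read) form by rewriting the code above like this. It should be easy to see how to get from one to the other: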

strings = [str for d in data for d2 in d['data'] for wsl in d2['writing'] for str in wsl]
# ['this is exciting', 'you are good', 'he died', 'go ahead']

Then express it with better names, as in Willem's answer:

[str for dic in data
     for data_dict in dic['data']
     for writing_sub_list in data_dict['writing']
     for str in writing_sub_list]
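If the extraction is needed in more than one place, the same nested loops could also be wrapped in a generator function. This is only a sketch of that idea; the name `iter_sentences` is hypothetical and does not come from any of the answers.

```python
def iter_sentences(records):
    """Yield every sentence string from the nested structure, in order."""
    for record in records:                        # outer list of dicts
        for chunk in record['data']:              # list under 'data'
            for writing_sub_list in chunk['writing']:
                # Each sub-list holds the sentence strings themselves.
                yield from writing_sub_list


data = [{'class': '?',
         'data': [{'chunk': 1,
                   'writing': [['this is exciting'], ['you are good']]}],
         'uid': 'test_subject145'},
        {'class': '?',
         'data': [{'chunk': 2,
                   'writing': [['he died'], ['go ahead']]}],
         'uid': 'test_subject166'}]

list_final = list(iter_sentences(data))
print(list_final)
# ['this is exciting', 'you are good', 'he died', 'go ahead']
```

The loops are the same as in the step-by-step version; the function just gives them a name and avoids using str as a variable, which would otherwise shadow the built-in type.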

Answer 2 (score: 1)

So I believe the following should work:

;WITH CTE_DIFF AS (
   SELECT [TimeStamp], [State], 
          DATEDIFF ( second , 
                    [TimeStamp] , 
                    LEAD([TimeStamp]) OVER (ORDER BY [TimeStamp])) AS time_diff 
   FROM mytable
), CTE_PERC AS (
   SELECT [TimeStamp], [State], time_diff ,
          SUM(time_diff) OVER (ORDER BY [TimeStamp]) * 1.0 / 
          SUM(time_diff) OVER () * 100 AS perc
   FROM CTE_DIFF
)
SELECT [TimeStamp], [State], 
       COALESCE(LAG(perc) OVER (ORDER BY [TimeStamp]), 0) AS PercentageStart,
       perc AS PercentageEnd
FROM CTE_PERC 

As mentioned above, I think this item helps with understanding: python getting a list of value from list of dict (thanks to McGrady)
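The linked item is not reproduced here, but the basic idea its title describes, pulling one value out of each dictionary in a list, can be sketched as follows. The `records` list is a trimmed-down, assumed version of the question's data, and the variable names are only illustrative.

```python
# Trimmed-down version of the question's structure (the 'data' key is
# omitted here purely to keep the example short).
records = [{'uid': 'test_subject145', 'class': '?'},
           {'uid': 'test_subject166', 'class': '?'}]

# One value per dictionary, collected with a list comprehension.
uids = [record['uid'] for record in records]
print(uids)
# ['test_subject145', 'test_subject166']
```

The nested comprehensions in the other answers are this same pattern, repeated once per level of nesting.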
