Creating a DataFrame from a list of dicts, where the values are themselves lists

Time: 2018-02-16 21:04:15

Tags: python pandas

Hello, I want to create a DataFrame from a list of dicts whose values are lists. When the values are scalars (see test below), the call to pd.DataFrame works as expected:

import pandas as pd

test = [{'points': 40, 'time': '5:00', 'year': 2010},
{'points': 25, 'time': '6:00', 'month': "february"}, 
{'points':90, 'time': '9:00', 'month': 'january'}, 
{'points_h1':20, 'month': 'june'}]

pd.DataFrame(test)

      month  points  points_h1  time    year
0       NaN    40.0        NaN  5:00  2010.0
1  february    25.0        NaN  6:00     NaN
2   january    90.0        NaN  9:00     NaN
3      june     NaN       20.0   NaN     NaN

However, if the values are themselves lists, I get a seemingly unexpected result:

test = [{'points': [40, 50], 'time': ['5:00', '4:00'], 'year': [2010, 2011]}, 
{'points': [25], 'time': ['6:00'], 'month': ["february"]}, 
{'points':[90], 'time': ['9:00'], 'month': ['january']}, 
{'points_h1': [20], 'month': ['june']}]

pd.DataFrame(test)

        month      points   points_h1          time            year
   0    NaN      [40, 50]   NaN         [5:00, 4:00]    [2010, 2011]
   1    february       25   NaN                 6:00             NaN
   2    january        90   NaN                 9:00             NaN
   3    june          NaN   20.0                 NaN             NaN

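To work around this, I use pd.concat([pd.DataFrame(z) for z in test]), but this is relatively slow because a new DataFrame has to be created for every element of the list, which carries a lot of overhead. Am I missing something?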
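For reference, the workaround described above can be written out as a small runnable sketch. The ignore_index=True and sort=False arguments are additions that are not in the original question: the first renumbers the rows from 0, the second keeps the columns in order of first appearance. The names frames and combined are illustrative only.

```python
import pandas as pd

test = [
    {'points': [40, 50], 'time': ['5:00', '4:00'], 'year': [2010, 2011]},
    {'points': [25], 'time': ['6:00'], 'month': ['february']},
    {'points': [90], 'time': ['9:00'], 'month': ['january']},
    {'points_h1': [20], 'month': ['june']},
]

# Each dict maps a column name to a list of values, so pd.DataFrame(d)
# builds one small frame per dict (this is the per-element overhead).
frames = [pd.DataFrame(d) for d in test]

# Stack the small frames; columns missing from a frame are filled with NaN.
combined = pd.concat(frames, ignore_index=True, sort=False)
print(combined)
```

This gives one row per list entry, but it still constructs a DataFrame for every dict, which is the cost the question is trying to avoid.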

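2 Answers:

Answer 0: (score: 0)

Although this may be possible within pandas itself, it seems less difficult to do in plain Python, at least if you have the raw data.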

import pandas as pd

test = [{'points': [40, 50], 'time': ['5:00', '4:00'], 'year': [2010, 2011]}, {'points': [25], 'time': ['6:00'], 'month': ["february"]}, {'points':[90], 'time': ['9:00'], 'month': ['january']}, {'points_h1': [20], 'month': ['june']}]

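newtest = []
for t in test:
    # zip(*t.values()) pairs up the i-th entry of every list, giving one tuple per row;
    # note that zip stops at the shortest list if the lengths differ
    newtest.extend(dict(zip(t.keys(), values)) for values in zip(*t.values()))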

df = pd.DataFrame(newtest)
print(df)

Result:

      month  points  points_h1  time    year
0       NaN    40.0        NaN  5:00  2010.0
1       NaN    50.0        NaN  4:00  2011.0
2  february    25.0        NaN  6:00     NaN
3   january    90.0        NaN  9:00     NaN
4      june     NaN       20.0   NaN     NaN
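One thing to watch with the zip-based approach: zip stops at the shortest list, so a dict whose lists have different lengths silently loses rows. If that can happen in your data, a possible variant (a sketch only, not part of the answer above) uses itertools.zip_longest to pad instead of truncate. The helper name explode_records and the ragged sample data are made up for illustration.

```python
from itertools import zip_longest


def explode_records(records, fill=None):
    """Turn dicts of column -> list into a flat list with one dict per row."""
    rows = []
    for record in records:
        keys = list(record)
        # zip_longest pads the shorter lists with `fill` instead of truncating
        for values in zip_longest(*record.values(), fillvalue=fill):
            rows.append(dict(zip(keys, values)))
    return rows


# Hypothetical input where 'time' has fewer entries than 'points'
ragged = [{'points': [40, 50], 'time': ['5:00']}]
rows = explode_records(ragged)
print(rows)
# [{'points': 40, 'time': '5:00'}, {'points': 50, 'time': None}]
```

The resulting list of row dicts can then be passed to pd.DataFrame in the same way as newtest.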

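Answer 1: (score: 0)

With pandas you could combine several methods to get the data into shape, but as you have found, that can get quite heavy. My suggestion is to pad your data before passing it to pandas: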

import pandas as pd

test = [{'points': [40, 50], 'time': ['5:00', '4:00'], 'year': [2010, 2011]},
 {'month': ['february'], 'points': [25], 'time': ['6:00']},
 {'month': ['january'], 'points': [90], 'time': ['9:00']},
 {'month': ['june'], 'points_h1': [20]}]

def pad_data(data):

    # Set a dictionary with all the keys
    result = {k:[] for i in data for k in i.keys()}

    for i in data:

        # Determine the longest value as padding for NaNs
        pad = max([len(j) for j in i.values()])

        # Create padding dictionary and update current
        padded = {key: [float('nan')]*pad for key in result.keys() if key not in i.keys()}
        i.update(padded)

        # Finally extend to result dictionary
        for key, val in i.items():
            result[key].extend(val)

    return result

# Padded data looks like this:
#
# {'month': [nan, nan, 'february', 'january', 'june'],
#  'points': [40, 50, 25, 90, nan],
#  'points_h1': [nan, nan, nan, nan, 20],
#  'time': ['5:00', '4:00', '6:00', '9:00', nan],
#  'year': [2010, 2011, nan, nan, nan]}

df = pd.DataFrame(pad_data(test), dtype='O')
print(df)

#       month points points_h1  time  year
# 0       NaN     40       NaN  5:00  2010
# 1       NaN     50       NaN  4:00  2011
# 2  february     25       NaN  6:00   NaN
# 3   january     90       NaN  9:00   NaN
# 4      june    NaN        20   NaN   NaN
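Note that pad_data calls i.update(padded), so it modifies the dicts in the input list in place. If the original data needs to stay untouched, a non-mutating variant could look roughly like the sketch below. The name pad_data_copy and the fill parameter are assumptions for illustration, and unlike the function above this version also pads lists that are shorter than the longest list in the same dict.

```python
def pad_data_copy(data, fill=float('nan')):
    """Build column -> list without modifying the input dicts."""
    # Collect every key once, in order of first appearance
    result = {}
    for record in data:
        for key in record:
            result.setdefault(key, [])

    for record in data:
        # The longest list in this dict decides how many rows it contributes
        pad = max((len(v) for v in record.values()), default=0)
        for key in result:
            values = list(record.get(key, []))
            # Pad missing or short columns up to `pad` entries
            values += [fill] * (pad - len(values))
            result[key].extend(values)
    return result


# Small sample in the same shape as the question's data
sample = [{'points': [40, 50], 'year': [2010, 2011]},
          {'points': [25], 'month': ['february']}]
padded = pad_data_copy(sample)
print(padded)
```

The returned dict can be passed to pd.DataFrame in the same way as the result of pad_data, while sample keeps its original keys.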