Hello, I would like to create a DataFrame from a list of dicts whose items are lists. When the items are scalars (see test below), the call to pd.DataFrame works as expected:
test = [{'points': 40, 'time': '5:00', 'year': 2010},
{'points': 25, 'time': '6:00', 'month': "february"},
{'points':90, 'time': '9:00', 'month': 'january'},
{'points_h1':20, 'month': 'june'}]
pd.DataFrame(test)
      month  points  points_h1  time    year
0       NaN    40.0        NaN  5:00  2010.0
1  february    25.0        NaN  6:00     NaN
2   january    90.0        NaN  9:00     NaN
3      june     NaN       20.0   NaN     NaN
However, if the items are themselves lists, I get a seemingly unexpected result:
test = [{'points': [40, 50], 'time': ['5:00', '4:00'], 'year': [2010, 2011]},
{'points': [25], 'time': ['6:00'], 'month': ["february"]},
{'points':[90], 'time': ['9:00'], 'month': ['january']},
{'points_h1': [20], 'month': ['june']}]
pd.DataFrame(test)
      month    points  points_h1          time          year
0       NaN  [40, 50]        NaN  [5:00, 4:00]  [2010, 2011]
1  february        25        NaN          6:00           NaN
2   january        90        NaN          9:00           NaN
3      june       NaN       20.0           NaN           NaN
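To work around this, I use pd.concat([pd.DataFrame(z) for z in test]), but this is relatively slow, because a new DataFrame has to be created for every element of the list, which carries a lot of overhead. Am I missing something?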
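For reference, the workaround described above can be written out as a self-contained sketch. The `ignore_index=True` argument and the names `frames` and `df_concat` are additions for illustration, not part of the original call; they assume a fresh 0..n-1 row index is wanted instead of each small frame's own index repeated.

```python
import pandas as pd

test = [{'points': [40, 50], 'time': ['5:00', '4:00'], 'year': [2010, 2011]},
        {'points': [25], 'time': ['6:00'], 'month': ["february"]},
        {'points': [90], 'time': ['9:00'], 'month': ['january']},
        {'points_h1': [20], 'month': ['june']}]

# Build one small DataFrame per dict (this is the costly step), then stack them.
frames = [pd.DataFrame(z) for z in test]

# ignore_index=True is an assumption about the desired result: it renumbers
# the rows instead of keeping each small frame's own 0-based index.
df_concat = pd.concat(frames, ignore_index=True)
print(df_concat)
```

Each list element becomes its own row here, which is the result the plain pd.DataFrame(test) call does not give.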
Answer 0 (score: 0)
Although this may be possible within pandas itself, it seems less difficult to do in plain Python, at least if you have the raw data.
import pandas as pd
test = [{'points': [40, 50], 'time': ['5:00', '4:00'], 'year': [2010, 2011]}, {'points': [25], 'time': ['6:00'], 'month': ["february"]}, {'points':[90], 'time': ['9:00'], 'month': ['january']}, {'points_h1': [20], 'month': ['june']}]
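newtest = []
for t in test:
    # zip(*t.values()) regroups the parallel lists into one tuple per row;
    # each tuple is then paired with the keys again to form one dict per row
    newtest.extend([dict(zip(t.keys(), values)) for values in zip(*t.values())])
df = pd.DataFrame(newtest)
print(df)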
Result:
      month  points  points_h1  time    year
0       NaN    40.0        NaN  5:00  2010.0
1       NaN    50.0        NaN  4:00  2011.0
2  february    25.0        NaN  6:00     NaN
3   january    90.0        NaN  9:00     NaN
4      june     NaN       20.0   NaN     NaN
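One caveat with the zip-based flattening: zip stops at the shortest list, so if the lists inside a single dict ever differ in length, the extra elements are silently dropped. The sample data never has that case, so the following is only a sketch under that assumption. It swaps in itertools.zip_longest, and the helper name `flatten_records` and the `ragged` sample are hypothetical.

```python
from itertools import zip_longest

import pandas as pd


def flatten_records(records):
    """Turn each dict of lists into one dict per row (hypothetical helper)."""
    rows = []
    for record in records:
        keys = list(record.keys())
        # zip_longest pads the shorter lists with None instead of
        # truncating to the shortest one, as plain zip would.
        for values in zip_longest(*record.values()):
            rows.append(dict(zip(keys, values)))
    return rows


# Made-up sample in which 'points' and 'time' have different lengths.
ragged = [{'points': [40, 50], 'time': ['5:00']},
          {'points': [25], 'month': ['february']}]

df_ragged = pd.DataFrame(flatten_records(ragged))
print(df_ragged)
```

With plain zip the first dict would produce a single row and the value 50 would be lost; here it produces two rows, and the missing time is filled with None.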
Answer 1 (score: 0)
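pandas offers several methods that could be combined to get the data into shape, but as you have found, that can become quite heavy. My suggestion is to pad your data before passing it to pandas: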
import numpy as np
import pandas as pd
test = [{'points': [40, 50], 'time': ['5:00', '4:00'], 'year': [2010, 2011]},
        {'month': ['february'], 'points': [25], 'time': ['6:00']},
        {'month': ['january'], 'points': [90], 'time': ['9:00']},
        {'month': ['june'], 'points_h1': [20]}]
def pad_data(data):
    # Set a dictionary with all the keys
    result = {k: [] for i in data for k in i.keys()}
    for i in data:
        # Determine the longest value as padding for NaNs
        pad = max(len(j) for j in i.values())
        # Create padding dictionary and update current
        # (np.nan replaces pd.np.nan, which recent pandas versions no longer provide;
        # note that update() modifies the dicts in `data` in place)
        padded = {key: [np.nan] * pad for key in result if key not in i}
        i.update(padded)
        # Finally extend to result dictionary
        for key, val in i.items():
            result[key].extend(val)
    return result
# Padded data looks like this:
#
# {'month': [nan, nan, 'february', 'january', 'june'],
# 'points': [40, 50, 25, 90, nan],
# 'points_h1': [nan, nan, nan, nan, 20],
# 'time': ['5:00', '4:00', '6:00', '9:00', nan],
# 'year': [2010, 2011, nan, nan, nan]}
df = pd.DataFrame(pad_data(test), dtype='O')
print(df)
#       month points points_h1  time  year
# 0       NaN     40       NaN  5:00  2010
# 1       NaN     50       NaN  4:00  2011
# 2  february     25       NaN  6:00   NaN
# 3   january     90       NaN  9:00   NaN
# 4      june    NaN        20   NaN   NaN