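My pandas DataFrame has a column whose entries are either a single value or a list of values (of varying length). I want to "expand" the rows so that each value in a list becomes a single value in the column, on its own row. An example says it all: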
dfIn = pd.DataFrame({u'name': ['Tom', 'Jim', 'Claus'],
                     u'location': ['Amsterdam', ['Berlin','Paris'], ['Antwerp','Barcelona','Pisa'] ]})
                     location   name
0                   Amsterdam    Tom
1             [Berlin, Paris]    Jim
2  [Antwerp, Barcelona, Pisa]  Claus
I want to turn this into:
dfOut = pd.DataFrame({u'name': ['Tom', 'Jim', 'Jim', 'Claus','Claus','Claus'],
                      u'location': ['Amsterdam', 'Berlin','Paris', 'Antwerp','Barcelona','Pisa']})
    location   name
0  Amsterdam    Tom
1     Berlin    Jim
2      Paris    Jim
3    Antwerp  Claus
4  Barcelona  Claus
5       Pisa  Claus
I first tried apply, but as far as I know it cannot return multiple Series. iterrows seemed to be the trick, but the code below gives me an empty DataFrame...
def duplicator(series):
    if type(series['location']) == list:
        for location in series['location']:
            subSeries = series
            subSeries['location'] = location
            dfOut.append(subSeries)
    else:
        dfOut.append(series)

for index, row in dfIn.iterrows():
    duplicator(row)
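Note on the empty result: a likely cause is that DataFrame.append never modified the frame in place. It returned a new DataFrame, which the loop above discards (and the method was removed altogether in pandas 2.0). In addition, subSeries = series only binds a second name to the same row object; it does not copy it. Below is a minimal sketch of an iterrows loop that does produce the rows, assuming they are collected in a plain list and the frame is built once at the end. The name rows is introduced here for illustration only.

```python
import pandas as pd

dfIn = pd.DataFrame({
    'name': ['Tom', 'Jim', 'Claus'],
    'location': ['Amsterdam', ['Berlin', 'Paris'], ['Antwerp', 'Barcelona', 'Pisa']],
})

rows = []  # hypothetical accumulator, not part of the original code
for _, row in dfIn.iterrows():
    # Treat a bare value as a one-element list so both cases share one path
    locations = row['location'] if isinstance(row['location'], list) else [row['location']]
    for location in locations:
        # Build a fresh dict per output row instead of mutating the input row
        rows.append({'location': location, 'name': row['name']})

# Construct the result once, after the loop
dfOut = pd.DataFrame(rows)
print(dfOut)
```

This is easy to read but loops in Python, so it will likely be slow on large frames.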
Answer 0 (score: 8)
Not much interesting/fancy pandas usage here, but this works:
import numpy as np
dfIn.loc[:, 'location'] = dfIn.location.apply(np.atleast_1d)
all_locations = np.hstack(dfIn.location)
all_names = np.hstack([[n]*len(l) for n, l in dfIn[['name', 'location']].values])
dfOut = pd.DataFrame({'location':all_locations, 'name':all_names})
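It is about 40x faster than the apply/stack/reindex approach. As far as I can tell, that ratio holds at nearly all DataFrame sizes (I did not test how it scales with the size of the lists in each row). If you can guarantee that all location entries are already iterables, you can drop the atleast_1d call, which gives roughly another 20% speedup.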
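For readers who want to run the idea in isolation, here is a self-contained sketch of the same approach. It swaps the list comprehension for np.repeat, which is a substitution made here rather than the code timed above, and expand_locations with its parameter names is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def expand_locations(df, list_col='location', keep_col='name'):
    """Hypothetical helper: one output row per element of df[list_col]."""
    # Wrap scalars so every entry becomes a 1-D array
    values = [np.atleast_1d(v) for v in df[list_col]]
    # Number of output rows contributed by each input row
    lengths = [len(v) for v in values]
    return pd.DataFrame({
        list_col: np.concatenate(values),
        # Repeat each name once per location in its row
        keep_col: np.repeat(df[keep_col].to_numpy(), lengths),
    })

dfIn = pd.DataFrame({
    'name': ['Tom', 'Jim', 'Claus'],
    'location': ['Amsterdam', ['Berlin', 'Paris'], ['Antwerp', 'Barcelona', 'Pisa']],
})
dfOut = expand_locations(dfIn)
print(dfOut)
```

Unlike the snippet above, this version does not overwrite the location column of dfIn.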
Answer 1 (score: 5)
If you return a Series whose index is the list of locations, then dfIn.apply will collate those Series into a table:
import pandas as pd
dfIn = pd.DataFrame({u'name': ['Tom', 'Jim', 'Claus'],
                     u'location': ['Amsterdam', ['Berlin','Paris'],
                                   ['Antwerp','Barcelona','Pisa'] ]})

def expand(row):
    locations = row['location'] if isinstance(row['location'], list) else [row['location']]
    s = pd.Series(row['name'], index=list(set(locations)))
    return s
In [156]: dfIn.apply(expand, axis=1)
Out[156]:
  Amsterdam Antwerp Barcelona Berlin Paris   Pisa
0       Tom     NaN       NaN    NaN   NaN    NaN
1       NaN     NaN       NaN    Jim   Jim    NaN
2       NaN   Claus     Claus    NaN   NaN  Claus
You can then stack this DataFrame to get:
In [157]: dfIn.apply(expand, axis=1).stack()
Out[157]:
0  Amsterdam      Tom
1  Berlin         Jim
   Paris          Jim
2  Antwerp      Claus
   Barcelona    Claus
   Pisa         Claus
dtype: object
This is a Series, whereas you want a DataFrame. A little massaging with reset_index gives the desired result:
dfOut = dfIn.apply(expand, axis=1).stack()
dfOut = dfOut.to_frame().reset_index(level=1, drop=False)
dfOut.columns = ['location', 'name']
dfOut.reset_index(drop=True, inplace=True)
print(dfOut)
yields
    location   name
0  Amsterdam    Tom
1     Berlin    Jim
2      Paris    Jim
3    Antwerp  Claus
4  Barcelona  Claus
5       Pisa  Claus
Answer 2 (score: 0)
import pandas as pd
dfIn = pd.DataFrame({
    u'name': ['Tom', 'Jim', 'Claus'],
    u'location': ['Amsterdam', ['Berlin','Paris'], ['Antwerp','Barcelona','Pisa'] ],
})
print(dfIn.explode('location'))
>>>
    name   location
0    Tom  Amsterdam
1    Jim     Berlin
1    Jim      Paris
2  Claus    Antwerp
2  Claus  Barcelona
2  Claus       Pisa
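DataFrame.explode (available since pandas 0.25) keeps the original row labels, which is why the index in the output repeats as 0, 1, 1, 2, 2, 2. Below is a small sketch, assuming a fresh 0 to 5 index like the dfOut in the question is wanted: chain reset_index(drop=True) after the explode call.

```python
import pandas as pd

dfIn = pd.DataFrame({
    'name': ['Tom', 'Jim', 'Claus'],
    'location': ['Amsterdam', ['Berlin', 'Paris'], ['Antwerp', 'Barcelona', 'Pisa']],
})

# explode() emits one row per list element and repeats the source row's label;
# reset_index(drop=True) then replaces those labels with 0..n-1
dfOut = dfIn.explode('location').reset_index(drop=True)
print(dfOut)
```

Scalar entries such as 'Amsterdam' pass through explode unchanged, so the mixed column needs no preprocessing.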