I am trying to convert a Pandas DataFrame into a new Pandas DataFrame in which each item in a certain column gets its own row. For example:
Before:
   ID             Name        Date   Location
0   0       John, Dave  01/01/1992     Mexico
1   1              Tim  07/07/1997  Australia
2   2       Mike, John  12/24/2012     Zambia
3   3  Bob, Rick, Tony  05/17/2007       Cuba
4   4            Roger  04/05/2000    Iceland
5   5           Carlos  05/24/1995       Guam
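For anyone who wants to reproduce this, the input frame above could be built roughly as follows. This is only a sketch: the values are copied from the table, and keeping the dates as plain strings is my assumption, since the question does not say what dtype they have.

```python
import pandas as pd

# Sample data copied from the table above.
# Assumption: dates are stored as plain strings, not datetimes.
df = pd.DataFrame({
    'ID': [0, 1, 2, 3, 4, 5],
    'Name': ['John, Dave', 'Tim', 'Mike, John',
             'Bob, Rick, Tony', 'Roger', 'Carlos'],
    'Date': ['01/01/1992', '07/07/1997', '12/24/2012',
             '05/17/2007', '04/05/2000', '05/24/1995'],
    'Location': ['Mexico', 'Australia', 'Zambia',
                 'Cuba', 'Iceland', 'Guam'],
})

print(df)
```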
Current solution:
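import pandas as pd

new_df = pd.DataFrame(columns=df.columns)
for index, row in df.iterrows():
    # Turn the current row into a one-row DataFrame.
    new_row = pd.DataFrame(df.loc[index]).transpose()
    target_info = df.loc[index, 'Name']
    if len(target_info.split(',')) > 1:
        for item in target_info.split(','):
            # strip() removes the space left behind after each comma.
            new_row.loc[index, 'Name'] = item.strip()
            # DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
            new_df = pd.concat([new_df, new_row])
    else:
        new_df = pd.concat([new_df, new_row])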
After:
   ID    Name        Date   Location
0   0    John  01/01/1992     Mexico
0   0    Dave  01/01/1992     Mexico
1   1     Tim  07/07/1997  Australia
2   2    Mike  12/24/2012     Zambia
2   2    John  12/24/2012     Zambia
3   3     Bob  05/17/2007       Cuba
3   3    Rick  05/17/2007       Cuba
3   3    Tony  05/17/2007       Cuba
4   4   Roger  04/05/2000    Iceland
5   5  Carlos  05/24/1995       Guam
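Surely there is something more elegant?
Answer 0 (score: 2)
You can split the names out as a Series, drop the existing Name column, and then join the split names back in.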
# Split the 'Name' column as a Series, setting the appropriate name and index.
split_names = df['Name'].str.split(', ', expand=True).stack()
split_names.name = 'Name'
split_names.index = split_names.index.get_level_values(0)
# Drop the existing 'Name' column, and join the split names.
df.drop('Name', axis=1, inplace=True)
df = df.join(split_names)
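The resulting output is the same as in the example, except that the Name column comes last. If you want the original order, you can reorder the columns.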
   ID        Date   Location    Name
0   0  01/01/1992     Mexico    John
0   0  01/01/1992     Mexico    Dave
1   1  07/07/1997  Australia     Tim
2   2  12/24/2012     Zambia    Mike
2   2  12/24/2012     Zambia    John
3   3  05/17/2007       Cuba     Bob
3   3  05/17/2007       Cuba    Rick
3   3  05/17/2007       Cuba    Tony
4   4  04/05/2000    Iceland   Roger
5   5  05/24/1995       Guam  Carlos
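The column reordering mentioned above could look something like the sketch below. It repeats the split/stack/join steps so that it runs on its own. The sample frame, the `original_columns` and `result` names, the non-inplace `drop`, and the extra `dropna()` after `stack()` are my additions rather than part of the answer; the `dropna()` is there only as a guard, on the assumption that some pandas versions keep missing values when stacking.

```python
import pandas as pd

# Sample frame rebuilt from the question (dates kept as strings).
df = pd.DataFrame({
    'ID': [0, 1, 2, 3, 4, 5],
    'Name': ['John, Dave', 'Tim', 'Mike, John',
             'Bob, Rick, Tony', 'Roger', 'Carlos'],
    'Date': ['01/01/1992', '07/07/1997', '12/24/2012',
             '05/17/2007', '04/05/2000', '05/24/1995'],
    'Location': ['Mexico', 'Australia', 'Zambia',
                 'Cuba', 'Iceland', 'Guam'],
})

# Remember the column order before anything is dropped.
original_columns = df.columns.tolist()

# One entry per name, indexed by the row it came from.
# dropna() is a guard in case stack() keeps the missing values.
split_names = df['Name'].str.split(', ', expand=True).stack().dropna()
split_names.name = 'Name'
split_names.index = split_names.index.get_level_values(0)

# Drop the combined column, join the split names, restore the column order.
result = df.drop('Name', axis=1).join(split_names)
result = result[original_columns]

print(result)
```

Indexing with a list of column names returns the columns in exactly that order, so Name ends up back between ID and Date.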
Answer 1 (score: 1)
You can do it this way:
# Raw string, so that \s is passed to the regex engine unchanged.
nm = df.Name.str.split(r',\s*', expand=True)
cols = list(set(df.columns) - set(['Name']))
pd.melt(df[cols].join(nm),
        id_vars=cols,
        value_vars=nm.columns.tolist(),
        value_name='Name') \
  .dropna() \
  .drop(['variable'], axis=1) \
  .sort_values('ID')
Step by step:
In [128]: nm = df.Name.str.split(r',\s*', expand=True)
In [129]: nm
Out[129]:
        0     1     2
0    John  Dave  None
1     Tim  None  None
2    Mike  John  None
3     Bob  Rick  Tony
4   Roger  None  None
5  Carlos  None  None
In [130]: cols=list(set(df.columns) - set(['Name']))
In [131]: cols
Out[131]: ['Date', 'ID', 'Location']
In [133]: pd.melt(df[cols].join(nm),
   .....:         id_vars=cols,
   .....:         value_vars=nm.columns.tolist(),
   .....:         value_name='Name') \
   .....:   .dropna() \
   .....:   .drop(['variable'], axis=1) \
   .....:   .sort_values('ID')
Out[133]:
          Date  ID   Location    Name
0   01/01/1992   0     Mexico    John
6   01/01/1992   0     Mexico    Dave
1   07/07/1997   1  Australia     Tim
2   12/24/2012   2     Zambia    Mike
8   12/24/2012   2     Zambia    John
3   05/17/2007   3       Cuba     Bob
9   05/17/2007   3       Cuba    Rick
15  05/17/2007   3       Cuba    Tony
4   04/05/2000   4    Iceland   Roger
5   05/24/1995   5       Guam  Carlos
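For reference, here is a self-contained sketch of the melt approach above, with a few tweaks that are mine and not from the answer. The remaining columns are taken with a list comprehension, so their order does not depend on set ordering. `dropna` is restricted to the Name column. The sort uses `kind='mergesort'`, because the default quicksort is not guaranteed to keep the names of one ID in their original order. The sample frame and the `result` name are also assumptions.

```python
import pandas as pd

# Sample frame rebuilt from the question (dates kept as strings).
df = pd.DataFrame({
    'ID': [0, 1, 2, 3, 4, 5],
    'Name': ['John, Dave', 'Tim', 'Mike, John',
             'Bob, Rick, Tony', 'Roger', 'Carlos'],
    'Date': ['01/01/1992', '07/07/1997', '12/24/2012',
             '05/17/2007', '04/05/2000', '05/24/1995'],
    'Location': ['Mexico', 'Australia', 'Zambia',
                 'Cuba', 'Iceland', 'Guam'],
})

# Split on a comma followed by optional whitespace; one column per name.
nm = df['Name'].str.split(r',\s*', expand=True)

# Every column except Name, in the original order.
cols = [c for c in df.columns if c != 'Name']

result = (
    pd.melt(df[cols].join(nm),
            id_vars=cols,
            value_vars=nm.columns.tolist(),
            value_name='Name')
    .dropna(subset=['Name'])            # discard the padding cells
    .drop(columns='variable')           # helper column created by melt
    .sort_values('ID', kind='mergesort')  # stable sort keeps name order per ID
)

print(result)
```

The index of the result still carries the row positions from the melted frame; `reset_index(drop=True)` can be chained on if a clean 0..n index is preferred.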