Question

I am trying to convert a Pandas DataFrame into a new DataFrame in which each item in a particular column gets its own row. For example:

Before:

   ID             Name        Date   Location
0   0       John, Dave  01/01/1992     Mexico
1   1              Tim  07/07/1997  Australia
2   2       Mike, John  12/24/2012     Zambia
3   3  Bob, Rick, Tony  05/17/2007       Cuba
4   4            Roger  04/05/2000    Iceland
5   5           Carlos  05/24/1995       Guam

Current solution:

# Note: this loop relies on DataFrame.append, which was removed in pandas 2.0.
new_df = pd.DataFrame(columns = df.columns)
for index,row in df.iterrows():
    new_row = pd.DataFrame(df.loc[index]).transpose()
    target_info = df.loc[index,'Name']
    if (len(target_info.split(',')) > 1):
        for item in target_info.split(','):
            new_row.loc[index,'Name'] = item
            new_df = new_df.append(new_row)
    else:
        new_df = new_df.append(new_row)

After:

  ID    Name        Date   Location
0  0    John  01/01/1992     Mexico
0  0    Dave  01/01/1992     Mexico
1  1     Tim  07/07/1997  Australia
2  2    Mike  12/24/2012     Zambia
2  2    John  12/24/2012     Zambia
3  3     Bob  05/17/2007       Cuba
3  3    Rick  05/17/2007       Cuba
3  3    Tony  05/17/2007       Cuba
4  4   Roger  04/05/2000    Iceland
5  5  Carlos  05/24/1995       Guam

Surely there is a more elegant way?

Answer 1

You can split the names into a Series, drop the existing Name column, and then join the split names back onto the DataFrame.

# Split the 'Name' column as a Series, setting the appropriate name and index.
split_names = df['Name'].str.split(', ', expand=True).stack()
split_names.name = 'Name'
split_names.index = split_names.index.get_level_values(0)

# Drop the existing 'Name' column, and join the split names.
df.drop('Name', axis=1, inplace=True)
df = df.join(split_names)

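The resulting output matches the example, except that the Name column comes last. If you want the original order, you can reorder the columns.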

   ID        Date   Location    Name
0   0  01/01/1992     Mexico    John
0   0  01/01/1992     Mexico    Dave
1   1  07/07/1997  Australia     Tim
2   2  12/24/2012     Zambia    Mike
2   2  12/24/2012     Zambia    John
3   3  05/17/2007       Cuba     Bob
3   3  05/17/2007       Cuba    Rick
3   3  05/17/2007       Cuba    Tony
4   4  04/05/2000    Iceland   Roger
5   5  05/24/1995       Guam  Carlos
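As a rough sketch of the reordering step mentioned above: the sample data is retyped from the question, and `original_columns` and `result` are names chosen only for this illustration. The explicit `dropna()` after `stack()` is an added safeguard, assuming that some pandas versions may keep the padding values produced by `expand=True`.

```python
import pandas as pd

# Sample data retyped from the question.
df = pd.DataFrame({
    "ID": [0, 1, 2, 3, 4, 5],
    "Name": ["John, Dave", "Tim", "Mike, John", "Bob, Rick, Tony", "Roger", "Carlos"],
    "Date": ["01/01/1992", "07/07/1997", "12/24/2012", "05/17/2007", "04/05/2000", "05/24/1995"],
    "Location": ["Mexico", "Australia", "Zambia", "Cuba", "Iceland", "Guam"],
})

# Remember the column order before 'Name' is dropped (illustrative name).
original_columns = list(df.columns)

# Same split/stack idea as above; dropna() discards the padding from expand=True.
split_names = df["Name"].str.split(", ", expand=True).stack().dropna()
split_names.name = "Name"
split_names.index = split_names.index.get_level_values(0)

# Join without mutating df in place, then restore the original column order.
result = df.drop("Name", axis=1).join(split_names)
result = result[original_columns]
print(result)
```

Selecting with a list of column labels returns the columns in the order of that list, so `Name` moves back to the second position.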

Answer 2

You can do it like this:

nm = df.Name.str.split(r',\s*', expand=True)
cols=list(set(df.columns) - set(['Name']))

pd.melt(df[cols].join(nm),
        id_vars=cols,
        value_vars=nm.columns.tolist(),
        value_name='Name') \
  .dropna() \
  .drop(['variable'], axis=1) \
  .sort_values('ID')
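One caveat with the snippet above is that a `set` does not guarantee any particular order, so the resulting column order can vary. A possible variant, offered only as a sketch, builds `cols` with a list comprehension instead. The sample data is retyped from the question, and the `name_` prefix and the `result` variable are hypothetical choices for this illustration, not part of the original answer.

```python
import pandas as pd

# Sample data retyped from the question.
df = pd.DataFrame({
    "ID": [0, 1, 2, 3, 4, 5],
    "Name": ["John, Dave", "Tim", "Mike, John", "Bob, Rick, Tony", "Roger", "Carlos"],
    "Date": ["01/01/1992", "07/07/1997", "12/24/2012", "05/17/2007", "04/05/2000", "05/24/1995"],
    "Location": ["Mexico", "Australia", "Zambia", "Cuba", "Iceland", "Guam"],
})

# Split on a comma followed by optional whitespace; the prefix (an assumption
# of this sketch) gives the new columns string labels: name_0, name_1, name_2.
nm = df["Name"].str.split(r",\s*", expand=True).add_prefix("name_")

# A list comprehension keeps the original column order, unlike a set.
cols = [c for c in df.columns if c != "Name"]

result = (
    pd.melt(df[cols].join(nm),
            id_vars=cols,
            value_vars=nm.columns.tolist(),
            value_name="Name")
    .dropna(subset=["Name"])                # discard the padding cells
    .drop(columns="variable")               # helper column created by melt
    .sort_values("ID", kind="mergesort")    # stable sort keeps name order per ID
    .reset_index(drop=True)
)
print(result)
```

The rest of this answer walks through the original `set`-based version.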

Step by step:

In [128]: nm = df.Name.str.split(r',\s*', expand=True)

In [129]: nm
Out[129]:
        0     1     2
0    John  Dave  None
1     Tim  None  None
2    Mike  John  None
3     Bob  Rick  Tony
4   Roger  None  None
5  Carlos  None  None

In [130]: cols=list(set(df.columns) - set(['Name']))

In [131]: cols
Out[131]: ['Date', 'ID', 'Location']

In [133]: pd.melt(df[cols].join(nm),
   .....:         id_vars=cols,
   .....:         value_vars=nm.columns.tolist(),
   .....:         value_name='Name') \
   .....:   .dropna() \
   .....:   .drop(['variable'], axis=1) \
   .....:   .sort_values('ID')
Out[133]:
          Date  ID   Location    Name
0   01/01/1992   0     Mexico    John
6   01/01/1992   0     Mexico    Dave
1   07/07/1997   1  Australia     Tim
2   12/24/2012   2     Zambia    Mike
8   12/24/2012   2     Zambia    John
3   05/17/2007   3       Cuba     Bob
9   05/17/2007   3       Cuba    Rick
15  05/17/2007   3       Cuba    Tony
4   04/05/2000   4    Iceland   Roger
5   05/24/1995   5       Guam  Carlos
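For comparison, on pandas 0.25 or later the same reshaping can probably be written more directly with `DataFrame.explode`. This is a different technique from the `melt` pipeline above, shown only as a sketch; the sample data is retyped from the question and `result` is an illustrative name.

```python
import pandas as pd

# Sample data retyped from the question.
df = pd.DataFrame({
    "ID": [0, 1, 2, 3, 4, 5],
    "Name": ["John, Dave", "Tim", "Mike, John", "Bob, Rick, Tony", "Roger", "Carlos"],
    "Date": ["01/01/1992", "07/07/1997", "12/24/2012", "05/17/2007", "04/05/2000", "05/24/1995"],
    "Location": ["Mexico", "Australia", "Zambia", "Cuba", "Iceland", "Guam"],
})

# Turn each 'Name' cell into a list of names, then give every list element
# its own row. The other columns are repeated and the index is preserved.
result = (
    df.assign(Name=df["Name"].str.split(r",\s*"))
      .explode("Name")
)
print(result)
```

This keeps the original column order and the repeated index shown in the question's expected output; call `reset_index(drop=True)` afterwards if a fresh index is preferred.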
