Pandas: Create dummy variables from values within a column

Time: 2020-03-27 16:04:35

Tags: python pandas dataframe split dummy-variable

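I have a dataframe with a column called Actors, where each cell contains a string such as "Abigail Breslin, Greg Kinnear, Paul Dano, Alan Arkin". I want to split this string on (",") so that the cell contains a list of the individual actors, i.e. ["Abigail Breslin", "Greg Kinnear", "Paul Dano", "Alan Arkin"], in order to create dummy variables for each unique actor. I have also come across a solution that splits the string into parts and sends each actor name into a new column.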

Any help would be greatly appreciated :)

My dataframe (df) looks like this

Title (Object)| Actors (Object)                                              |  Year (Object)    
Pulp Fiction  | Bruce Willis, Amanda Plummer, Laura Lovelace, John Travolta  |  1994
Fight Club    | Edward Norton, Brad Pitt, Helena Bonham Carter, Meat Loaf    |  1999

My goal is for the dataframe to look like this

Title (Object)| Bruce Willis | Amanda Plummer | Laura Lovelace | John Travolta | Edward Norton | Year   
Pulp Fiction  |       1      |        1       |       1        |      1        |       0       | 1994
Fight Club    |       0      |        0       |       0        |      0        |       1       | 1999

I tried

import pandas as pd

data = 'Imdb_datajson(Cleaned).csv'

df = pd.read_csv(data)
list_of_unique_actors = df.Actors.unique().tolist()
list_of_unique_actors

newlist = []
for actor in list_of_unique_actors:
    actor = actor.split(",")
    newlist.extend(actor)

and got this error

    AttributeError                            Traceback (most recent call last)
<ipython-input-48-ae50a804fe05> in <module>
      5 newlist = []
      6 for word in list_of_unique_actors:
----> 7     word = word.split(",")
      8     newlist.extend(word)
      9 return newlist

AttributeError: 'float' object has no attribute 'split'

1 answer:

Answer 0: (score: 0)

Use pd.get_dummies()

from io import StringIO

import pandas as pd

# sample data
s = """Title (Object)|Actors (Object)|Year (Object)
Pulp Fiction|Bruce Willis, Amanda Plummer, Laura Lovelace, John Travolta|1994
Fight Club|Edward Norton, Brad Pitt, Helena Bonham Carter, Meat Loaf|1999"""
# read csv
df = pd.read_csv(StringIO(s), sep='|')

# split your string of actors into a list
df['Actors (Object)'] = df['Actors (Object)'].str.split(', ')
# set the title and year as index
df = df.set_index(['Title (Object)', 'Year (Object)'])
# expand each list into one row per actor, then one-hot encode
stacked = df['Actors (Object)'].apply(pd.Series).stack()
# collapse back to one row per (title, year);
# groupby(level=...) replaces sum(level=...), which was removed in pandas 2.0
dummy_df = pd.get_dummies(stacked).groupby(level=[0, 1], sort=False).sum()


                               Edward Norton  Amanda Plummer  Brad Pitt  \
Title (Object) Year (Object)                                              
Pulp Fiction   1994                        0               1          0   
Fight Club     1999                        1               0          1   

                              Bruce Willis  Helena Bonham Carter  \
Title (Object) Year (Object)                                       
Pulp Fiction   1994                      1                     0   
Fight Club     1999                      0                     1   

                              John Travolta  Laura Lovelace  Meat Loaf  
Title (Object) Year (Object)                                            
Pulp Fiction   1994                       1               1          0  
Fight Club     1999                       0               0          1
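
As for the AttributeError in the question: a 'float' object in a string column usually means some rows have a missing Actors value, which pandas stores as NaN (a float). A minimal sketch of an alternative, assuming that is the cause, is Series.str.get_dummies, which splits on a separator and leaves all-zero rows for missing values. The sample frame below, including the "Unknown Film" row, is made up for illustration and is not from the question's CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the third row has a missing Actors value,
# which pandas represents as a float NaN.
df = pd.DataFrame({
    "Title": ["Pulp Fiction", "Fight Club", "Unknown Film"],
    "Actors": [
        "Bruce Willis, Amanda Plummer, Laura Lovelace, John Travolta",
        "Edward Norton, Brad Pitt, Helena Bonham Carter, Meat Loaf",
        np.nan,
    ],
    "Year": [1994, 1999, 2000],
})

# Split each string on ", " and build one 0/1 column per unique actor.
# Rows where Actors is NaN end up with 0 in every actor column.
dummies = df["Actors"].str.get_dummies(sep=", ")

# Put the indicator columns next to the title and year.
result = pd.concat([df[["Title", "Year"]], dummies], axis=1)
print(result)
```

If the rows without actors are not needed, calling df.dropna(subset=["Actors"]) first would drop them instead of keeping all-zero rows.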