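I have a dataframe with a column called Actors, where each cell contains a string like "Abigail Breslin, Greg Kinnear, Paul Dano, Alan Arkin". I want to split this string on the comma (",") so that the cell contains a list of the actors, i.e. ["Abigail Breslin", "Greg Kinnear", "Paul Dano", "Alan Arkin"], in order to create dummy variables for each unique actor. I am looking for a solution that actually splits the string into its parts and then sends the corresponding actor names to new columns.

Any help would be greatly appreciated :)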
Current data:

Title (Object) | Actors (Object)                                             | Year (Object)
Pulp Fiction   | Bruce Willis, Amanda Plummer, Laura Lovelace, John Travolta | 1994
Fight Club     | Edward Norton, Brad Pitt, Helena Bonham Carter, Meat Loaf   | 1999

Desired output:

Title (Object) | Bruce Willis | Amanda Plummer | Laura Lovelace | John Travolta | Edward Norton | Year
Pulp Fiction   | 1            | 1              | 1              | 1             | 0             | 1994
Fight Club     | 0            | 0              | 0              | 0             | 1             | 1999
I tried the following:

import pandas as pd

data = 'Imdb_datajson(Cleaned).csv'
df = pd.read_csv(data)

list_of_unique_actors = df.Actors.unique().tolist()
list_of_unique_actors

newlist = []
for actor in list_of_unique_actors:
    actor = actor.split(",")
    newlist.extend(actor)

and received this error:
AttributeError                            Traceback (most recent call last)
<ipython-input-48-ae50a804fe05> in <module>
      5 newlist = []
      6 for word in list_of_unique_actors:
----> 7     word = word.split(",")
      8     newlist.extend(word)
      9 return newlist

AttributeError: 'float' object has no attribute 'split'
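
A note on the error itself: a 'float' with no split method in a column of strings usually means at least one Actors cell is missing, because pandas represents a missing value as the float NaN. The sketch below shows one way the list of unique actors could be built while skipping such cells. The small dataframe is a hypothetical stand-in for the real CSV, and the "Unknown Film" row is invented purely to reproduce a missing cell.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real CSV; the NaN row mimics a missing Actors cell.
df = pd.DataFrame({
    "Title": ["Pulp Fiction", "Fight Club", "Unknown Film"],
    "Actors": [
        "Bruce Willis, Amanda Plummer",
        "Edward Norton, Brad Pitt",
        np.nan,
    ],
})

# Drop missing cells, split on ", ", flatten to one actor per row, then deduplicate.
unique_actors = (
    df["Actors"]
    .dropna()
    .str.split(", ")
    .explode()
    .unique()
    .tolist()
)
print(unique_actors)
```

Splitting on ", " rather than "," also avoids the leading space that would otherwise stay attached to every name after the first.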
Answer 0 (score: 0)
Use pd.get_dummies():
from io import StringIO

import pandas as pd

# sample data
s = """Title (Object)|Actors (Object)|Year (Object)
Pulp Fiction|Bruce Willis, Amanda Plummer, Laura Lovelace, John Travolta|1994
Fight Club|Edward Norton, Brad Pitt, Helena Bonham Carter, Meat Loaf|1999"""

# read the pipe-separated sample into a dataframe
df = pd.read_csv(StringIO(s), sep='|')
# split each string of actors into a list
df['Actors (Object)'] = df['Actors (Object)'].str.split(', ')
# set the title and year as the index
df = df.set_index(['Title (Object)', 'Year (Object)'])
# one row per actor, one dummy column per name, then sum back per film;
# groupby(level=...) replaces sum(level=...), which was removed in pandas 2.0
stacked = df['Actors (Object)'].apply(pd.Series).stack()
dummy_df = pd.get_dummies(stacked).groupby(level=[0, 1], sort=False).sum()
                              Edward Norton  Amanda Plummer  Brad Pitt  \
Title (Object) Year (Object)
Pulp Fiction   1994                       0               1          0
Fight Club     1999                       1               0          1

                              Bruce Willis  Helena Bonham Carter  \
Title (Object) Year (Object)
Pulp Fiction   1994                      1                     0
Fight Club     1999                      0                     1

                              John Travolta  Laura Lovelace  Meat Loaf
Title (Object) Year (Object)
Pulp Fiction   1994                       1               1          0
Fight Club     1999                       0               0          1
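
As a possible alternative, not part of the answer above, Series.str.get_dummies can split and encode in one step, and it treats a missing cell as a row of zeros, which sidesteps the AttributeError from the question. This is a minimal sketch under assumptions: the column names Title, Actors and Year are simplified, and the "Unknown Film" row with year 2000 is invented only to show the missing-value behaviour.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the third row is invented to show how a missing cell is handled.
df = pd.DataFrame({
    "Title": ["Pulp Fiction", "Fight Club", "Unknown Film"],
    "Actors": [
        "Bruce Willis, Amanda Plummer, Laura Lovelace, John Travolta",
        "Edward Norton, Brad Pitt, Helena Bonham Carter, Meat Loaf",
        np.nan,
    ],
    "Year": [1994, 1999, 2000],
})

# One 0/1 column per actor; a missing Actors cell becomes a row of zeros.
dummies = df["Actors"].str.get_dummies(sep=", ")

# Place the indicator columns between Title and Year, as in the desired layout.
result = pd.concat([df[["Title"]], dummies, df[["Year"]]], axis=1)
print(result)
```

The result keeps Title and Year as ordinary columns rather than an index, which is closer to the layout requested in the question.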