The source data comes from the book Python_for_Data_Analysis, page 2. The movies data looks like this, and can also be found here:
movies.head(n=10)
Out[3]:
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
5         6                         Heat (1995)         Action|Crime|Thriller
6         7                      Sabrina (1995)                Comedy|Romance
7         8                 Tom and Huck (1995)          Adventure|Children's
8         9                 Sudden Death (1995)                        Action
9        10                    GoldenEye (1995)     Action|Adventure|Thriller
The following code fails when I use iloc:
import pandas as pd
import numpy as np
from pandas import Series, DataFrame

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('movies.dat', sep='::',
                       engine='python', header=None, names=mnames)
movies.head(n=10)

genre_iter = (set(x.split('|')) for x in movies['genres'])
genres = sorted(set.union(*genre_iter))
dummies = DataFrame(np.zeros((len(movies), len(genres))), columns=genres)
for i, gen in enumerate(movies['genres']):
    # the following line raises an error:
    # TypeError: '['Animation', "Children's", 'Comedy']' is an invalid key
    dummies.iloc[i, dummies.columns.get_loc(gen.split('|'))] = 1
    # while loc runs successfully
    dummies.loc[dummies.index[[i]], gen.split('|')] = 1
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
I have some idea why Children's might cause an error, but why are Animation and Comedy rejected as well? I tried:
dummies.columns.get_loc('Animation')
and the result is 2.
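A likely explanation, sketched on toy data: `Index.get_loc` looks up one label at a time, so passing the whole list from `gen.split('|')` is rejected as a key no matter which genres it contains. `Index.get_indexer` accepts a list of labels and returns their integer positions, which `iloc` can use. The three-row `movies` frame below is a made-up stand-in for movies.dat, not the real file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movies table (assumed data, not the real movies.dat)
movies = pd.DataFrame({
    'genres': ["Animation|Children's|Comedy", "Comedy|Romance", "Action"],
})

genres = sorted(set.union(*(set(x.split('|')) for x in movies['genres'])))
dummies = pd.DataFrame(np.zeros((len(movies), len(genres))), columns=genres)

for i, gen in enumerate(movies['genres']):
    # get_indexer maps a list of labels to an array of integer positions,
    # whereas get_loc expects a single label
    cols = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, cols] = 1

print(dummies)
```

This keeps the purely positional `iloc` assignment from the question and only changes how the column positions are computed.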
Answer 0 (score: 1)
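This is a very simple (and fast) answer using string matching. It should work fine here, and in any case where the genre names do not overlap. For example, if you had the categories "Crime" and "Crime Thriller", then a Crime Thriller would be classified as both Crime and Crime Thriller, rather than only as Crime Thriller. (But see the note below on how to generalize this.)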
for g in genres:
    movies[g] = movies.genres.str.contains(g).astype(np.int8)
(Note that using np.int8 rather than int saves a lot of memory, because int typically defaults to 64 bits rather than 8.)
The result of movies.head(2):
   movie_id             title                        genres  Action  \
0         1  Toy Story (1995)   Animation|Children's|Comedy       0
1         2    Jumanji (1995)  Adventure|Children's|Fantasy       0

   Adventure  Animation  Children's  Comedy  Crime  Documentary ...  \
0          0          1           1       1      0            0 ...
1          1          0           1       0      0            0 ...

   Fantasy  Film-Noir  Horror  Musical  Mystery  Romance  Sci-Fi  Thriller  \
0        0          0       0        0        0        0       0         0
1        1          0       0        0        0        0       0         0

   War  Western
0    0        0
1    0        0
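The memory remark about np.int8 can be checked with a small sketch. The one-million-row `flags` Series is hypothetical, and np.int64 is spelled out explicitly because what plain `int` maps to can depend on the platform and NumPy version.

```python
import numpy as np
import pandas as pd

# Hypothetical indicator column with one million rows
n = 1_000_000
flags = pd.Series(np.zeros(n, dtype=bool))

as_int64 = flags.astype(np.int64)  # 8 bytes per value
as_int8 = flags.astype(np.int8)    # 1 byte per value

# nbytes reports the size of the underlying data buffer only
print(as_int64.nbytes, as_int8.nbytes)
```

With one indicator column per genre, the eightfold difference applies to every such column.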
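The following generalization of the str.contains loop may be overkill, but it gives you a more general approach that avoids potential double counting of genre categories (e.g. treating Crime and Crime Thriller as the same thing):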
# add '|' delimiter to beginning and end of the genres column
movies['genres2'] = '|' + movies['genres'] + '|'
# search for '|Crime|' rather than 'Crime', which is much safer because
# we don't match a category that merely contains 'Crime'; we
# only match 'Crime' exactly
for g in genres:
    movies[g+'2'] = movies.genres2.str.contains(r'\|' + g + r'\|').astype(np.int8)
(If you are better with regular expressions than I am, you would not need to add the '|' at the beginning and end.)
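One possible way to drop the extra delimiters is to let the pattern accept either a '|' or the start or end of the string around each genre. This is only a sketch: the toy `movies` frame and the overlapping "Crime Thriller" genre are made up for illustration, and `re.escape` is added as a precaution against regex metacharacters in genre names.

```python
import re

import numpy as np
import pandas as pd

# Toy data with a hypothetical overlapping genre name ("Crime Thriller")
movies = pd.DataFrame({
    'genres': ['Crime|Drama', 'Crime Thriller', 'Comedy|Crime Thriller'],
})
genres = ['Comedy', 'Crime', 'Crime Thriller', 'Drama']

for g in genres:
    # Match g only when it is bounded by '|' or by the start/end of the string.
    # Non-capturing groups (?:...) avoid the pandas warning about match groups.
    pattern = r'(?:^|\|)' + re.escape(g) + r'(?:\||$)'
    movies[g] = movies['genres'].str.contains(pattern).astype(np.int8)

print(movies)
```

Here a row whose only genre is "Crime Thriller" is flagged under Crime Thriller but not under Crime.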
Answer 1 (score: 0)
Try:
dummies = movies.genres.str.get_dummies()
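As a rough end-to-end sketch of this suggestion: `Series.str.get_dummies` splits each string on a separator ('|' is the default) and returns one 0/1 column per distinct token. The three rows below are copied from the sample output in the question rather than read from movies.dat, and the `Genre_` prefix follows the question's own code.

```python
import pandas as pd

# First three rows of the movies table shown in the question
movies = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ["Animation|Children's|Comedy",
               "Adventure|Children's|Fantasy",
               'Comedy|Romance'],
})

# Split on '|' and build one indicator column per genre
dummies = movies['genres'].str.get_dummies(sep='|')

# Attach the indicators to the original frame, as in the question
movies_windic = movies.join(dummies.add_prefix('Genre_'))
print(movies_windic.iloc[0])
```

Because the split is on the exact separator, this approach does not suffer from the overlapping-name problem of plain substring matching.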