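TypeError when using iloc to create dummy variables

Date: 2017-07-09 13:39:11

Tags: python pandas

The source data comes from the book Python_for_Data_Analysis, page 2. The movie data looks like this, and can also be found here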

movies.head(n=10)
Out[3]: 
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
5         6                         Heat (1995)         Action|Crime|Thriller
6         7                      Sabrina (1995)                Comedy|Romance
7         8                 Tom and Huck (1995)          Adventure|Children's
8         9                 Sudden Death (1995)                        Action
9        10                    GoldenEye (1995)     Action|Adventure|Thriller

The following code runs into a problem when using iloc:

import pandas as pd
import numpy as np
from pandas import Series,DataFrame

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table( 'movies.dat', sep='::', 
                       engine='python', header=None, names=mnames)
movies.head(n=10)
genre_iter = (set(x.split('|')) for x in movies['genres'])
genres = sorted(set.union(*genre_iter))
dummies = DataFrame(np.zeros((len(movies), len(genres))), columns=genres)

for i, gen in enumerate(movies['genres']):
    # the following line raises an error:
    # TypeError: '['Animation', "Children's", 'Comedy']' is an invalid key
    dummies.iloc[i,dummies.columns.get_loc(gen.split('|'))] = 1
    # while loc runs successfully
    dummies.loc[dummies.index[[i]],gen.split('|')] = 1
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]

I have some idea why Children's might cause an error, but why are Animation and Comedy reported as errors too? I tried:

dummies.columns.get_loc('Animation')

and the result is 2.
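One likely explanation, offered as a sketch rather than a definitive diagnosis: `Index.get_loc` expects a single label, which is why `get_loc('Animation')` returns a position while a whole list of labels is rejected as an invalid key (the exact exception type may differ between pandas versions). The apostrophe in Children's is probably not the cause. `Index.get_indexer` accepts a list of labels and returns their integer positions, so it can feed `iloc`. The `sample` Series and the shortened `genres` list below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-ins for movies['genres'] and the genre list
genres = ["Animation", "Children's", "Comedy", "Romance"]
sample = pd.Series(["Animation|Children's|Comedy", "Comedy|Romance"])

dummies = pd.DataFrame(np.zeros((len(sample), len(genres))), columns=genres)

for i, gen in enumerate(sample):
    # get_indexer maps a list of labels to an array of integer positions
    col_positions = dummies.columns.get_indexer(gen.split("|"))
    dummies.iloc[i, col_positions] = 1

print(dummies)
```

Note that `get_indexer` returns -1 for a label it cannot find, so it is worth checking for that value if the genre list might be incomplete.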

2 Answers:

Answer 0: (score: 1)

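Here is a very simple (and fast) answer using string matching. It should work fine here, and in any case where the genre names do not overlap. For example, if you had the categories "Crime" and "Crime Thriller", a Crime Thriller would be classified as both Crime and Crime Thriller rather than only Crime Thriller. (But see the note below on how to generalize this.)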

for g in genres:
    movies[g] = movies.genres.str.contains(g).astype(np.int8)

(Note that using np.int8 instead of int saves a lot of memory, since int is typically 64 bits by default rather than 8.)

Result of movies.head(2):

   movie_id             title                        genres  Action  \
0         1  Toy Story (1995)   Animation|Children's|Comedy       0   
1         2    Jumanji (1995)  Adventure|Children's|Fantasy       0   

   Adventure  Animation  Children's  Comedy  Crime  Documentary   ...     \
0          0          1           1       1      0            0   ...      
1          1          0           1       0      0            0   ...      

   Fantasy  Film-Noir  Horror  Musical  Mystery  Romance  Sci-Fi  Thriller  \
0        0          0       0        0        0        0       0         0   
1        1          0       0        0        0        0       0         0   

   War  Western  
0    0        0  
1    0        0  
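The remark about np.int8 can be checked with a quick sketch. The array length of one million is an arbitrary choice for illustration; the point is only the ratio between the two sizes.

```python
import numpy as np

# One million indicator flags stored as 64-bit integers...
flags64 = np.zeros(1_000_000, dtype=np.int64)
# ...and the same flags stored as 8-bit integers
flags8 = flags64.astype(np.int8)

# nbytes reports the size of the underlying data buffer in bytes
print(flags64.nbytes, flags8.nbytes)
```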

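The following generalization of the code above may be overkill, but it gives you a more general approach that avoids potentially double-counting genre categories (e.g. conflating Crime with Crime Thriller):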

# add '|' delimiter to beginning and end of the genres column
movies['genres2'] = '|' + movies['genres'] + '|'

# search for '|Crime|' rather than 'Crime' which is much safer b/c
# we don't match a category which merely contains 'Crime', we 
# only match 'Crime' exactly
for g in genres:
    movies[g+'2'] = movies.genres2.str.contains(r'\|'+g+r'\|').astype(np.int8)

(If you are better with regular expressions than I am, you would not need to add the '|' at the beginning and end.)
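A possible regex-only variant, as a sketch under assumptions: the pattern requires each genre to be bounded by either the start/end of the string or a '|' delimiter, so no padded helper column is needed. The three-row `movies` frame and the "Crime Thriller" genre are hypothetical, taken from the example in the answer text rather than from the real data set. `re.escape` is used in case a genre name contains regex metacharacters.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical data with overlapping genre names
movies = pd.DataFrame({"genres": ["Crime Thriller|Drama", "Crime|Drama", "Comedy"]})
genres = ["Comedy", "Crime", "Crime Thriller", "Drama"]

for g in genres:
    # Match g only when bounded by start-of-string or '|' on the left
    # and by '|' or end-of-string on the right (non-capturing groups)
    pattern = r"(?:^|\|)" + re.escape(g) + r"(?:\||$)"
    movies[g] = movies["genres"].str.contains(pattern).astype(np.int8)

print(movies)
```

With this pattern the "Crime Thriller|Drama" row is flagged for Crime Thriller and Drama but not for Crime.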

Answer 1: (score: 0)

Try

dummies = movies.genres.str.get_dummies()
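As a small illustration of what this produces, here is a sketch on a three-row frame built by hand from the first rows shown in the question (the real code would read movies.dat instead). `Series.str.get_dummies` splits on '|' by default; the separator is passed explicitly here only for readability. The join with `add_prefix('Genre_')` mirrors the last step of the question's code.

```python
import pandas as pd

# Hand-built sample using the first three rows from the question
movies = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Animation|Children's|Comedy",
        "Adventure|Children's|Fantasy",
        "Comedy|Romance",
    ],
})

# One indicator column per distinct genre, split on the '|' separator
dummies = movies["genres"].str.get_dummies(sep="|")

# Attach the indicator columns to the original frame, as in the question
movies_windic = movies.join(dummies.add_prefix("Genre_"))
print(movies_windic.iloc[0])
```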