Question

The source data comes from the book Python_for_Data_Analysis, page 2. The movie data looks like this, and can also be found here:

movies.head(n=10)
Out[3]: 
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
5         6                         Heat (1995)         Action|Crime|Thriller
6         7                      Sabrina (1995)                Comedy|Romance
7         8                 Tom and Huck (1995)          Adventure|Children's
8         9                 Sudden Death (1995)                        Action
9        10                    GoldenEye (1995)     Action|Adventure|Thriller

The following code runs into a problem when using iloc:

import pandas as pd
import numpy as np
from pandas import Series,DataFrame

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table( 'movies.dat', sep='::', 
                       engine='python', header=None, names=mnames)
movies.head(n=10)
genre_iter = (set(x.split('|')) for x in movies['genres'])
genres = sorted(set.union(*genre_iter))
dummies = DataFrame(np.zeros((len(movies), len(genres))), columns=genres)

for i, gen in enumerate(movies['genres']):
# the following code report error
# TypeError: '['Animation', "Children's", 'Comedy']' is an invalid key
    dummies.iloc[i,dummies.columns.get_loc(gen.split('|'))] = 1
# while loc can run successfully
    dummies.loc[dummies.index[[i]],gen.split('|')] = 1
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]

I have some idea why Children's might cause an error, but why do Animation and Comedy fail too? I tried:

dummies.columns.get_loc('Animation')

and the result is 2.

Answer 1

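This is a very simple (and fast) answer using string matching. It should work fine here, and in any case where genre names do not overlap. For example, if you had the categories "Crime" and "Crime Thriller", then a Crime Thriller would be categorized under both Crime and Crime Thriller rather than just Crime Thriller. (But see the note below on how to generalize this.)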

for g in genres:
    movies[g] = movies.genres.str.contains(g).astype(np.int8)

(Note that using np.int8 rather than int saves a lot of memory, since int defaults to 64 bits rather than 8.)
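To make the memory note concrete, here is a minimal sketch that compares the same indicator column stored as 64-bit and 8-bit integers. The column is made-up random data, not the movies file, and np.int64 is spelled out explicitly because the default width of int can vary by platform.

```python
import numpy as np
import pandas as pd

# Hypothetical indicator column: one million zeros and ones.
flags = pd.Series(np.random.default_rng(0).integers(0, 2, size=1_000_000))

as_int64 = flags.astype(np.int64)
as_int8 = flags.astype(np.int8)

# memory_usage(index=False) counts only the bytes used by the values.
bytes_int64 = as_int64.memory_usage(index=False)
bytes_int8 = as_int8.memory_usage(index=False)
print(bytes_int64, bytes_int8)
```

With one indicator column per genre, the eightfold difference applies to every one of those columns.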

Result of movies.head(2):

   movie_id             title                        genres  Action  \
0         1  Toy Story (1995)   Animation|Children's|Comedy       0   
1         2    Jumanji (1995)  Adventure|Children's|Fantasy       0   

   Adventure  Animation  Children's  Comedy  Crime  Documentary   ...     \
0          0          1           1       1      0            0   ...      
1          1          0           1       0      0            0   ...      

   Fantasy  Film-Noir  Horror  Musical  Mystery  Romance  Sci-Fi  Thriller  \
0        0          0       0        0        0        0       0         0   
1        1          0       0        0        0        0       0         0   

   War  Western  
0    0        0  
1    0        0

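The following generalization of the code above may be overkill, but it gives you a more general approach that avoids potential double counting of genre categories (e.g. treating Crime and Crime Thriller as the same):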

# add '|' delimiter to beginning and end of the genres column
movies['genres2'] = '|' + movies['genres'] + '|'

# search for '|Crime|' rather than 'Crime' which is much safer b/c
# we don't match a category which merely contains 'Crime', we 
# only match 'Crime' exactly
for g in genres:
    movies[g+'2'] = movies.genres2.str.contains(r'\|'+g+r'\|').astype(np.int8)

(If you are better with regular expressions than I am, you would not need to add the '|' at the beginning and end.)
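One possible way to skip the extra delimiters is a pattern that anchors the genre to either the start or end of the string or a '|'. This is only a sketch: the helper name genre_pattern, the sample frame, and the "Crime Thriller" genre are hypothetical and used purely to show the overlap case.

```python
import re

import numpy as np
import pandas as pd

# Made-up sample; "Crime Thriller" is a hypothetical genre used to show overlap.
sample = pd.DataFrame({"genres": ["Crime|Drama", "Crime Thriller|Drama", "Comedy"]})


def genre_pattern(genre):
    # Match the genre only when it is bounded by the string start/end or a '|'.
    # re.escape protects genre names that contain regex metacharacters.
    return r"(?:^|\|)" + re.escape(genre) + r"(?:\||$)"


# Plain substring match: also fires on "Crime Thriller".
naive = sample["genres"].str.contains("Crime").astype(np.int8)
# Delimiter-aware match: fires only on the exact genre "Crime".
exact = sample["genres"].str.contains(genre_pattern("Crime")).astype(np.int8)

print(naive.tolist())  # [1, 1, 0]
print(exact.tolist())  # [1, 0, 0]
```

The groups are non-capturing so that str.contains does not warn about match groups in the pattern.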

Answer 2

Try

dummies = movies.genres.str.get_dummies()
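As a small sketch of what this produces, the block below runs str.get_dummies on three rows copied from the question's output. For comparison it also repairs the question's loop: Index.get_loc takes a single label, which is why passing the list from gen.split('|') raises the TypeError even though get_loc('Animation') works on its own, while Index.get_indexer accepts a list of labels and returns their integer positions. The variable names looped and positions are illustrative only.

```python
import numpy as np
import pandas as pd

# Three rows copied from the movies.head() output in the question.
movies = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Animation|Children's|Comedy",
               "Adventure|Children's|Fantasy",
               "Comedy|Romance"],
})

# Vectorized route: split each string on '|' and build one indicator column per genre.
dummies = movies["genres"].str.get_dummies(sep="|")

# Loop route from the question, with get_indexer in place of get_loc.
genres = sorted(set.union(*(set(x.split("|")) for x in movies["genres"])))
looped = pd.DataFrame(np.zeros((len(movies), len(genres)), dtype=np.int64),
                      columns=genres)
for i, gen in enumerate(movies["genres"]):
    # get_indexer maps a list of column labels to their integer positions.
    positions = looped.columns.get_indexer(gen.split("|"))
    looped.iloc[i, positions] = 1

print(dummies)
print(looped)
```

Both routes should give the same indicator table; the vectorized call is shorter and avoids the Python-level loop.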

TypeError when creating dummy variables using iloc
