When I try to run this code snippet with my own CSV file:
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

data_df = pd.read_csv("movies_genres_en.csv", delimiter='\t')

# split the data, leave 1/3 out for testing
# (.values replaces as_matrix(), which newer pandas versions no longer provide)
data_x = data_df[['plot']].values
data_y = data_df.drop(['title', 'plot', 'plot_lang'], axis=1).values
stratified_split = StratifiedShuffleSplit(n_splits=2, test_size=0.33)
for train_index, test_index in stratified_split.split(data_x, data_y):
    x_train, x_test = data_x[train_index], data_x[test_index]
    y_train, y_test = data_y[train_index], data_y[test_index]

# data_x = overviews
# data_y = values from all the genre types ('Action', 'Adventure', 'Fantasy') (1 0 0 ...)

# transform matrix of plots into lists to pass to a TfidfVectorizer
train_x = [x[0].strip() for x in x_train.tolist()]
test_x = [x[0].strip() for x in x_test.tolist()]
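I keep getting this error:
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
Every class in my data has more than 2 instances. Here is the info for my CSV data: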
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4643 entries, 0 to 4642
Data columns (total 20 columns):
title 4643 non-null object
Action 4643 non-null object
Adventure 4643 non-null object
Fantasy 4643 non-null object
Science Fiction 4643 non-null object
Crime 4643 non-null object
Drama 4643 non-null object
Thriller 4643 non-null object
Animation 4643 non-null object
Family 4643 non-null object
Western 4643 non-null object
Comedy 4643 non-null object
Romance 4643 non-null object
Horror 4643 non-null object
Mystery 4643 non-null object
History 4643 non-null object
War 4643 non-null object
Music 4643 non-null object
Documentary 4643 non-null object
overview 4639 non-null object
dtypes: object(20)
memory usage: 725.5+ KB
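One way to double-check the "more than 2 instances" claim is to count whole label rows rather than individual genre columns. This is only a sketch, and it rests on an assumption: as far as I understand scikit-learn's behaviour, when `y` is 2-D, `StratifiedShuffleSplit` treats each distinct row (for example `1 0 0`) as its own class. The helper `rare_label_combinations` and the toy matrix `toy_y` are hypothetical names and data made up for illustration; they are not taken from the CSV above.

```python
from collections import Counter


def rare_label_combinations(rows, min_count=2):
    """Return {combination: count} for label rows seen fewer than min_count times."""
    # Each row becomes a hashable tuple so identical rows share one counter key.
    counts = Counter(tuple(row) for row in rows)
    return {combo: n for combo, n in counts.items() if n < min_count}


# Toy label matrix (hypothetical): columns stand for Action, Adventure, Fantasy.
toy_y = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 1, 1],
]

# Number of positives per genre column; every column has at least two.
column_totals = [sum(col) for col in zip(*toy_y)]

# Whole-row combinations that occur only once.
rare = rare_label_combinations(toy_y)

print(column_totals)
print(rare)
```

With the real data, the rows could be passed as `data_y.tolist()`. If the result is non-empty, some label combination occurs only once even though each genre on its own looks well populated.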
The first five rows of the CSV file:
data_df.head():