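Python / scikit-learn - Random forest ValueError: setting an array element with a sequence

Time: 2017-07-24 20:48:38

Tags: python arrays machine-learning scikit-learn random-forest

I am trying to train a random forest model with scikit-learn on a simple training dataset: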

Utm_term  | DayOfWeek  | Customers

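Utm_term is text that has been converted to an array using the bag-of-words method. DayOfWeek is an integer index from 0 to 6, and Customers is 1 or 0. This line gives me an error: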

forest = forest.fit( train_data_features, train["customers"] )

train_data_features is an array containing the utm_term and DayOfWeek columns. Calling print(train_data_features) gives the following output:

[ <10565x232 sparse matrix of type '<class 'numpy.int64'>'
       with 23089 stored elements in Compressed Sparse Row format>
 6 6 ..., 2 2 2]

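The error raised is: ValueError: setting an array element with a sequence.

My guess is that this happens because the utm_term part of the array is ragged, but I can't work out how to fix it. Any pointers are much appreciated.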
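To check that guess, here is a minimal sketch of what np.hstack appears to do when one of its inputs is a SciPy sparse matrix. The small matrix and day values are made up for illustration; they stand in for the CountVectorizer output and train["day"]. This reflects NumPy/SciPy behavior as I understand it, and it is consistent with the printout above.

```python
import numpy as np
from scipy import sparse

# Stand-in for the CountVectorizer output: 4 rows, 3 vocabulary columns
# (hypothetical data, not the real utm_term features).
bag_of_words = sparse.csr_matrix(np.array([[1, 0, 0],
                                           [0, 2, 0],
                                           [0, 0, 1],
                                           [1, 1, 0]]))
day = np.array([6, 6, 2, 2])  # hypothetical day-of-week values

# np.hstack does not understand sparse matrices: it treats the whole matrix
# as a single Python object and appends the day values after it.
stacked = np.hstack([bag_of_words, day])

print(stacked.dtype)  # object
print(stacked.shape)  # (5,): one matrix object followed by four integers
```

If this is what is happening, the result is a 1-D object array whose first element is the entire sparse matrix, not a 2-D numeric feature table of the kind forest.fit expects.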

The full code is below:

import pandas as pd       
train = pd.read_csv("customer_days.csv", header=0, \
                    delimiter=",", quoting=3)


from bs4 import BeautifulSoup     
import re
import nltk
from nltk.corpus import stopwords # Import the stop word list
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from scipy.sparse import coo_matrix, hstack

def review_to_words( raw_review ):
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_review).get_text() 
    #
    # 2. Remove non-letters        
    letters_only = re.sub("[^a-zA-Z]", " ", review_text) 
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()                             
    #
    # 4. Convert the stop word list to a set, which is faster to search than a list
    stops = set(stopwords.words("english"))
    #
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    #
    # 6. Join the words back into one string separated by space, 
    # and return the result.
    return( " ".join( meaningful_words ))

 #==============================================================================


# Get the number of queries based on the dataframe column size
num_reviews = train["utm_term"].size


# Loop over each query; create an index i that goes from 0 to the length
# of the query list 
print ("Cleaning and parsing the training set movie reviews...\n")
# Initialize an empty list to hold the clean queries

clean_train_reviews = []
for i in range( 0, num_reviews ):
    # If the index is evenly divisible by 1000, print a message
    if( (i+1)%1000 == 0 ):
        print ("Review %d of %d\n" % ( i+1, num_reviews )   )                                                                 
    clean_train_reviews.append( review_to_words( train["utm_term"][i] ))


print ("Creating the bag of words...\n")

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.  
vectorizer = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = None,   \
                             max_features = 5000) 

# fit_transform() does two functions: First, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of 
# strings.

train_day=np.asarray(train["day"])
train_data_features = vectorizer.fit_transform(clean_train_reviews)
train_data_features = np.hstack([train_data_features,train_day])

print ("Training the random forest...")

# Initialize a Random Forest classifier with 100 trees
forest = RandomForestClassifier(n_estimators = 100) 

# Fit the forest to the training set

print (train_data_features)

forest = forest.fit( train_data_features, train["customers"] )

The full error traceback is:

Training the random forest...
Traceback (most recent call last):

  File "<ipython-input-69-b1130cb33ae8>", line 1, in <module>
    runfile('/Users/xxxx/.spyder-py3/temp.py', wdir='/Users/taylorda/.spyder-py3')

  File "/Users/xxxx/anaconda/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 880, in runfile
    execfile(filename, namespace)

  File "/Users/xxxx/anaconda/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "/Users/xxxx/.spyder-py3/temp.py", line 138, in <module>
    forest = forest.fit( train_data_features, train["customers"] )

  File "/Users/xxxx/anaconda/lib/python3.6/site-packages/sklearn/ensemble/forest.py", line 247, in fit
    X = check_array(X, accept_sparse="csc", dtype=DTYPE)

  File "/Users/xxxx/anaconda/lib/python3.6/site-packages/sklearn/utils/validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)

ValueError: setting an array element with a sequence.
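
A possible fix, sketched under assumptions rather than tested on the real data: the traceback shows check_array calling np.array on the input, and that call cannot turn the object array built by np.hstack into numbers. scipy.sparse.hstack (already imported in the script as hstack, but never used) can join a sparse matrix with an extra column instead. The tiny arrays below are hypothetical stand-ins for the bag-of-words matrix, train["day"], and train["customers"].

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier

# Made-up stand-ins for the real data (assumptions, not the actual dataset).
bag_of_words = sparse.csr_matrix(np.array([[1, 0, 0],
                                           [0, 2, 0],
                                           [0, 0, 1],
                                           [1, 1, 0]]))
day = np.array([6, 6, 2, 2])
customers = np.array([1, 0, 1, 0])

# Turn the 1-D day values into an (n_samples, 1) sparse column so that it
# lines up row by row with the bag-of-words matrix.
day_column = sparse.csr_matrix(day.reshape(-1, 1))

# scipy.sparse.hstack joins the blocks column-wise and keeps the result sparse.
features = sparse.hstack([bag_of_words, day_column], format="csr")
print(features.shape)  # (4, 4)

# RandomForestClassifier accepts sparse input, so no dense conversion is needed.
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(features, customers)
```

In the original script this would presumably mean replacing the np.hstack line with the scipy.sparse.hstack call, after reshaping train_day into a single column.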

0 answers:

No answers yet