I am trying to train a random forest model with scikit-learn on a simple training dataset:
Utm_term | DayOfWeek | Customers
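Utm_term is text that has been converted to an array using a bag-of-words approach. DayOfWeek is an integer index from 0 to 6, and Customers is 1 or 0. This line gives me an error: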
forest = forest.fit( train_data_features, train["customers"] )
train_data_features is an array containing the utm_term and DayOfWeek columns. print(train_data_features) gives the following output:
[ <10565x232 sparse matrix of type '<class 'numpy.int64'>'
with 23089 stored elements in Compressed Sparse Row format>
6 6 ..., 2 2 2]
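This gives me the error: ValueError: setting an array element with a sequence.
I am guessing this is because the utm_term part of the array is irregular, but I can't work out how to fix it. Any pointers would be much appreciated.
The full code is below: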
import pandas as pd
train = pd.read_csv("customer_days.csv", header=0,
                    delimiter=",", quoting=3)
from bs4 import BeautifulSoup
import re
import nltk
from nltk.corpus import stopwords # Import the stop word list
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from scipy.sparse import coo_matrix, hstack
def review_to_words( raw_review ):
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_review).get_text()
    #
    # 2. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    #
    # 4. Build a set of stop words (set lookup is faster than list lookup)
    stops = set(stopwords.words("english"))
    #
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    #
    # 6. Join the words back into one string separated by space,
    # and return the result.
    return " ".join( meaningful_words )
#==============================================================================
# Get the number of queries based on the dataframe column size
num_reviews = train["utm_term"].size
# Loop over each query; create an index i that goes from 0 to the length
# of the query list
print ("Cleaning and parsing the training set movie reviews...\n")
# Initialize an empty list to hold the clean queries
clean_train_reviews = []
for i in range( 0, num_reviews ):
    # If the index is evenly divisible by 1000, print a message
    if( (i+1)%1000 == 0 ):
        print ("Review %d of %d\n" % ( i+1, num_reviews ) )
    clean_train_reviews.append( review_to_words( train["utm_term"][i] ))
print ("Creating the bag of words...\n")
# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.
vectorizer = CountVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             max_features = 5000)
# fit_transform() does two functions: First, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of
# strings.
train_day=np.asarray(train["day"])
train_data_features = vectorizer.fit_transform(clean_train_reviews)
train_data_features = np.hstack([train_data_features,train_day])
print ("Training the random forest...")
# Initialize a Random Forest classifier with 100 trees
forest = RandomForestClassifier(n_estimators = 100)
# Fit the forest to the training set
print (train_data_features)
forest = forest.fit( train_data_features, train["customers"] )
The full error traceback is:
Training the random forest...
Traceback (most recent call last):
File "<ipython-input-69-b1130cb33ae8>", line 1, in <module>
runfile('/Users/xxxx/.spyder-py3/temp.py', wdir='/Users/taylorda/.spyder-py3')
File "/Users/xxxx/anaconda/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 880, in runfile
execfile(filename, namespace)
File "/Users/xxxx/anaconda/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/Users/xxxx/.spyder-py3/temp.py", line 138, in <module>
forest = forest.fit( train_data_features, train["customers"] )
File "/Users/xxxx/anaconda/lib/python3.6/site-packages/sklearn/ensemble/forest.py", line 247, in fit
X = check_array(X, accept_sparse="csc", dtype=DTYPE)
File "/Users/xxxx/anaconda/lib/python3.6/site-packages/sklearn/utils/validation.py", line 382, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
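One possible reading of the traceback above, offered as a sketch rather than a confirmed diagnosis: np.hstack does not understand SciPy sparse matrices, so it seems to treat the whole matrix as a single object element and append the day values after it, which would match the printed output. The script already imports hstack from scipy.sparse without using it. The example below uses small made-up data; bag_of_words and the values in train_day are hypothetical stand-ins for the output of vectorizer.fit_transform(...) and for train["day"].

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical stand-in for vectorizer.fit_transform(clean_train_reviews):
# 4 rows (queries) x 3 columns (vocabulary terms).
bag_of_words = csr_matrix(np.array([[1, 0, 2],
                                    [0, 1, 0],
                                    [0, 0, 1],
                                    [3, 0, 0]]))

# Hypothetical stand-in for np.asarray(train["day"]): one day index per row.
train_day = np.array([6, 6, 2, 2])

# Turn the 1-D day vector into a single sparse column so that it has the
# same number of rows as the bag-of-words matrix.
day_column = csr_matrix(train_day.reshape(-1, 1))

# Stack column-wise with scipy.sparse.hstack (not np.hstack), then convert
# the result to CSR format.
train_data_features = hstack([bag_of_words, day_column]).tocsr()

print(train_data_features.shape)      # (4, 4)
print(train_data_features.toarray())  # last column holds the day values
```

In the script, the equivalent change would be to build the day column the same way and replace the np.hstack line with the scipy.sparse hstack call. The check_array call in the traceback is passed accept_sparse="csc", which suggests forest.fit can take the stacked sparse matrix directly.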