Question

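I'm using Pandas and SciKit-Learn to do some basic data cleaning followed by ML. I have a words_df DataFrame that is 983 rows x 33,600 columns. Most of those columns come from running TF-IDF, like so: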

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Build the document-term matrix of raw word counts
corpus = result_df['_text'].tolist()
count_vect = CountVectorizer(min_df=1, stop_words='english')
dtm = count_vect.fit_transform(corpus)
word_counts = dtm.toarray()

# Reweight the counts with TF-IDF
tfidf_transformer = TfidfTransformer()
tfidf = tfidf_transformer.fit_transform(word_counts)

# One column per vocabulary word (scikit-learn >= 1.0 also offers get_feature_names_out())
words_df = pd.DataFrame(tfidf.todense(), columns=count_vect.get_feature_names())

I pulled out an X and a Y (the input instances and their target values, which in my case are page views). X is a DataFrame and Y is a Series (I'm just using words_df['_pageviews']).
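The extraction step itself is not shown. A minimal sketch of what it might look like follows, assuming the metadata columns `_title` and `_pageviews` sit in the same frame as the TF-IDF columns; `toy_words_df` and its values are hypothetical stand-ins for words_df:

```python
import pandas as pd

# Hypothetical stand-in for words_df: TF-IDF columns plus the metadata columns.
toy_words_df = pd.DataFrame({
    "_title": ["first page", "second page", "third page"],
    "_pageviews": [120, 45, 300],
    "python": [0.7, 0.0, 0.2],
    "pandas": [0.0, 0.9, 0.4],
})

# Target: the page-view counts as a Series.
Y = toy_words_df["_pageviews"]

# Inputs: everything except the target and the non-numeric title column.
X = toy_words_df.drop(columns=["_pageviews", "_title"])
```

Dropping the target from X keeps it from leaking into the features, and dropping the text column leaves only numeric inputs.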

Then I ran:

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

Unfortunately, I get this error:

TypeError: Expected sequence or array-like, got estimator      _title

Is this because one of my columns is named _title? I'm not sure what else could be causing this error.

Thanks!
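On the _title question: one possibility worth ruling out, offered as a guess rather than a confirmed diagnosis, is a different column entirely. scikit-learn releases from around the time of this error message rejected any input that has a `fit` attribute, and pandas exposes string column labels as attributes. With 33,600 vocabulary columns, a column named `fit` is plausible, and it would make the DataFrame look like an estimator; `_title` may simply be the first thing printed when the DataFrame is formatted into the message. The sketch below uses a toy frame (`toy_df` and its contents are hypothetical) to show the collision and one way around it, which is passing a NumPy array instead of the DataFrame:

```python
import pandas as pd

# Toy stand-in for words_df: one column per vocabulary word (hypothetical data).
toy_df = pd.DataFrame({
    "_title": ["a", "b"],
    "fit": [0.1, 0.0],
    "model": [0.0, 0.3],
})

# pandas exposes string column labels as attributes, so toy_df.fit is the "fit" column.
has_fit_attribute = hasattr(toy_df, "fit")

# Column labels that collide with the estimator method name.
suspect_columns = [c for c in toy_df.columns if c == "fit"]

# A plain NumPy array has no column-name attributes, so the collision disappears.
X_array = toy_df.drop(columns=["_title"]).to_numpy()
array_has_fit = hasattr(X_array, "fit")
```

If `suspect_columns` comes back non-empty on the real frame, renaming that column or passing the array form to train_test_split would be the two obvious things to try.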

SciKit-Learn: Trouble using train_test_split

0 answers: