I have clustered a DataFrame and then used groupby to group it by the resulting 'clusters' value:
clusterGroup = df1.groupby('clusters')
Each group in clusterGroup has multiple rows (and ~30 columns), and I need to create a new DataFrame with a single row per group that represents that group's cluster center. I'm using KMeans to do this, specifically its ".cluster_centers_" attribute. The idea was to loop through each group, calculate the cluster center, and append it to a new collection called logCenters.
df1.head()
9367 13575 13577 13578 13580 13585 13587 13588 13589 13707 13708 13719 13722 13725 13817 13819 14894 20326 20379 20384 20431 20433 22337 22346 22386 22388 22391 clusters
493 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 105.0 0.0 0.0 0.0 0.0 0.0 0.0 112.0 0.0 107.0 0.0 0.0 0.0 14
510 0.0 0.0 0.0 113.0 0.0 0.0 111.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 105.0 0.0 0.0 0.0 0.0 0.0 26
513 0.0 0.0 0.0 114.0 0.0 0.0 106.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 106.0 0.0 0.0 0.0 0.0 0.0 26
516 0.0 0.0 0.0 114.0 0.0 0.0 111.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 108.0 0.0 0.0 0.0 0.0 0.0 26
519 0.0 0.0 0.0 113.0 0.0 0.0 113.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 109.0 0.0 0.0 0.0 0.0 0.0 26
from sklearn.cluster import KMeans
K = 1
logCenters = []
for x in clusterGroup:
    kmeans_model = KMeans(n_clusters=K).fit(x)
    centers = np.array(kmeans_model.cluster_centers_)
    logCenters.append(centers)
The error I get when running this loop is:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-108-148e4053f5fb> in <module>()
3 logCenters = []
4 for x in clusterGroup:
----> 5 kmeans_model = KMeans(n_clusters=K).fit(x)
6 centers = np.array(kmeans_model.cluster_centers_)
7 logCenters.append(centers)
/home/nbuser/anaconda3_23/lib/python3.4/site-packages/sklearn/cluster/k_means_.py in fit(self, X, y)
878 """
879 random_state = check_random_state(self.random_state)
--> 880 X = self._check_fit_data(X)
881
882 self.cluster_centers_, self.labels_, self.inertia_, self.n_iter_ = \
/home/nbuser/anaconda3_23/lib/python3.4/site-packages/sklearn/cluster/k_means_.py in _check_fit_data(self, X)
852 def _check_fit_data(self, X):
853 """Verify that the number of samples given is larger than k"""
--> 854 X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32])
855 if X.shape[0] < self.n_clusters:
856 raise ValueError("n_samples=%d should be >= n_clusters=%d" % (
/home/nbuser/anaconda3_23/lib/python3.4/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
380 force_all_finite)
381 else:
--> 382 array = np.array(array, dtype=dtype, order=order, copy=copy)
383
384 if ensure_2d:
ValueError: setting an array element with a sequence.
Answer 0 (score: -1)
clusterGroup = df1.groupby('clusters')
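returns a GroupBy object, not a DataFrame. Iterating over it yields (group key, sub-DataFrame) tuples.
sklearn works with numpy arrays or pandas DataFrames, but you're trying to feed it those tuples. Hence the error: ValueError: setting an array element with a sequence.
Try unpacking each tuple so that only the DataFrame part is passed to fit; printing type(x) inside the loop is a quick way to debug this.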
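The suggestion above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual data: the toy df1 and its column names "a" and "b" are made up, and dropping the 'clusters' column before fitting is an assumption about what the center should cover.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for df1 (hypothetical data): two feature columns plus 'clusters'.
df1 = pd.DataFrame({
    "a": [0.0, 2.0, 10.0, 14.0],
    "b": [1.0, 3.0, 20.0, 24.0],
    "clusters": [14, 14, 26, 26],
})

K = 1
centers = {}
# Iterating a GroupBy yields (label, sub-DataFrame) tuples, so unpack both.
for label, group in df1.groupby("clusters"):
    features = group.drop(columns="clusters")  # keep the label out of the fit
    model = KMeans(n_clusters=K, n_init=10, random_state=0).fit(features)
    centers[label] = model.cluster_centers_[0]  # the single center of this group

# One row per cluster, indexed by the cluster label.
logCenters = pd.DataFrame.from_dict(centers, orient="index", columns=["a", "b"])

# With K = 1 the single center is the per-group mean, so a plain groupby
# mean should agree with the KMeans result up to floating-point error.
group_means = df1.groupby("clusters").mean()
same = np.allclose(logCenters.to_numpy(), group_means.to_numpy())
```

Since K is 1 here, df1.groupby('clusters').mean() alone is probably the simpler route; the KMeans loop only becomes necessary if more than one center per group is wanted.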