Question

I have clustered a DataFrame and then used groupby to group it by the resulting 'clusters' value:

clusterGroup = df1.groupby('clusters')

Each group in clusterGroup has multiple rows (and ~30 columns), and I need to create a new DataFrame with a single row per group that represents the cluster center of that group. I'm using KMeans to do this, specifically its ".cluster_centers_" attribute. The idea was to loop through each group, calculate the cluster center, and append it to a new collection called logCenters.

df1.head()

9367    13575   13577   13578   13580   13585   13587   13588   13589   13707   13708   13719   13722   13725   13817   13819   14894   20326   20379   20384   20431   20433   22337   22346   22386   22388   22391   clusters
493 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 105.0   0.0 0.0 0.0 0.0 0.0 0.0 112.0   0.0 107.0   0.0 0.0 0.0 14
510 0.0 0.0 0.0 113.0   0.0 0.0 111.0   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 105.0   0.0 0.0 0.0 0.0 0.0 26
513 0.0 0.0 0.0 114.0   0.0 0.0 106.0   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 106.0   0.0 0.0 0.0 0.0 0.0 26
516 0.0 0.0 0.0 114.0   0.0 0.0 111.0   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 108.0   0.0 0.0 0.0 0.0 0.0 26
519 0.0 0.0 0.0 113.0   0.0 0.0 113.0   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 109.0   0.0 0.0 0.0 0.0 0.0 26

import numpy as np
from sklearn.cluster import KMeans
K = 1
logCenters = []
for x in clusterGroup:
    kmeans_model = KMeans(n_clusters=K).fit(x)
    centers = np.array(kmeans_model.cluster_centers_)
    logCenters.append(centers)

The error I get when running this loop is:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-108-148e4053f5fb> in <module>()
      3 logCenters = []
      4 for x in clusterGroup:
----> 5     kmeans_model = KMeans(n_clusters=K).fit(x)
      6     centers = np.array(kmeans_model.cluster_centers_)
      7     logCenters.append(centers)

/home/nbuser/anaconda3_23/lib/python3.4/site-packages/sklearn/cluster/k_means_.py in fit(self, X, y)
    878         """
    879         random_state = check_random_state(self.random_state)
--> 880         X = self._check_fit_data(X)
    881 
    882         self.cluster_centers_, self.labels_, self.inertia_, self.n_iter_ = \

/home/nbuser/anaconda3_23/lib/python3.4/site-packages/sklearn/cluster/k_means_.py in _check_fit_data(self, X)
    852     def _check_fit_data(self, X):
    853         """Verify that the number of samples given is larger than k"""
--> 854         X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32])
    855         if X.shape[0] < self.n_clusters:
    856             raise ValueError("n_samples=%d should be >= n_clusters=%d" % (

/home/nbuser/anaconda3_23/lib/python3.4/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    380                                       force_all_finite)
    381     else:
--> 382         array = np.array(array, dtype=dtype, order=order, copy=copy)
    383 
    384         if ensure_2d:

ValueError: setting an array element with a sequence.

Answer 1

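clusterGroup = df1.groupby('clusters') returns a DataFrameGroupBy object, not a DataFrame. Iterating over it yields (name, group) tuples, where name is the cluster label and group is the sub-DataFrame holding the rows for that label.

sklearn works with numpy arrays or pandas DataFrames,

but your loop passes it those tuples. Hence the error: ValueError: setting an array element with a sequence.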
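To see what the loop variable actually holds, here is a small sketch. The toy frame and its column names "a" and "b" are made up for illustration and merely stand in for the real df1:

```python
import pandas as pd

# Hypothetical toy data standing in for df1 (column names are assumptions)
df1 = pd.DataFrame({
    "a": [0.0, 2.0, 113.0, 114.0, 114.0],
    "b": [105.0, 103.0, 0.0, 0.0, 0.0],
    "clusters": [14, 14, 26, 26, 26],
})

# Materialize the iteration so each item can be inspected
items = list(df1.groupby("clusters"))

for item in items:
    # Each item is a (group label, sub-DataFrame) tuple, not a DataFrame
    label, group = item
    print(type(item).__name__, label, group.shape)
```

Passing `item` straight to `fit()` asks numpy to build a float array out of a label and a DataFrame, which is where the "sequence" in the error message comes from.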

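To fix it, pass a DataFrame to fit() rather than the tuple: unpack each item in the loop (for name, group in clusterGroup:) and fit on group only. Printing type(x) inside the loop is a quick way to debug this.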
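A minimal sketch of the corrected loop, under some assumptions: the toy frame and its columns "a" and "b" are hypothetical, the 'clusters' label column is dropped before fitting so it does not end up as a feature, and `n_init` and `random_state` are set only to make the run reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data standing in for df1 (column names are assumptions)
df1 = pd.DataFrame({
    "a": [0.0, 2.0, 113.0, 114.0, 114.0],
    "b": [105.0, 103.0, 0.0, 0.0, 0.0],
    "clusters": [14, 14, 26, 26, 26],
})
feature_cols = df1.columns.drop("clusters")

K = 1
centers = {}
for label, group in df1.groupby("clusters"):
    # Unpack the tuple and fit on the group's feature columns only
    features = group[feature_cols]
    model = KMeans(n_clusters=K, n_init=10, random_state=0).fit(features)
    centers[label] = model.cluster_centers_[0]

# One row per cluster label, one column per feature
logCenters = pd.DataFrame.from_dict(centers, orient="index", columns=feature_cols)
logCenters.index.name = "clusters"

# With K = 1 the single k-means center is the column-wise mean of the group,
# so a plain groupby mean gives the same values without sklearn
mean_centers = df1.groupby("clusters").mean()
print(logCenters)
print(mean_centers)
```

If K stays at 1, the `groupby(...).mean()` line alone may be all that is needed; the KMeans loop only becomes necessary when each group should be split into more than one center.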
