Question

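I have a pandas DataFrame with 27 columns of electricity-consumption data. The first column holds the date and time over a two-year period, and the other columns hold the hourly electricity consumption recorded for 26 houses over those two years. I am applying k-means clustering. Whenever I try to plot the dates on the x-axis and the consumption values on the y-axis, I get an error saying that x and y must be the same size. I tried reshaping, but that did not solve the problem.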

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import math
import datetime
data_consumption2 = pd.read_excel(r"C:\Users\user\Desktop\Thesis\Tarek\Parent.xlsx", sheet_name="Consumption")
data_consumption2['Timestamp'] = pd.to_datetime(data_consumption2['Timestamp'], unit='s')
X=data_consumption2.iloc[: , 1:26].values
X=np.nan_to_num(X)
np.concatenate(X)
date=data_consumption2.iloc[: , 0].values
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
C = kmeans.cluster_centers_
plt.scatter(X, R , s=40, c= kmeans.labels_.astype(float), alpha=0.7)
plt.scatter(C[:,0] , C[:,1] , marker='*' , c='r', s=100)

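I always get the same error message: x and y must be the same size, try reshaping the data. When I try to reshape the data, it does not work, because the size of the date column is always smaller than the size of the remaining columns.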

Answer 1

I think you are essentially doing time-series clustering across all households, to find similar electricity-usage patterns over time.

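For that, each timestamp becomes a "feature", and each household's usage becomes a row of your data. This makes it easier to apply sklearn's clustering methods, which typically take the form method.fit(x), where x holds the features (the data is passed as a 2D array of shape (row, column)). So your data needs to be transposed.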
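
To make the shape issue concrete, here is a minimal sketch on made-up data (four hourly readings for three hypothetical houses; the names toy, house_1, etc. are assumptions, not taken from your workbook). It shows why the date column and the consumption array cannot be paired in a single scatter call, and what the transposed layout looks like:

```python
import pandas as pd

# Hypothetical toy data: 4 hourly timestamps, 3 houses (names are made up).
toy = pd.DataFrame({
    "Timestamp": pd.date_range("2020-01-01", periods=4, freq=pd.Timedelta(hours=1)),
    "house_1": [1.0, 2.0, 3.0, 4.0],
    "house_2": [0.5, 0.4, 0.6, 0.5],
    "house_3": [2.0, 2.1, 1.9, 2.2],
})

dates = toy.iloc[:, 0].values   # shape (4,): one value per timestamp
X = toy.iloc[:, 1:].values      # shape (4, 3): timestamps x houses

# 4 x-values cannot be paired with 12 y-values, hence the size error.
print(dates.shape, X.shape, dates.size == X.size)

# Transposed layout: one row per house, one column (feature) per timestamp.
houses = toy.set_index("Timestamp").T
print(houses.shape)
```

In the transposed frame every row has exactly one value per timestamp, so a single row can be plotted against the timestamps without any size mismatch.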

The refactored code is as follows:

# what you have done 
import pandas as pd
df = pd.read_excel(r"C:\Users\user\Desktop\Thesis\Tarek\Parent.xlsx", sheet_name="Consumption")
df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='s')

# this is to fill all the NaN values with 0
df.fillna(0,inplace=True)

# transpose the dataframe accordingly
df = df.set_index('Timestamp').transpose()
df.rename(columns=lambda x : x.strftime('%D %H:%M:%S'), inplace=True)
df.reset_index(inplace=True)
df.rename(columns={'index':'house_no'}, inplace=True)
df.columns.rename(None, inplace=True)
df.head()

You should see something like this (never mind the data shown; I created some dummy data similar to yours).
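
If you want to try the steps without the Excel file, dummy data of this kind could be generated roughly as below. This is only a sketch under assumptions: the column names house_1 ... house_26, the 48-hour length, the start date and the uniform random values are all invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the dummy data is reproducible

n_hours, n_houses = 48, 26      # assumed sizes; the real data spans two years
start = 1577836800              # 2020-01-01 00:00:00 UTC in Unix seconds

# Build the Timestamp column as Unix seconds, one reading per hour.
columns = {"Timestamp": start + 3600 * np.arange(n_hours)}
# One column of made-up hourly consumption per hypothetical house.
for i in range(1, n_houses + 1):
    columns[f"house_{i}"] = rng.uniform(0.0, 5.0, size=n_hours)

dummy = pd.DataFrame(columns)
# Same conversion as in the code above (unit='s').
dummy["Timestamp"] = pd.to_datetime(dummy["Timestamp"], unit="s")
print(dummy.shape)
```

Using dummy in place of df from the fillna step onward should let you run the rest of the answer end to end.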

Next, for the clustering, here is what you can do:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)
kmeans.fit(df.iloc[:,1:])
y_kmeans = kmeans.predict(df.iloc[:,1:])
C = kmeans.cluster_centers_

# add a new column to your dataframe that contains the predicted clusters
df['cluster'] = y_kmeans
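
Once the cluster column is in place, a quick way to see which houses ended up together is a groupby. The sketch below uses a small hand-written stand-in frame (the name clustered and its labels are made up) instead of the real df, just to show the idea:

```python
import pandas as pd

# Hypothetical stand-in for the transposed frame after clustering;
# only the two columns needed here are included.
clustered = pd.DataFrame({
    "house_no": ["house_1", "house_2", "house_3", "house_4"],
    "cluster": [0, 2, 0, 1],
})

# Collect the house names belonging to each cluster label.
members = clustered.groupby("cluster")["house_no"].apply(list)
print(members)
```

On your real data the same call would be df.groupby('cluster')['house_no'].apply(list).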

Finally, for plotting, you can use the code below to generate the scatter plot you want:

import matplotlib.pyplot as plt

color = ['red','green','blue']

plt.figure(figsize=(16,4))

for index, row in df.iterrows():
    plt.scatter(x=row.index[1:-1], y=row.iloc[1:-1], c=color[row.iloc[-1]], marker='x', alpha=0.7, s=40)

for index, cluster_center in enumerate(kmeans.cluster_centers_):
    plt.scatter(x=df.columns[1:-1], y=cluster_center, c=color[index], marker='o', s=100)

plt.xticks(rotation='vertical')
plt.ylabel('Electricity Consumption')
plt.title('All Clusters - Scatter', fontsize=20)
plt.show()

However, I would suggest drawing line plots for the individual clusters, which I find more visually appealing:

plt.figure(figsize=(16,16))

for cluster_index in [0,1,2]:

    plt.subplot(3,1,cluster_index + 1)

    for index, row in df.iterrows():
        if row.iloc[-1] == cluster_index:
            plt.plot(row.iloc[1:-1], c=color[row.iloc[-1]], linestyle='--', marker='x', alpha=0.5)

    plt.plot(kmeans.cluster_centers_[cluster_index], c = color[cluster_index], marker='o', alpha=1)

    plt.xticks(rotation='vertical')
    plt.ylabel('Electricity Consumption')
    plt.title(f'Cluster {cluster_index}', fontsize=20)

plt.tight_layout()
plt.show()

Cheers!
