I use the following code to build and prepare my pandas dataframe:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plp
from sklearn import preprocessing
from sklearn.cluster import KMeans

data = pd.read_csv('statistic.csv',
                   parse_dates=True, index_col=['DATE'], low_memory=False)
data[['QUANTITY']] = data[['QUANTITY']].apply(pd.to_numeric, errors='coerce')
data_extracted = data.groupby(['DATE', 'ARTICLENO'])['QUANTITY'].sum().unstack()
# replace string nan with the numpy data type
data_extracted = data_extracted.fillna(value=np.nan)
# remove the footer of the csv file
data_extracted.index = pd.to_datetime(data_extracted.index.str[:-2],
                                      errors='coerce')
# resample to a one-week rhythm (note: loffset was removed in pandas 2.0;
# on newer versions shift the index by one day after resampling instead)
data_resampled = data_extracted.resample('W-MON', label='left',
                                         loffset=pd.DateOffset(days=1)).sum()
# reduce to one year
data_extracted = data_extracted.loc['2015-01-01':'2015-12-31']
# fill possible NaNs with 1 (not 0, because of division by zero when
# doing pct_change)
data_extracted = data_extracted.replace([np.inf, -np.inf], np.nan).fillna(1)
data_pct_change = (data_extracted.astype(float).pct_change(axis=0)
                   .replace([np.inf, -np.inf], np.nan).fillna(0))
# actual dropping logic if a column has no values at all
data_pct_change.drop([col for col, val in data_pct_change.sum().items()
                      if val == 0], axis=1, inplace=True)
normalized_modeling_data = preprocessing.normalize(data_pct_change,
                                                   norm='l2', axis=0)
normalized_data_headers = pd.DataFrame(normalized_modeling_data,
                                       columns=data_pct_change.columns)
normalized_modeling_data = normalized_modeling_data.transpose()
kmeans = KMeans(n_clusters=3, random_state=0).fit(normalized_modeling_data)
print(kmeans.labels_)
np.savetxt('log_2016.txt', kmeans.labels_, newline="\n")
for i, cluster_center in enumerate(kmeans.cluster_centers_):
    plp.plot(cluster_center, label='Center {0}'.format(i))
plp.legend(loc='best')
plp.show()
Unfortunately, my dataframe contains a lot of zeros (the articles do not all start on the same date, so if article A starts in 2015 and article B starts in 2016, B gets 0 for the whole of 2015; see the short sketch after the table below). Here is the grouped dataframe:
ARTICLENO 205123430604 205321436644 405659844106 305336746308
DATE
2015-01-05 9.0 6.0 560.0 2736.0
2015-01-19 2.0 1.0 560.0 3312.0
2015-01-26 NaN 5.0 600.0 2196.0
2015-02-02 NaN NaN 40.0 3312.0
2015-02-16 7.0 6.0 520.0 5004.0
2015-02-23 12.0 4.0 480.0 4212.0
2015-04-13 11.0 6.0 920.0 4230.0
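For illustration, a minimal sketch of where those zeros come from (idx and df are hypothetical toy values, not the real data): after unstack() the missing history of an article is NaN, but resample(...).sum() turns all-NaN and empty bins into 0.0:

import numpy as np
import pandas as pd

# toy frame: article 'B' has no sales in January
idx = pd.to_datetime(['2015-01-05', '2015-01-12', '2015-02-02', '2015-02-09'])
df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                   'B': [np.nan, np.nan, 10.0, 20.0]}, index=idx)

# bins that contain only NaN (or no rows at all) sum to 0.0,
# so B's missing history becomes a run of zeros instead of NaNs
print(df.resample('W-MON', label='left').sum())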
Here is the corresponding percentage change:
ARTICLENO 205123430604 205321436644 405659844106 305336746308
DATE
2015-01-05 0.000000 0.000000 0.000000 0.000000
2015-01-19 -0.777778 -0.833333 0.000000 0.210526
2015-01-26 -0.500000 4.000000 0.071429 -0.336957
2015-02-02 0.000000 -0.800000 -0.933333 0.508197
2015-02-16 6.000000 5.000000 12.000000 0.510870
2015-02-23 0.714286 -0.333333 -0.076923 -0.158273
The factor of 12 at 405659844106 is "correct". Here is another example from my dataframe:
ARTICLENO 305123446353 205423146377 305669846421 905135949255
DATE
2015-01-05 2175.0 200.0 NaN NaN
2015-01-19 2550.0 NaN NaN NaN
2015-01-26 925.0 NaN NaN NaN
2015-02-02 675.0 NaN NaN NaN
2015-02-16 1400.0 200.0 120.0 NaN
2015-02-23 6125.0 320.0 NaN NaN
And the corresponding percentage change:
ARTICLENO 305123446353 205423146377 305669846421 905135949255
DATE
2015-01-05 0.000000 0.000000 0.000000 0.000000
2015-01-19 0.172414 -0.995000 0.000000 -0.058824
2015-01-26 -0.637255 0.000000 0.000000 0.047794
2015-02-02 -0.270270 0.000000 0.000000 -0.996491
2015-02-16 1.074074 199.000000 119.000000 279.000000
2015-02-23 3.375000 0.600000 -0.991667 0.310714
As you can see, the changes with factors of 200-300 come from replaced NaNs turning into actual values; a minimal reproduction follows at the end of this question.
This data is used for k-means clustering, and such "nonsense" values wreck my k-means centers.
Does anyone know how to remove such columns?
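For reference, a minimal sketch reproducing the artifact (s is a hypothetical toy column, not the real data):

import numpy as np
import pandas as pd

# toy column that only starts having sales in week 5
s = pd.Series([np.nan, np.nan, np.nan, np.nan, 200.0, 320.0])

# the pipeline's fillna(1) turns the leading NaNs into 1, so the first
# real value yields (200 - 1) / 1 = 199.0 in pct_change
print(s.fillna(1).pct_change().fillna(0))
# -> 0.0, 0.0, 0.0, 0.0, 199.0, 0.6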
Answer 0 (score: 0)
I removed the meaningless columns with the following statement:
max_nan_value_count = 5
data_extracted = data_extracted.drop(
    data_extracted.columns[data_extracted.apply(
        lambda col: col.isnull().sum() > max_nan_value_count)],
    axis=1)
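An equivalent and arguably more idiomatic variant (a sketch using the same max_nan_value_count threshold) is pandas' built-in dropna with its thresh argument:

# keep only columns with at most max_nan_value_count missing values,
# i.e. at least len(data_extracted) - max_nan_value_count real values
data_extracted = data_extracted.dropna(
    axis=1, thresh=len(data_extracted) - max_nan_value_count)

Either way, the drop has to run before the fillna(1) step, since afterwards there are no NaNs left to count.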