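I have a function that implements the k-means algorithm, and I would like to use it with DataFrames so that the index is taken into account. At the moment I pass DataFrame.values and it works, but I don't get the index in the output.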
    import random

    import numpy as np

    def cluster_points(X, mu):
        # Assign every point in X to the key of its nearest center in mu
        clusters = {}
        for x in X:
            bestmukey = min([(i[0], np.linalg.norm(x-mu[i[0]]))
                             for i in enumerate(mu)], key=lambda t:t[1])[0]
            try:
                clusters[bestmukey].append(x)
            except KeyError:
                clusters[bestmukey] = [x]
        return clusters

    def reevaluate_centers(mu, clusters):
        # New center of each cluster = mean of its points
        newmu = []
        keys = sorted(clusters.keys())
        for k in keys:
            newmu.append(np.mean(clusters[k], axis = 0))
        return newmu

    def has_converged(mu, oldmu):
        return (set([tuple(a) for a in mu]) == set([tuple(a) for a in oldmu]))

    def find_centers(X, K):
        # Initialize to K random centers
        # (list(X) because random.sample needs a sequence, not an ndarray)
        oldmu = random.sample(list(X), K)
        mu = random.sample(list(X), K)
        while not has_converged(mu, oldmu):
            oldmu = mu
            # Assign all points in X to clusters
            clusters = cluster_points(X, mu)
            # Reevaluate centers
            mu = reevaluate_centers(oldmu, clusters)
        return (mu, clusters)
For example, this is a minimal and sufficient example:
    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randint(0,10,size=(10, 5)), index = list(range(10)), columns=list(range(5)))
    df.index.name = 'subscriber_id'
    df.columns.name = 'ad_id'
I get:
    find_centers(df.values, 2)
    ([array([ 3.8, 3. , 3.6, 2. , 3.6]),
      array([ 6.8, 3.6, 5.6, 6.8, 6.8])],
     {0: [array([2, 0, 5, 6, 4]),
          array([1, 1, 2, 3, 3]),
          array([6, 0, 4, 0, 3]),
          array([7, 9, 4, 1, 7]),
          array([3, 5, 3, 0, 1])],
      1: [array([6, 2, 5, 9, 6]),
          array([8, 9, 7, 2, 8]),
          array([7, 5, 3, 7, 8]),
          array([7, 1, 5, 7, 6]),
          array([6, 1, 8, 9, 6])]})
I get the values, but not the index.
Answer 0 (score: 1)
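If you want an array of values that includes the index, just move the index into the columns with reset_index():
    values_with_index = df.reset_index().values
Update
If what you want is to have the index in the output, but not to use it during the actual clustering, you can do the following. First, pass the actual DataFrame object to find_centers:
    find_centers(df, 2)
Then change cluster_points as follows:
    def cluster_points(X, mu):
        clusters = {}
        for _, x in X.iterrows():
            bestmukey = min([(i[0], np.linalg.norm(x-mu[i[0]]))
                             for i in enumerate(mu)], key=lambda t:t[1])[0]
            # You can replace this try/except block with
            # clusters.setdefault(bestmukey, []).append(x)
            try:
                clusters[bestmukey].append(x)
            except KeyError:
                clusters[bestmukey] = [x]
        return clusters
The centers in the output will still be arrays, but the clusters will contain a Series object for each row. The name attribute of each Series is the index value from the DataFrame.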
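To make the last point concrete, here is a minimal end-to-end sketch of the DataFrame-based approach. It is an illustration under a few assumptions, not the original code. The names find_centers_df and cluster_index_labels are hypothetical, and so is the seed parameter. The initial centers are drawn from X.values, because on Python 3 random.sample does not accept a DataFrame. The loop is also restructured so that clusters is always assigned before the function returns.

```python
import random

import numpy as np
import pandas as pd


def cluster_points(X, mu):
    """Assign each row of DataFrame X to the nearest center in mu."""
    clusters = {}
    for _, x in X.iterrows():
        # x is a Series; x.name holds the row's index label
        distances = [np.linalg.norm(x.values - m) for m in mu]
        clusters.setdefault(int(np.argmin(distances)), []).append(x)
    return clusters


def reevaluate_centers(clusters):
    """Return the mean of each cluster's rows as a plain array."""
    return [np.mean([x.values for x in clusters[k]], axis=0)
            for k in sorted(clusters)]


def has_converged(mu, oldmu):
    return {tuple(a) for a in mu} == {tuple(a) for a in oldmu}


def find_centers_df(X, K, seed=None):
    """Hypothetical DataFrame-aware variant of find_centers."""
    rng = random.Random(seed)
    # Sample the initial centers from the raw rows, not from the DataFrame
    mu = rng.sample(list(X.values.astype(float)), K)
    while True:
        clusters = cluster_points(X, mu)
        newmu = reevaluate_centers(clusters)
        if has_converged(newmu, mu):
            return newmu, clusters
        mu = newmu


def cluster_index_labels(clusters):
    """Map each cluster key to the index labels of its rows."""
    return {k: [x.name for x in rows] for k, rows in clusters.items()}


# Small fixed example with two well-separated groups of rows
df = pd.DataFrame(
    [[0, 0], [1, 1], [0, 1], [100, 100], [101, 101], [100, 101]],
    index=["a", "b", "c", "x", "y", "z"],
)
df.index.name = "subscriber_id"

mu, clusters = find_centers_df(df, 2, seed=0)
labels = cluster_index_labels(clusters)
```

Here labels maps each cluster key to the subscriber_id values of the rows assigned to it. Which group ends up under key 0 and which under key 1 depends on the random initial centers.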