Question

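Sample data:

What I want to do is replace the largest cluster ID with 0, the second largest with 1, and so on. The output would look like the following.

I'm not quite sure where to start. Any help would be much appreciated.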

Answer 1

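The objective is to relabel the groups defined in the 'cluster' column with the rank of each group's total row count. We'll break this down into several steps:

Integer factorization. Find an integer representation where each unique value in the column gets its own integer. We'll start from zero.
We then need the counts of each of these unique values.
We need to rank the unique values by their counts.
We assign the ranks back to the positions of the original column.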
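
The four steps above can be sketched roughly as follows. The DataFrame is an assumption: it is reconstructed from the arrays and output tables printed further down in this answer, not taken from the asker's original post. The names codes, uniques, counts, order and ranks are only illustrative.

```python
import numpy as np
import pandas as pd

# Assumed sample data, reconstructed from the values printed in this answer.
df = pd.DataFrame({
    "id": range(1, 14),
    "cluster": [3, 3, 3, 3, 1, 1, 2, 2, 2, 4, 4, 5, 6],
})

# Step 1: integer factorization (codes start at zero).
codes, uniques = pd.factorize(df["cluster"])

# Step 2: count how often each code occurs.
counts = np.bincount(codes)

# Step 3: rank the unique values by count, largest count first.
order = np.argsort(-counts, kind="stable")   # order[k] = code with k-th largest count
ranks = np.empty_like(order)
ranks[order] = np.arange(len(order))         # ranks[code] = rank of that code

# Step 4: assign the ranks back to the original row positions.
relabeled = ranks[codes]
print(relabeled)  # [0 0 0 0 2 2 1 1 1 3 3 4 5]
```

A stable sort is used so that clusters with equal counts keep their order of first appearance; that tie-breaking rule is a choice made for this sketch.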

Method 1
Using Numpy's numpy.unique + argsort

TL;DR

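u, i, c = np.unique(
    df.cluster.values,
    return_inverse=True,
    return_counts=True
)
# The second argsort turns the sort order into a rank per unique value.
# For this sample data it gives the same result as a single argsort.
(-c).argsort().argsort()[i]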

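It turns out that numpy.unique performs the integer factorization and counts the values in one go. In the process we also get the unique values, but we don't really need them. The integer factorization isn't obvious, either: in numpy.unique, the return value we're looking for is called the inverse. It's called the inverse because it is meant to give you back the original array when applied to the array of unique values. So if we let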

u, i, c = np.unique(
    df.cluster.values,
    return_inverse=True,
    return_counts=True
)

You'll see that i looks like:

array([2, 2, 2, 2, 0, 0, 1, 1, 1, 3, 3, 4, 5])

And if we take u[i], we get back the original df.cluster.values

array([3, 3, 3, 3, 1, 1, 2, 2, 2, 4, 4, 5, 6])

But we're going to use it as the integer factorization.

Next, we need the counts c

array([2, 3, 4, 2, 1, 1])
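
To check the relationship between the three return values yourself, a small sketch along these lines may help. The values array is assumed to match df.cluster.values as printed above.

```python
import numpy as np

# Assumed to match df.cluster.values from the sample data.
values = np.array([3, 3, 3, 3, 1, 1, 2, 2, 2, 4, 4, 5, 6])

u, i, c = np.unique(values, return_inverse=True, return_counts=True)

# u: sorted unique values; i: slot of each row within u; c: rows per slot.
print(u)  # [1 2 3 4 5 6]
print(i)  # [2 2 2 2 0 0 1 1 1 3 3 4 5]
print(c)  # [2 3 4 2 1 1]

# The inverse rebuilds the original array, and counting the codes
# reproduces the counts.
print(np.array_equal(u[i], values))        # True
print(np.array_equal(np.bincount(i), c))   # True
```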

I'm going to propose using argsort, but it can be confusing. So I'll try to show it:

np.vstack([c, (-c).argsort()])

array([[2, 3, 4, 2, 1, 1],
       [2, 1, 0, 3, 4, 5]])

What argsort does in general is place in the top spot (position 0) the position to draw from in the original array.

#            position 2
#            is best
#                |
#                v
# array([[2, 3, 4, 2, 1, 1],
#        [2, 1, 0, 3, 4, 5]])
#         ^
#         |
#     top spot
#     from
#     position 2

#        position 1
#        goes to
#        second spot
#            |
#            v
# array([[2, 3, 4, 2, 1, 1],
#        [2, 1, 0, 3, 4, 5]])
#            ^
#            |
#        second spot
#        from
#        position 1

What this allows us to do is slice this argsort result with our integer factorization to arrive at a remapping of the ranks.

#     i is
#        [2 2 2 2 0 0 1 1 1 3 3 4 5]

#     (-c).argsort() is 
#        [2 1 0 3 4 5]

# argsort
# slice
#      \   / This is our integer factorization
#       a i
#     [[0 2]  <-- 0 is second position in argsort
#      [0 2]  <-- 0 is second position in argsort
#      [0 2]  <-- 0 is second position in argsort
#      [0 2]  <-- 0 is second position in argsort
#      [2 0]  <-- 2 is zeroth position in argsort
#      [2 0]  <-- 2 is zeroth position in argsort
#      [1 1]  <-- 1 is first position in argsort
#      [1 1]  <-- 1 is first position in argsort
#      [1 1]  <-- 1 is first position in argsort
#      [3 3]  <-- 3 is third position in argsort
#      [3 3]  <-- 3 is third position in argsort
#      [4 4]  <-- 4 is fourth position in argsort
#      [5 5]] <-- 5 is fifth position in argsort
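
One detail worth verifying on your own data: the rank of a unique value is the position at which its index appears in the argsort result, which is the inverse permutation of argsort. For the sample counts above that permutation happens to be its own inverse, so indexing the argsort result directly gives the same numbers. A sketch with hypothetical counts (not from the question) where the two differ:

```python
import numpy as np

# Hypothetical counts, chosen so that argsort is not its own inverse.
c = np.array([2, 1, 3])

order = np.argsort(-c, kind="stable")    # order[k] = index of the k-th largest count
rank = np.argsort(order, kind="stable")  # rank[j]  = rank of the j-th unique value

print(order)  # [2 0 1]
print(rank)   # [1 2 0]
```

Here the value with count 3 gets rank 0, count 2 gets rank 1 and count 1 gets rank 2, which is what rank holds; order would mislabel them.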

We can then drop it into pd.DataFrame.assign

u, i, c = np.unique(
    df.cluster.values,
    return_inverse=True,
    return_counts=True
)
# argsort twice: sort order first, then the rank of each unique value
df.assign(cluster=(-c).argsort().argsort()[i])

    id  cluster
0    1        0
1    2        0
2    3        0
3    4        0
4    5        2
5    6        2
6    7        1
7    8        1
8    9        1
9   10        3
10  11        3
11  12        4
12  13        5

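Method 2
I'm going to leverage the same concepts. However, I'll use pandas.factorize to get the integer factorization and numpy.bincount to count the values. The reason for this approach is that Numpy's unique actually sorts the values while factorizing and counting; pandas.factorize does not. For larger data sets, big O is our friend: factorizing and counting stay O(n), while the Numpy approach is O(n log n).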

i, u = pd.factorize(df.cluster.values)
c = np.bincount(i)
# argsort twice: sort order first, then the rank of each unique value
df.assign(cluster=(-c).argsort().argsort()[i])

    id  cluster
0    1        0
1    2        0
2    3        0
3    4        0
4    5        2
5    6        2
6    7        1
7    8        1
8    9        1
9   10        3
10  11        3
11  12        4
12  13        5
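
To see the difference in ordering between the two factorizations, here is a small comparison. The values array is assumed to match the sample column; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Assumed to match df.cluster.values from the sample data.
values = np.array([3, 3, 3, 3, 1, 1, 2, 2, 2, 4, 4, 5, 6])

# pandas.factorize keeps the uniques in order of first appearance.
codes, uniques = pd.factorize(values)
counts = np.bincount(codes)  # counts[k] belongs to uniques[k]

# numpy.unique returns the uniques sorted.
u_sorted, c_sorted = np.unique(values, return_counts=True)

print(uniques)   # [3 1 2 4 5 6]
print(counts)    # [4 2 3 2 1 1]
print(u_sorted)  # [1 2 3 4 5 6]
print(c_sorted)  # [2 3 4 2 1 1]
```

The codes and counts from either approach are internally consistent, so the ranking step works the same way on both.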

Answer 2

You can use groupby, transform, and rank:

df['cluster'] = df.groupby('cluster').transform('count')\
                  .rank(ascending=False, method='dense')\
                  .sub(1).astype(int)

Output:

   id  cluster
0   1        0
1   2        0
2   3        0
3   4        0
4   5        2
5   6        2
6   7        1
7   8        1
8   9        1
9  10        3

Answer 3

Using category and value_counts

df.cluster.map(
    (-df.cluster.value_counts()).astype('category').cat.codes
)
Out[151]: 
0    0
1    0
2    0
3    0
4    2
5    2
6    1
7    1
8    1
9    3
Name: cluster, dtype: int8
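
Note that the category codes are derived from the distinct count values, so clusters with equal counts share a code. If every cluster should keep its own label, one possible variation (a sketch, not part of the original answer) is to enumerate the index of value_counts, which is sorted by count in descending order by default. The Series below is assumed from the values shown in Answer 1.

```python
import pandas as pd

# Assumed sample column, reconstructed from the values shown in Answer 1.
cluster = pd.Series([3, 3, 3, 3, 1, 1, 2, 2, 2, 4, 4, 5, 6], name="cluster")

# value_counts() sorts by count, largest first, so its index is already ranked.
ranked_ids = cluster.value_counts().index

# Map each original cluster ID to its position in that ranking.
mapping = {cluster_id: rank for rank, cluster_id in enumerate(ranked_ids)}

relabeled = cluster.map(mapping)
print(relabeled.tolist())
```

How ties between equally sized clusters are ordered depends on value_counts, so only the labels of clusters with a unique size are fully determined.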

Answer 4

This isn't the cleanest solution, but it does work. Feel free to suggest improvements:

valueCounts = df.groupby('cluster')['cluster'].count()
valueCounts_sorted = valueCounts.sort_values(ascending=False)

# Compare against a copy of the original labels so that freshly
# assigned ranks are never mistaken for old cluster IDs.
original = df.cluster.copy()
count = 0
for i in valueCounts_sorted.index.values:
    idx = original[original == i].index.values
    df.loc[idx, "cluster"] = count

    count += 1
