Question

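I have a very large dataframe with 1,000 columns. The first few columns occur only once and describe a customer. The remaining columns represent multiple encounters with the customer, each named with an underscore and the encounter number. Every additional encounter adds a new column, so the number of columns is not fixed; it grows over time.

Sample dataframe header structure excerpt: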

id    dob    gender    pro_1    pro_10   pro_11   pro_2 ... pro_9    pre_1   pre_10   ...

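I am trying to reorder the columns based on the number after the column name, so that all the _1 columns are together, all the _2 columns are together, and so on: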

id    dob    gender    pro_1    pre_1    que_1    fre_1    gen_1    pro_2    pre_2    que_2    fre_2    ...

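(Note that the reordering should sort the numbers correctly; the current order treats them as strings, which gives 1, 10, 11, etc. rather than 1, 2, 3.)

Is this possible in pandas, or should I be looking at something else? Any help would be greatly appreciated! Thank you!

Edit:

Alternatively, is it also possible to rearrange the column names based on the string part AND the number part of the column names? The output would then look similar to the original, except that the numbers would be taken into account so that the order is more intuitive: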

id    dob    gender    pro_1    pro_2    pro_3    ...    pre_1    pre_2    pre_3   ...

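EDIT 2.0:

Just wanted to thank everyone for helping! While only one of the responses worked, I really appreciate the effort and learned a lot from the other approaches.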

Answer 1

Luckily there is a one-liner in Python that can do this:

df = df.reindex(sorted(df.columns), axis=1)

For example, say you have this dataframe:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': [2, 4, 8, 0],
                   'ID': [2, 0, 0, 0],
                   'Prod3': [10, 2, 1, 8],
                   'Prod1': [2, 4, 8, 0],
                   'Prod_1': [2, 4, 8, 0],
                   'Pre7': [2, 0, 0, 0],
                   'Pre2': [10, 2, 1, 8],
                   'Pre_2': [10, 2, 1, 8],
                   'Pre_9': [10, 2, 1, 8]}
                   )

print(df)

Output:

   Name  ID  Prod3  Prod1  Prod_1  Pre7  Pre2  Pre_2  Pre_9
0     2   2     10      2       2     2    10     10     10
1     4   0      2      4       4     0     2      2      2
2     8   0      1      8       8     0     1      1      1
3     0   0      8      0       0     0     8      8      8

Then using

df = df.reindex(sorted(df.columns), axis=1)

the dataframe will look like this:

   ID  Name  Pre2  Pre7  Pre_2  Pre_9  Prod1  Prod3  Prod_1
0   2     2    10     2     10     10      2     10       2
1   0     4     2     0      2      2      4      2       4
2   0     8     1     0      1      1      8      1       8
3   0     0     8     0      8      8      0      8       0

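As you can see, within each prefix the columns without an underscore come first (digits sort before '_' in ASCII), followed by the underscore columns ordered by the characters after the underscore. However, this also sorts the column names themselves, so names that come first alphabetically end up first. Because this is a plain string sort, a name such as Pre_10 would be placed before Pre_2.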
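To see how plain `sorted` treats numeric suffixes, here is a minimal sketch. The column names are made up for illustration and are not taken from the asker's data.

```python
import pandas as pd

# Hypothetical column names, chosen only to illustrate the ordering.
df = pd.DataFrame([[0, 1, 2, 3]], columns=["pro_2", "pro_10", "pro_1", "id"])

# Plain sorted() compares the names character by character,
# so "pro_10" lands before "pro_2" because "1" < "2".
lexicographic = sorted(df.columns)
print(lexicographic)

df_sorted = df.reindex(lexicographic, axis=1)
```

This is the string-style ordering (1, 10, 2) that the question describes, so a numeric sort key is still needed for multi-digit encounter numbers.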

Answer 2

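You need to split the column names on '_' and then convert the second part to int:

import numpy as np
import pandas as pd

c = ['A_1','A_10','A_2','A_3','B_1','B_10','B_2','B_3']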

df = pd.DataFrame(np.random.randint(0,100,(2,8)), columns = c)
df.reindex(sorted(df.columns, key = lambda x: int(x.split('_')[1])), axis=1)

Output:

   A_1  B_1  A_2  B_2  A_3  B_3  A_10  B_10
0   68   11   59   69   37   68    76    17
1   19   37   52   54   23   93    85     3
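Note that `x.split('_')[1]` raises an `IndexError` for names without an underscore, such as `id` or `dob` in the question. One possible workaround, sketched here under the assumption that unsuffixed columns should stay at the front, is a key that returns a sentinel for them. `suffix_number` is a hypothetical helper, not a pandas function.

```python
import pandas as pd

def suffix_number(name):
    """Return the integer after the last '_', or -1 if there is no numeric suffix."""
    head, sep, tail = name.rpartition("_")
    return int(tail) if sep and tail.isdigit() else -1

cols = ["id", "dob", "gender", "pro_1", "pro_10", "pro_2", "pre_1", "pre_10", "pre_2"]
df = pd.DataFrame([range(len(cols))], columns=cols)

# sorted() is stable, so columns with equal keys keep their original relative order:
# id/dob/gender stay in front, and pro_N stays ahead of pre_N as in the input.
ordered = sorted(df.columns, key=suffix_number)
df_grouped = df.reindex(ordered, axis=1)
print(list(df_grouped.columns))
```

Because the sort is stable, the order of prefixes inside each encounter number follows the original column order rather than the alphabet.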

For the second case, you need human sorting:

import re
def atoi(text):
    return int(text) if text.isdigit() else text

def natural_keys(text):
    '''
    alist.sort(key=natural_keys) sorts in human order
    http://nedbatchelder.com/blog/200712/human_sorting.html
    (See Toothy's implementation in the comments)
    '''
    return [ atoi(c) for c in re.split(r'(\d+)', text) ]



df.reindex(sorted(df.columns, key=natural_keys), axis=1)

Output:

   A_1  A_2  A_3  A_10  B_1  B_2  B_3  B_10
0   68   59   37    76   11   69   68    17
1   19   52   23    85   37   54   93     3

Answer 3

Here is one way you can try:

import pandas as pd

# column names copied from your example
example_cols = 'id    dob    gender    pro_1    pro_10   pro_11   pro_2  pro_9    pre_1   pre_10'.split()

# sample DF
df = pd.DataFrame([range(len(example_cols))], columns=example_cols)
df
#   id  dob  gender  pro_1  pro_10  pro_11  pro_2  pro_9  pre_1  pre_10
#0   0    1       2      3       4       5      6      7      8       9

# number of columns excluded from sorting
N = 3

# get a list of columns from the dataframe
cols = df.columns.tolist()

# split each name into a tuple of (column_name, prefix, number), sort on the 2nd and 3rd items of the tuple, then take the 1st item back out.
# use "key = lambda x: x[2]" instead to group cols by number only
cols_new = cols[:N] + [ a[0] for a in sorted([ (c, p, int(n)) for c in cols[N:] for p,n in [c.split('_')]], key = lambda x: (x[1], x[2])) ]

# get the new dataframe based on the cols_new
df_new = df[cols_new]
#   id  dob  gender  pre_1  pre_10  pro_1  pro_2  pro_9  pro_10  pro_11
#0   0    1       2      8       9      3      6      7       4       5
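The one-liner above can be unrolled into a small function, which may be easier to adapt. This is only a sketch of the same idea: `reorder_columns` and its `group_by_number` flag are illustrative names, and it assumes every reordered column is named `<prefix>_<integer>`.

```python
import pandas as pd

def reorder_columns(df, n_fixed, group_by_number=False):
    """Return df with the columns after the first n_fixed reordered.

    Assumes every reordered column is named '<prefix>_<integer>'.
    """
    cols = df.columns.tolist()
    fixed, variable = cols[:n_fixed], cols[n_fixed:]

    def key(name):
        prefix, number = name.rsplit("_", 1)
        number = int(number)
        # (number, prefix) puts all _1 columns together, then all _2 columns, ...
        # (prefix, number) keeps each prefix together in numeric order.
        return (number, prefix) if group_by_number else (prefix, number)

    return df[fixed + sorted(variable, key=key)]

example_cols = ["id", "dob", "gender", "pro_1", "pro_10", "pro_11",
                "pro_2", "pro_9", "pre_1", "pre_10"]
df = pd.DataFrame([range(len(example_cols))], columns=example_cols)

by_prefix = reorder_columns(df, n_fixed=3)
by_number = reorder_columns(df, n_fixed=3, group_by_number=True)
print(list(by_prefix.columns))
print(list(by_number.columns))
```

With `group_by_number=True` the prefixes inside each number are ordered alphabetically; if the original prefix order matters, sorting on the number alone (a stable sort) would preserve it instead.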

Answer 4

Try this.

To reorder the columns based on the number after the column name:

cols_fixed = list(df.columns[:3])  # change index no based on your df
cols_variable = list(df.columns[3:])  # change index no based on your df
cols_variable = sorted(cols_variable, key=lambda x: int(x.split('_')[1]))  # sort by the number after '_'
cols_new = cols_fixed + cols_variable  # both are plain lists, so + concatenates them
new_df = df[cols_new]

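To rearrange the columns based on both the string part and the number part of the column names:

cols_fixed = list(df.columns[:3])  # change index no based on your df
cols_variable = list(df.columns[3:])  # change index no based on your df
# sort by prefix first, then by the number after '_' as an int (so 2 comes before 10)
cols_variable = sorted(cols_variable, key=lambda x: (x.split('_')[0], int(x.split('_')[1])))
cols_new = cols_fixed + cols_variable  # both are plain lists, so + concatenates them
new_df = df[cols_new]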

Reorder columns in groups by number embedded in column name?

4 answers: