Question

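I need to clean up a dataset in which some columns (read from a .csv file) may carry several names, listed with commas.

I need to do the following in pandas:

Are there any good pandas tricks for this?

Here is a simple example: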

import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.random.random(size=(5,6)),
                  columns=['a', 'b', 'c, d', 'e', 'f, g', 'h'])


df=
          a         b      c, d         e      f, g         h
0  0.771418  0.371685  0.072876  0.153071  0.169513  0.399769
1  0.667551  0.886779  0.949341  0.869588  0.226275  0.273370
2  0.768456  0.945822  0.167757  0.584886  0.328152  0.246415
3  0.354713  0.690585  0.027916  0.237110  0.875449  0.430142
4  0.590518  0.819043  0.803876  0.909385  0.382452  0.867369

I need:

df_new = 

          a         b         c         d         e         f         g         h
0  0.771418  0.371685  0.072876  0.072876  0.153071  0.169513  0.169513  0.399769
1  0.667551  0.886779  0.949341  0.949341  0.869588  0.226275  0.226275  0.273370
2  0.768456  0.945822  0.167757  0.167757  0.584886  0.328152  0.328152  0.246415
3  0.354713  0.690585  0.027916  0.027916  0.237110  0.875449  0.875449  0.430142
4  0.590518  0.819043  0.803876  0.803876  0.909385  0.382452  0.382452  0.867369

Update

What happens if I have duplicate column names:

df = pd.DataFrame(data=np.random.random(size=(5,6)),
                  columns=['a', 'b', 'c, d', 'c', 'f, g', 'h'])

The expected result should then be

df_new_v2 =

          a         b         c         d       c.1         f         g         h
0  0.771418  0.371685  0.072876  0.072876  0.153071  0.169513  0.169513  0.399769
1  0.667551  0.886779  0.949341  0.949341  0.869588  0.226275  0.226275  0.273370
2  0.768456  0.945822  0.167757  0.167757  0.584886  0.328152  0.328152  0.246415
3  0.354713  0.690585  0.027916  0.027916  0.237110  0.875449  0.875449  0.430142
4  0.590518  0.819043  0.803876  0.803876  0.909385  0.382452  0.382452  0.867369

Answer 1

You can first create a MultiIndex in the columns with the header parameter, then loop over the first level and use concat:

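import pandas as pd

# `file` is the path to the CSV; header=[0, 1] reads the first two rows
# as a two-level column MultiIndex.
df = pd.read_csv(file, header=[0, 1])

L = []
cols = df.columns.get_level_values(0)
for x in cols:
    # Split the second-level name under x on commas.
    c = df[x].columns.str.split(',')[0]
    # Repeat the column once per name, using the names as keys.
    a = pd.concat([df[x].squeeze()] * len(c), axis=1, keys=c)
    L.append(a)
df = pd.concat(L, axis=1, keys=cols)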
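
To make the input that this loop expects more concrete, here is a small sketch under assumptions: the CSV text, the group labels g1 and g2, and the variable names are all made up for illustration and are not from the question. It reads a two-row header from an in-memory string and applies the same split-and-concat idea, with a strip() added so that ' d' becomes 'd'.

```python
import io
import pandas as pd

# Hypothetical CSV: row 0 holds group labels, row 1 holds the
# (possibly comma-separated) names. Neither comes from the question.
csv_text = 'g1,g2\n"c, d",e\n1,2\n3,4\n'
df = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

parts = []
groups = list(df.columns.get_level_values(0))
for g in groups:
    # Names listed in the second header row under group g, stripped of spaces.
    names = [n.strip() for n in df[g].columns.str.split(",")[0]]
    col = df[g].squeeze()  # single column under g, as a Series
    # One copy of the column per name.
    parts.append(pd.concat([col] * len(names), axis=1, keys=names))

# Reattach the group labels as the first column level.
out = pd.concat(parts, axis=1, keys=groups)
print(out.columns.tolist())
```

This assumes each first-level label covers exactly one column; with several columns under one label, squeeze() would not return a Series and the loop would need adjusting.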

With the sample data:

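import numpy as np
import pandas as pd

# A list keeps the column order fixed; a set has no defined order.
df = pd.DataFrame(data=np.random.random(size=(5,6)),
                  columns=['c', 'h', 'a', 'c, d', 'b', 'f, g'])

#print (df)
L = []
for x in df.columns:
    # One output column per comma-separated name.
    c = x.split(', ')
    a = pd.concat([df[x].squeeze()] * len(c), axis=1, keys=c)
    L.append(a)

df = pd.concat(L, axis=1)
# Append .1, .2, ... to repeated names; the first occurrence keeps its bare name.
s = df.columns.to_series()
df.columns = s + s.groupby(s).cumcount().astype(str).radd('.').str.replace('.0', '', regex=False)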

print (df)
          c         h         a       c.1         d         b         f  \
0  0.846482  0.285415  0.695800  0.497593  0.497593  0.159911  0.286545   
1  0.195390  0.369074  0.371147  0.102207  0.102207  0.924279  0.349958   
2  0.967811  0.059451  0.942390  0.826203  0.826203  0.722080  0.196833   
3  0.546076  0.789354  0.876819  0.243305  0.243305  0.391054  0.213517   
4  0.311528  0.544023  0.380844  0.308427  0.308427  0.511651  0.795380   

          g  
0  0.286545  
1  0.349958  
2  0.196833  
3  0.213517  
4  0.795380  
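
For reuse, the two steps above (copy each column once per listed name, then suffix repeated names) can be wrapped in one function. This is only a sketch: expand_multi_name_columns and its sep parameter are hypothetical names, and it selects columns by position with iloc instead of concatenating, which is a swapped-in technique rather than the approach used above.

```python
import numpy as np
import pandas as pd


def expand_multi_name_columns(df, sep=","):
    """Hypothetical helper: copy each column once per name in its header."""
    positions = []  # source column position for each output column
    names = []      # output column names, before de-duplication
    for pos, label in enumerate(df.columns):
        for part in str(label).split(sep):
            positions.append(pos)
            names.append(part.strip())

    # Repeated positions in iloc duplicate the corresponding columns.
    out = df.iloc[:, positions].copy()

    # Suffix repeats as name.1, name.2, ...; the first occurrence stays bare.
    s = pd.Series(names)
    dup = s.groupby(s).cumcount()
    out.columns = [n if k == 0 else f"{n}.{k}" for n, k in zip(names, dup)]
    return out


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 6)),
                  columns=["a", "b", "c, d", "c", "f, g", "h"])
df_new = expand_multi_name_columns(df)
print(df_new.columns.tolist())
# ['a', 'b', 'c', 'd', 'c.1', 'f', 'g', 'h']
```

One caveat: a generated name such as c.1 would clash with a column that is already called c.1 in the input, so check for that if your data can contain such names.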

Answer 2

Here is a very simple approach

x <- c(1,1,1,1,1,1,1,2,2,2,2,2,3,3,3,4,6,6,9,10,16,21)
y <- c(1,2,3,5,6,8,18,1,2,5,6,7,8,12,15,16,11,17,18,19,20,21)
z <- c(1,6,11)

Duplicate a column that has multiple names in Pandas
