Question

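My data frame is messy and I need to clean up its columns. Basically, many columns hold combined data that should not be together, and some column names differ only by slight spelling variations! For example, this: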

         1    1/2    2c     2 c     
row
1       B     nan    C       nan 
2       B     nan    C       nan
3       nan   Rb     nan     nan
4       c     nan    nan     C

should look like this:

         1    2c    
row
1       B     C       
2       B     C       
3       Rb    Rb   
4       c     C

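So the problem is twofold: how do I merge columns that were split apart because their names are only fuzzily similar, and how do I split and then merge columns that hold combined values?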

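The only way I know to do this is to create a new column with .apply and an if statement, but with the number of columns in the hundreds, that would be painful. Any ideas for a less manual solution?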

Answer 1

Try:

import numpy as np
import pandas as pd

d0 = df.filter(regex='/')            # Grab the columns with "/" in the name
d1 = df.drop(columns=d0.columns)     # Drop those columns

a = d0.to_numpy()
m = d0.columns.str.count('/')        # Count the number of "/" in each name

d2 = pd.DataFrame(
    a.repeat(m + 1, axis=1),         # Repeat each column once more than its number of "/"
    d0.index,
    np.concatenate(d0.columns.str.split('/'))
)

d3 = pd.concat([d1, d2], axis=1)     # Smash them back together

# Grab the leading digits of each column name,
# group by that and take the first non-null value.
# (Transposing replaces groupby(..., axis=1), which recent pandas deprecates.)
d3.T.groupby(np.ravel(d3.columns.str.extract(r'(\d+)'))).first().T

    1   2
1   B   C
2   B   C
3  Rb  Rb
4   c   C
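
For anyone who wants to try the idea end to end, here is a minimal self-contained sketch of the same split-then-group approach. The helper name merge_similar_columns and its sep parameter are made up for illustration and are not part of pandas. It assumes the column names are unique strings, that the "nan" entries are real missing values, and that the leading digits are the right grouping key for your data.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question; the "nan" cells are real NaN values.
df = pd.DataFrame(
    {
        "1":   ["B", "B", np.nan, "c"],
        "1/2": [np.nan, np.nan, "Rb", np.nan],
        "2c":  ["C", "C", np.nan, np.nan],
        "2 c": [np.nan, np.nan, np.nan, "C"],
    },
    index=pd.Index([1, 2, 3, 4], name="row"),
)


def merge_similar_columns(df: pd.DataFrame, sep: str = "/") -> pd.DataFrame:
    """Hypothetical helper: split combined columns, then merge by leading digits."""
    pieces = []
    for name in df.columns:
        # A combined name such as "1/2" contributes one copy of its data per part.
        for part in str(name).split(sep):
            pieces.append(df[name].rename(part.strip()))
    wide = pd.concat(pieces, axis=1)

    # Group key: the first run of digits in each (already split) column name.
    # Names without any digits get a NaN key and are dropped by groupby.
    keys = wide.columns.str.extract(r"(\d+)", expand=False).to_numpy()

    # Transpose so the groups run over rows, keep the first non-null value
    # in each group, then transpose back.
    return wide.T.groupby(keys).first().T


result = merge_similar_columns(df)
print(result)
```

On the sample frame this should give the same two columns, 1 and 2, as the output above. If your real names do not start with a number, the regular expression would need to change, for example to a normalized version of the name with whitespace and case removed.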

pandas: how to split and merge columns with similar names?
