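My data frame is dirty and its columns need cleaning. Basically, many columns contain combined data that shouldn't be combined, and there are slight spelling differences between column names! For example, I want to turn this: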
       1  1/2   2c  2 c
row
1      B  nan    C  nan
2      B  nan    C  nan
3    nan   Rb  nan  nan
4      c  nan  nan    C
into this:
      1  2c
row
1     B   C
2     B   C
3    Rb  Rb
4     c   C
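So the problem is twofold: how do I merge columns that were split apart because their names are only fuzzily similar, and how do I split and then merge columns that hold combined values?
The only way I know to do this is to create a new column that uses .apply to run an if statement, but with the number of columns in the hundreds, that would be painful. Any ideas for a less manual solution?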
Answer 0 (score: 2)
import numpy as np
import pandas as pd

# df is the dirty data frame from the question
d0 = df.filter(regex='/')           # Grab the columns with "/" in the name
d1 = df.drop(columns=d0.columns)    # Drop those columns
a = d0.to_numpy()
m = d0.columns.str.count('/')       # Count the number of "/" in each name
d2 = pd.DataFrame(
    a.repeat(m + 1, axis=1),        # Repeat each column one more time than its number of "/"
    d0.index,
    np.concatenate(d0.columns.str.split('/'))
)
d3 = pd.concat([d1, d2], axis=1)    # Smash them back together
# Grab the first run of digits in each column name,
# group by that and take the first non-null value.
# Grouping the transposed frame avoids groupby(axis=1), which newer pandas deprecates.
d3.T.groupby(np.ravel(d3.columns.str.extract(r'(\d+)'))).first().T
      1   2
1     B   C
2     B   C
3    Rb  Rb
4     c   C
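
To try the approach end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the question, assuming the "nan" entries are real NaN values rather than strings, and wraps the steps in a helper. The name collapse_columns and its sep parameter are hypothetical and not part of pandas; sep is assumed to be a single character with no special regex meaning.

```python
import numpy as np
import pandas as pd

# Sample frame reconstructed from the question; "nan" is assumed to be a real NaN.
df = pd.DataFrame(
    {
        "1": ["B", "B", np.nan, "c"],
        "1/2": [np.nan, np.nan, "Rb", np.nan],
        "2c": ["C", "C", np.nan, np.nan],
        "2 c": [np.nan, np.nan, np.nan, "C"],
    },
    index=pd.Index([1, 2, 3, 4], name="row"),
)


def collapse_columns(frame, sep="/"):
    """Hypothetical helper: split combined columns, then merge by leading digits."""
    combined = frame.filter(regex=sep)             # columns whose name contains sep
    rest = frame.drop(columns=combined.columns)    # every other column
    counts = combined.columns.str.count(sep)       # separators per column name
    # One copy of each combined column per name part, e.g. "1/2" -> "1" and "2".
    exploded = pd.DataFrame(
        np.repeat(combined.to_numpy(), counts.to_numpy() + 1, axis=1),
        index=combined.index,
        columns=[part for name in combined.columns for part in name.split(sep)],
    )
    merged = pd.concat([rest, exploded], axis=1)
    # First run of digits in each name, so "2c", "2 c" and "2" share the key "2".
    keys = merged.columns.str.extract(r"(\d+)", expand=False)
    # Group the transposed frame by key and keep the first non-null value.
    return merged.T.groupby(keys.to_numpy()).first().T


result = collapse_columns(df)
print(result)
```

Grouping on the leading digits is what merges "2c" with "2 c", so this only works if every column name starts with a numeric identifier. Names that differ in other ways would need a different key, for example one built by lowercasing and stripping whitespace.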