How to merge strings in a pandas df

Date: 2018-07-25 04:31:12

Tags: python pandas merge

I am trying to merge specific strings in a pandas df. The df below is just an example. The values in my df will differ, but the same basic rules will apply. I basically want to merge the values in each row until there's a 4 letter string.

Whilst the 4 letter string in this df is always Excl, my df will contain numerous 4 letter strings.

import pandas as pd

d = ({
    'A' : ['Include','Inclu','Incl','Inc'],
    'B' : ['Excl','de','ude','l'],           
    'C' : ['X','Excl','Excl','ude'],
    'D' : ['','Y','ABC','Excl'],
    })

df = pd.DataFrame(data=d)

Out:

         A     B     C     D
0  Include  Excl     X      
1    Inclu    de  Excl     Y
2     Incl   ude  Excl   ABC
3      Inc     l   ude  Excl

Intended Output:

         A     B     C     D
0  Include  Excl     X      
1  Include        Excl     Y 
2  Include        Excl   ABC
3  Include              Excl

So row 0 stays the same, as Col B has 4 letters. Row 1 merges Col A and B, as Col C has 4 letters. Row 2 follows the same rule as row 1. Row 3 merges Col A, B and C, as Col D has 4 letters.

I have tried to do this manually by merging all the columns and then going back and removing the unwanted values.

df["Com"] = df["A"].map(str) + df["B"]  + df["C"] 

But I would have to manually go through each row and remove different lengths of letters.

The above df is just an example. The common rule is that I need to merge everything before the 4 letter string.

3 answers:

Answer 0: (score: 1)

Try this.

Sorry for the clumsy solution, I will try to improve the performance.

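import numpy as np

# True in the column just left of each 'Excl' (mask shifted one column left)
temp = df.eq('Excl').shift(-1, axis=1, fill_value=False).astype(bool)
# Label of the last column to merge in each row
df['end'] = temp.idxmax(axis=1)
# Join the strings from column A through the 'end' column
res = df.apply(lambda x: ''.join(x.loc[:x['end']]), axis=1)
# Extend each True leftwards along its row to mark every merged cell
mask = temp.replace(False, np.nan).bfill(axis=1).fillna(False).astype(bool)
del df['end']
df[:] = np.where(mask, '', df)
df['A'] = res
print(df)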

Output:

         A     B     C     D
0  Include  Excl     X      
1  Include        Excl     Y
2  Include        Excl   ABC
3  Include              Excl

Improved solution:

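# Join each row's strings up to the column just left of 'Excl'
res = df.apply(lambda x: ''.join(x.loc[:x.eq('Excl').shift(-1, fill_value=False).idxmax()]), axis=1)
# Mark every merged cell by filling each True leftwards along its row
mask = df.eq('Excl').shift(-1, axis=1, fill_value=False).replace(False, np.nan).bfill(axis=1).fillna(False).astype(bool)
df[:] = np.where(mask, '', df)
df['A'] = res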

More simplified solution:

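# Shifted mask: True in the column just left of each 'Excl'
t = df.eq('Excl').shift(-1, axis=1, fill_value=False)
res = df.apply(lambda x: ''.join(x.loc[:x.eq('Excl').shift(-1, fill_value=False).idxmax()]), axis=1)
# A reverse cumulative sum along each row flags every column up to that one
merged = t.astype(int).iloc[:, ::-1].cumsum(axis=1).iloc[:, ::-1] >= 1
df[:] = np.where(merged, '', df)
df['A'] = res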
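
The snippets above look for the literal string 'Excl', while the question mentions that the real data contains many different 4 letter strings. A minimal sketch of the same positional idea using a length test instead, assuming column A never acts as a marker and that merged fragments in the other columns are never exactly 4 letters long (the helper name `merge_before_marker` and its `marker_len` parameter are made up for illustration):

```python
import numpy as np
import pandas as pd

def merge_before_marker(df, marker_len=4):
    """Hypothetical helper: merge each row's cells left of its first marker."""
    values = df.to_numpy(dtype=object)
    # Skip column 0 so that a 4-letter fragment such as 'Incl' in A is not a marker
    is_marker = np.vectorize(len)(values[:, 1:]) == marker_len
    # Position (within the full frame) of the first marker in each row
    first = is_marker.argmax(axis=1) + 1
    # Cells strictly left of the marker are the ones that get merged
    merged = np.arange(values.shape[1]) < first[:, None]
    out = pd.DataFrame(np.where(merged, '', values),
                       index=df.index, columns=df.columns)
    out[df.columns[0]] = [''.join(row[:k]) for row, k in zip(values, first)]
    return out

df = pd.DataFrame({
    'A': ['Include', 'Inclu', 'Incl', 'Inc'],
    'B': ['Excl', 'de', 'ude', 'l'],
    'C': ['X', 'Excl', 'Excl', 'ude'],
    'D': ['', 'Y', 'ABC', 'Excl'],
})
result = merge_before_marker(df)
print(result)
```

A row with no 4 letter string at all is left unchanged, because `argmax` returns 0 when nothing matches.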

Answer 1: (score: 1)

You can do something like this:

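# True for the cells (excluding column A) left of the first 4 letter string in each row
mask = (df.iloc[:, 1:].apply(lambda s: s.str.len()) == 4).cumsum(axis=1) == 0
# Append those cells to column A, then blank them out
df['A'] = df['A'] + df.iloc[:, 1:][mask].apply(lambda x: x.str.cat(), axis=1)
df.iloc[:, 1:] = df.iloc[:, 1:][~mask].fillna('')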
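
To see why the cumulative sum works as a mask, it may help to trace it on a single row. The sketch below uses columns B to D of row 1 from the question; the variable names are only illustrative:

```python
import pandas as pd

# Columns B..D of row 1 in the example df
row = pd.Series(['de', 'Excl', 'Y'], index=['B', 'C', 'D'])

is_marker = row.str.len() == 4      # B False, C True, D False
seen = is_marker.cumsum()           # B 0, C 1, D 1
before_marker = seen == 0           # True only left of the first marker

# Fragment that gets appended to column A
fragment = row[before_marker].str.cat()
print(fragment)
```

The running count stays at 0 until the first 4 letter string is reached, so `seen == 0` selects exactly the cells that belong to the merged word.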

Answer 2: (score: 0)

Here is a rough approach. We look up the position of 'Excl' in each row and merge the column values up to it to get the desired output.

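ls = []
for i in range(len(df)):
    # Label of the first column holding 'Excl' in row i
    end = df.loc[i, :].index[df.loc[i, :] == 'Excl'][0]
    # Join everything up to that column, dropping 'Excl' itself
    ls.append(''.join(df.loc[i, :end].replace({'Excl': ''}).values))
# Only column A is overwritten; the other columns keep their fragments
df['A'] = ls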
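
The loop above fills column A but leaves the merged fragments in the other columns. A possible way to finish the job with the same row-by-row idea is sketched below; the function name `merge_rows` and its `marker` parameter are hypothetical, and rows without the marker (or with it in the first column) are simply skipped:

```python
import pandas as pd

def merge_rows(df, marker='Excl'):
    """Hypothetical helper: merge cells left of `marker` into the first column."""
    out = df.copy()
    for r in range(len(out)):
        row = out.iloc[r]
        hits = [c for c, v in enumerate(row) if v == marker]
        if not hits or hits[0] == 0:
            continue  # nothing to merge in this row
        pos = hits[0]
        merged = ''.join(row.iloc[:pos])
        out.iloc[r, :pos] = ''      # blank the merged fragments
        out.iloc[r, 0] = merged     # put the joined word in the first column
    return out

df = pd.DataFrame({
    'A': ['Include', 'Inclu', 'Incl', 'Inc'],
    'B': ['Excl', 'de', 'ude', 'l'],
    'C': ['X', 'Excl', 'Excl', 'ude'],
    'D': ['', 'Y', 'ABC', 'Excl'],
})
merged_df = merge_rows(df)
print(merged_df)
```

Working on a copy keeps the original df intact, which makes it easier to compare the result against the intended output.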