Processing a DataFrame omits the preceding row values

Time: 2018-01-27 16:11:11

Tags: python pandas

This code:

import numpy as np
import pandas as pd

df = pd.DataFrame([['stop' , '1'], ['a1' , '2'], ['a1' , '3'], ['stop' , '4'], ['a2' , '5'], ['wildcard' , '6']] , columns=['a' , 'b'])

print(df)

prints:

          a  b
0  stop      1
1  a1        2
2  a1        3
3  stop      4
4  a2        5
5  wildcard  6

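I am trying to build a new DataFrame in which each 'stop' starts a new row. That row holds a list of tuples built from the rows that follow the stop: the value of column 'a' is the first element of each tuple and the value of column 'b' is the second. For the df above, the transformed DataFrame df_post should look like this: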

df_post = pd.DataFrame([['stop' , [('a1' , '2') , ('a1' , '3')]] , ['stop' , [('a2' , '5')]]] , columns=['a' , 'b'])

print(df_post)

      a                   b
0  stop  [(a1, 2), (a1, 3)]
1  stop  [(a2, 5)]         

'wildcard' is also a stop condition: when it is encountered, a new row is inserted into df_post in the same way.

Here is what I have so far:

df['stop_loc'] = ( (df['a'] == 'stop') | (df['a'] == 'wildcard') ).cumsum()
df_new = df[(df['a'] != 'stop') & (df['stop_loc'] != df['stop_loc'].max())].groupby('stop_loc').apply(lambda x: list(zip(x.a, x.b)))
df_new

which gives:

stop_loc
1    [(a1, 2), (a1, 3)]
2    [(a2, 5)]         
dtype: object

The 'stop' value is not inserted as a row. How can I modify this so that the resulting DataFrame is

      a                   b
0  stop  [(a1, 2), (a1, 3)]
1  stop  [(a2, 5)]         

instead of:

stop_loc
1    [(a1, 2), (a1, 3)]
2    [(a2, 5)]         
dtype: object 

1 Answer:

Answer 0: (score: 1)

You are filtering out the stop rows with df['a'] != 'stop', so they never reach the grouped result. Here is an alternative:

# df['stop_loc'] = ( (df['a'] == 'stop') | (df['a'] == 'wildcard') ).cumsum()
df['stop_loc'] = df['a'].isin(['stop', 'wildcard']).cumsum()

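def zip_entries(x):
    # The first row of each group is the stop marker; the remaining rows become (a, b) tuples.
    return x.a.iloc[0], list(zip(x.a.iloc[1:], x.b.iloc[1:]))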

df_new = (df[(df['stop_loc'] != df['stop_loc'].max())]
          .groupby('stop_loc')
          .apply(zip_entries)
          .apply(pd.Series))

print(df_new)
#              0                   1
# stop_loc                          
# 1         stop  [(a1, 2), (a1, 3)]
# 2         stop           [(a2, 5)]
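
The result above is labelled with columns 0 and 1 and indexed by stop_loc, while the question asks for columns 'a' and 'b' with a plain 0..n-1 index. One possible way to get that shape is sketched below. It is not the only approach, and it makes two assumptions that are not stated in the question: the data always begins with a stop marker, and a marker with no rows after it (such as the trailing 'wildcard') produces no output row. The names is_stop, group_id, and rows are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    [['stop', '1'], ['a1', '2'], ['a1', '3'],
     ['stop', '4'], ['a2', '5'], ['wildcard', '6']],
    columns=['a', 'b'],
)

# True on every row that opens a new group ('stop' or 'wildcard').
is_stop = df['a'].isin(['stop', 'wildcard'])
# Running count of markers: every row gets the number of its group.
group_id = is_stop.cumsum()

rows = []
for _, block in df.groupby(group_id):
    # Assumption: the first row of each group is the marker itself.
    marker = block['a'].iloc[0]
    body = block.iloc[1:]
    if body.empty:
        # Assumption: a marker with nothing after it yields no row.
        continue
    rows.append({'a': marker, 'b': list(zip(body['a'], body['b']))})

df_post = pd.DataFrame(rows, columns=['a', 'b'])
print(df_post)
```

For the sample data this yields two rows, both with 'stop' in column 'a'. Unlike the df['stop_loc'] != df['stop_loc'].max() filter, which always drops the last group, this sketch drops only groups that have no rows after the marker, so a final 'stop' followed by data would be kept. A group opened by 'wildcard' would carry 'wildcard' in column 'a'; replace marker with the literal 'stop' if that is not wanted.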