I have a pandas DataFrame like this:
id desc
1 Description
1 02.09.2017 15:00 abcd
1 this is a sample description
1 which is continued here also
1
1 Description
1 01.09.2017 12:00 absd
1 this is another sample description
1 which might be continued here
1 or here
1
2 Description
2 09.03.2017 12:00 abcd
2 another sample again
2 and again
2
2 Description
2 08.03.2017 12:00 abcd
2 another sample again
2 and again times two
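Basically, there is an id, and the rows hold information in an unstructured format. For each id, I want to extract the description lines that follow the last "Description" row and store them joined in a single row. The resulting DataFrame would look like this: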
id desc
1 this is another sample description which might be continued here or here
2 another sample again and again times two
As far as I can tell, I probably have to use groupby, but I don't know what to do after that.
Answer 0 (score: 1)
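Find the index of the last Description row in each group, then join the lines after it with str.cat. The pos+2 offset skips the Description row and the timestamp line that follows it, so this relies on the default consecutive integer index.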
In [2840]: def lastjoin(x):
      ...:     # cumsum() first reaches its maximum at the last 'Description' row
      ...:     pos = x.desc.eq('Description').cumsum().idxmax()
      ...:     return x.desc.loc[pos+2:].str.cat(sep=' ')
      ...:
In [2841]: df.groupby('id').apply(lastjoin)
Out[2841]:
id
1 this is another sample description which might...
2 another sample again and again times two
dtype: object
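To try this outside the transcript, here is a minimal self-contained sketch that rebuilds the sample data from the question and runs the same lastjoin function. It assumes the blank separator rows are missing values (None/NaN), which the question does not state; str.cat omits missing values when joining, so they do not leave stray spaces. If the blanks were empty strings instead, the first result would end with a trailing space.

```python
import pandas as pd

# Sample data from the question. Assumption: blank rows are missing values.
rows = [
    (1, "Description"),
    (1, "02.09.2017 15:00 abcd"),
    (1, "this is a sample description"),
    (1, "which is continued here also"),
    (1, None),
    (1, "Description"),
    (1, "01.09.2017 12:00 absd"),
    (1, "this is another sample description"),
    (1, "which might be continued here"),
    (1, "or here"),
    (1, None),
    (2, "Description"),
    (2, "09.03.2017 12:00 abcd"),
    (2, "another sample again"),
    (2, "and again"),
    (2, None),
    (2, "Description"),
    (2, "08.03.2017 12:00 abcd"),
    (2, "another sample again"),
    (2, "and again times two"),
]
df = pd.DataFrame(rows, columns=["id", "desc"])  # default 0..19 integer index


def lastjoin(x):
    # The running count of 'Description' rows first hits its maximum at the
    # last marker, so idxmax() returns that row's index label.
    pos = x.desc.eq("Description").cumsum().idxmax()
    # pos + 1 is the timestamp row; the free text starts at pos + 2.
    return x.desc.loc[pos + 2:].str.cat(sep=" ")


result = df.groupby("id").apply(lastjoin)
print(result.loc[1])
print(result.loc[2])
```

Recent pandas releases may emit a deprecation warning about apply operating on the grouping columns; the function only reads desc, so the result should be unaffected.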
To get the result as a DataFrame with id and desc columns, use reset_index:
In [3216]: df.groupby('id').apply(lastjoin).reset_index(name='desc')
Out[3216]:
id desc
0 1 this is another sample description which might...
1 2 another sample again and again times two
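If the index is not the default consecutive integer range, the pos+2 label arithmetic may pick the wrong rows. The following is a hedged variant, not the answer's method: it selects the last section with a boolean mask and skips the marker and timestamp rows by position. The helper name last_description and the trimmed sample data are illustrative, and it assumes every id has at least one Description row, always followed by a timestamp line.

```python
import pandas as pd


def last_description(desc: pd.Series) -> str:
    """Join the text lines that follow the last 'Description' marker."""
    is_marker = desc.eq("Description")
    # Rows in the last section carry the highest running marker count.
    section = is_marker.cumsum()
    last_section = desc[section.eq(section.max())]
    # Skip the marker row and the timestamp row by position, not by label.
    body = last_section.iloc[2:].dropna()
    # Drop blank separator rows stored as empty or whitespace-only strings.
    body = body[body.str.strip().ne("")]
    return " ".join(body)


# Trimmed version of the question's data, with a non-consecutive index
# to show that index labels are not used.
df = pd.DataFrame(
    {
        "id": [1] * 10 + [2] * 4,
        "desc": [
            "Description", "02.09.2017 15:00 abcd",
            "this is a sample description", "",
            "Description", "01.09.2017 12:00 absd",
            "this is another sample description",
            "which might be continued here", "or here", "",
            "Description", "08.03.2017 12:00 abcd",
            "another sample again", "and again times two",
        ],
    },
    index=range(0, 28, 2),
)

result = (
    df.groupby("id")["desc"]
    .apply(last_description)
    .reset_index(name="desc")
)
print(result)
```

Selecting the desc column before apply keeps the helper working on a plain Series and avoids passing the grouping column into the function.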