Question

I have a pandas DataFrame like this:

id   desc
1    Description
1    02.09.2017 15:00 abcd
1    this is a sample description
1    which is continued here also
1    
1    Description
1    01.09.2017 12:00 absd
1    this is another sample description
1    which might be continued here
1    or here
1
2    Description
2    09.03.2017 12:00 abcd
2    another sample again
2    and again
2
2    Description
2    08.03.2017 12:00 abcd
2    another sample again
2    and again times two

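Basically, there is an id, and the rows hold information in an unstructured format. For each id, I want to extract the description text that follows the last "Description" row and store it in a single row. The resulting DataFrame should look like this: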

id  desc
1   this is another sample description which might be continued here or here
2   another sample again and again times two

As far as I can tell, I will probably have to use groupby, but I don't know what to do after that.

Answer 1

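Find the position of the last Description row in each group, then join the rows that follow it with str.cat. The +2 offset skips the Description row itself and the timestamp row right after it.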

In [2840]: def lastjoin(x):
      ...:     pos = x.desc.eq('Description').cumsum().idxmax()
      ...:     return x.desc.loc[pos+2:].str.cat(sep=' ')
      ...:

In [2841]: df.groupby('id').apply(lastjoin)
Out[2841]:
id
1    this is another sample description which might...
2            another sample again and again times two
dtype: object
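To see why `cumsum().idxmax()` finds the last Description row, here is a minimal sketch on a small made-up Series (the sample values and variable names are illustrative, not from the question). The running count increases by one at every Description row, and `idxmax` returns the first label where that count reaches its maximum, which is the last Description row.

```python
import pandas as pd

# Hypothetical toy data with a default integer index (0, 1, 2, ...)
desc = pd.Series([
    "Description", "02.09.2017 15:00 abcd", "first text", "more first text",
    "Description", "01.09.2017 12:00 absd", "second text", "more text",
])

# True on every "Description" row; the running sum counts how many were seen so far
running = desc.eq("Description").cumsum()

# idxmax returns the first label holding the maximum count,
# i.e. the label of the last "Description" row
pos = running.idxmax()

# Skip the "Description" row and the timestamp row after it
tail = desc.loc[pos + 2:]
print(tail.str.cat(sep=" "))
```

Note that `loc[pos + 2:]` is label-based, so this relies on the index being consecutive integers.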

To get the result as a column, use reset_index:

In [3216]: df.groupby('id').apply(lastjoin).reset_index(name='desc')
Out[3216]:
   id                                               desc
0   1  this is another sample description which might...
1   2          another sample again and again times two
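For anyone who wants to reproduce this, the following is a self-contained sketch that rebuilds the sample data from the question. It makes two assumptions: the blank rows are stored as missing values (`None`), and the frame has a default consecutive integer index. It is also a slight variant of the answer: it selects the `desc` column before `apply`, so the function receives a Series rather than the whole group, and the function name `last_description` is only illustrative.

```python
import pandas as pd

# Sample data from the question; blank desc rows are assumed to be missing (None)
rows = [
    (1, "Description"), (1, "02.09.2017 15:00 abcd"),
    (1, "this is a sample description"), (1, "which is continued here also"),
    (1, None),
    (1, "Description"), (1, "01.09.2017 12:00 absd"),
    (1, "this is another sample description"),
    (1, "which might be continued here"), (1, "or here"),
    (1, None),
    (2, "Description"), (2, "09.03.2017 12:00 abcd"),
    (2, "another sample again"), (2, "and again"),
    (2, None),
    (2, "Description"), (2, "08.03.2017 12:00 abcd"),
    (2, "another sample again"), (2, "and again times two"),
]
df = pd.DataFrame(rows, columns=["id", "desc"])


def last_description(s):
    # Label of the last "Description" row within this group
    pos = s.eq("Description").cumsum().idxmax()
    # Skip the marker row and the timestamp row; str.cat omits missing values
    return s.loc[pos + 2:].str.cat(sep=" ")


result = df.groupby("id")["desc"].apply(last_description).reset_index(name="desc")
print(result)
```

If the blank rows were empty strings instead of missing values, `str.cat` would keep them and could leave a trailing space, so stripping the result might be needed in that case.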
