Question

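I have a DataFrame with a single column. Its rows contain dialogue that often spans several rows. Each speaker's turn ends with the same character sequence, "&,,", as shown below: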

   Words
1  hello world! &,,
2  I woke up this morning and made some eggs.
3  They tasted good. &,,

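I want to merge every row that does not end with "&,," with the row that follows it, so that each row holds one speaker's complete turn instead of a turn being split across several rows. The result should look like this: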

   Words
1  hello world! &,,
2  I woke up this morning and made some eggs. They tasted good. &,,

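Every similar question I have found relies on an additional column that carries extra information, such as who is speaking. This dataset has no such column, and I have no other dataset with more information; all I have is the delimiter.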

Answer 1

You can join the values and then split on the delimiter to rebuild the DataFrame (the split removes the delimiter itself):

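# Concatenate all rows into one string, then split it on the delimiter;
# each piece becomes one row of the new DataFrame.
df = pd.DataFrame(
    ''.join(df.Words.values)
    .split('&,,'), columns=['Words']
)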

                                               Words
0                                      hello world!
1  I woke up this morning and made some eggs.They...
2

If the last row ends with &,,, this leaves an empty string at the end, but those rows are easy to filter out:

df.loc[df.Words.ne('')]

                                               Words
0                                      hello world!
1  I woke up this morning and made some eggs.They...
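
Because the values are joined on an empty string, adjacent lines run together (eggs.They), and the split drops the &,, delimiter. If the space and the delimiter should be kept, as in the desired output from the question, one possible variation is sketched below. It assumes every turn ends with exactly &,,; the names sep, text, and turns are illustrative only and do not come from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Words': ['hello world! &,,',
              'I woke up this morning and made some eggs.',
              'They tasted good. &,,']})

sep = '&,,'

# Join on a space so that consecutive lines do not run together.
text = ' '.join(df['Words'])

# Split on the delimiter, trim whitespace, drop empty pieces,
# and put the delimiter back on each remaining piece.
turns = [part.strip() + ' ' + sep for part in text.split(sep) if part.strip()]

result = pd.DataFrame({'Words': turns})
print(result)
```

Note that this sketch would also append the delimiter to a trailing turn that was never closed, so it is only suitable when the data is known to end with &,,.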

Answer 2

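You can use df['Words'].str.endswith('&,,') to find the rows that end with &,,, then use cumsum to generate the required group numbers (stored in the row column below). Once you have those group numbers, you can use pd.pivot_table to reshape the DataFrame into the desired form: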

import sys
import pandas as pd
pd.options.display.max_colwidth = sys.maxsize

df = pd.DataFrame({
   'Words': ['hello world! &,,',
             'I woke up this morning and made some eggs.',
             'They tasted good. &,,']}, index=[1, 2, 3])

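# Shift the end-of-turn flags down one row so a new group starts after each
# delimiter, then take the cumulative sum to number the groups from 1.
df['row'] = df['Words'].str.endswith('&,,').shift(fill_value=False).cumsum() + 1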
result = pd.pivot_table(df, index='row', values='Words', aggfunc=' '.join)
print(result)

yields

                                                                Words
row                                                                  
1                                                    hello world! &,,
2    I woke up this morning and made some eggs. They tasted good. &,,
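
The same grouping idea can also be expressed with groupby instead of pivot_table. The sketch below is an alternative and not part of the original answer; ends, group, and merged are illustrative names, and it assumes, as above, that a turn is closed by a row ending in &,,.

```python
import pandas as pd

df = pd.DataFrame({
    'Words': ['hello world! &,,',
              'I woke up this morning and made some eggs.',
              'They tasted good. &,,']}, index=[1, 2, 3])

# True for rows that close a speaker's turn.
ends = df['Words'].str.endswith('&,,')

# A new group starts on the row after each closing row.
group = (ends.shift(fill_value=False).cumsum() + 1).rename('row')

# Join the lines within each group with a single space.
merged = df.groupby(group)['Words'].agg(' '.join).reset_index(drop=True)
print(merged.tolist())
```

Inspecting group before aggregating is a quick way to check that the turn boundaries fall where expected.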

Pandas: concatenate strings until a specific character
