Combine pandas text rows based on a condition

Time: 2019-12-11 20:57:19

Tags: python-3.x string pandas text

I have a df like this:

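import pandas as pd

# Rows 9-11 are taken from the printed frame below so that the code and the display agree
df = pd.DataFrame({"text_column": [
    'question: everybody is kongfu fighting', 'panda: of course',
    'question: Why is the world so great ?', 'friend: Everybody is smart',
    'and everybody is cool', 'enemy: no that is just not true',
    'jordan: i want to add one thing: please', 'do not talk about this.',
    ' 2nd question : are you sure ?', 'messi: yeah sure',
    'question: you are sure about this ?', 'donald: youre questions are stupid!']})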

                                text_column
0   question: everybody is kongfu fighting
1   panda: of course
2   question: Why is the world so great ?
3   friend: Everybody is smart
4   and everybody is cool
5   enemy: no that is just not true
6   jordan: i want to add one thing: please
7   do not talk about this.
8    2nd question : are you sure ?
9   messi: yeah sure
10  question: you are sure about this ?
11  donald: youre questions are stupid!

I want the following output:

   type_column                                             new_text_column
0  question: panda:                                        everybody is kongfu fighting of course
1  question: friend: enemy: jordan: 2nd question : messi:  Why is the world so great ? Everybody is smart and everybody is cool no that is just not true i want to add one thing: please do not talk about this. are you sure ? yeah sure
2  question: donald:                                       youre questions are stupid!

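Basically, each question and its answers (one topic) must end up in a single cell. I could write a function that works, but it would use apply, which is usually not the best solution. Does anyone know how to do this?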

2 answers:

Answer 0: (score: 1)

Define the following functions:

  1. A "special" split of the source text field into two parts:

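    import re

    def mySplit(txt):
        # Split on the first colon only (plus an optional following space)
        tbl = re.split(': ?', txt, maxsplit=1)
        if len(tbl) == 1:
            # No colon: a continuation line, so give it an empty label
            tbl.insert(0, '')
        return pd.Series(tbl, index=['Qn', 'Ans'])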
    
  2. Reformatting of a group of rows:

    def reformat(grp):
        # Join only the non-empty labels; continuation rows have an empty Qn
        t1 = ': '.join(q for q in grp.Qn if q) + ':'
        t2 = ' '.join(grp.Ans.tolist())
        return pd.Series([t1, t2], index=['type_column', 'new_text_column'])
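
To see how the split in step 1 treats different kinds of rows, here is a minimal sketch. The helper name `split_label` is hypothetical and returns a plain tuple instead of a `pd.Series`, purely to make the result easy to inspect; the regular expression is the same one used in `mySplit`.

```python
import re

def split_label(txt):
    # Same splitting rule as mySplit: cut at the first colon only
    parts = re.split(r': ?', txt, maxsplit=1)
    if len(parts) == 1:
        # No colon found: treat the row as a continuation with an empty label
        parts.insert(0, '')
    return tuple(parts)

labelled = split_label('panda: of course')
continuation = split_label('and everybody is cool')
nested = split_label('jordan: i want to add one thing: please')

print(labelled)      # label and text
print(continuation)  # empty label, whole row as text
print(nested)        # the second colon stays inside the text
```

Because `maxsplit=1` stops after the first colon, a colon that appears later in the sentence is kept as part of the text.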
    

Then run the following to get the result:

# df2 holds the split columns; the grouping key below is built from it
df2 = df.text_column.apply(mySplit)
df2.groupby(df2.Qn.str.startswith('question').cumsum())\
    .apply(reformat).reset_index(drop=True)

This does the following:

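  • Splits text_column into two columns ( Qn and Ans ).
  • Groups the rows, starting a new group at each row whose Qn starts with "question".
  • Applies reformat to each group.
  • Resets the index (dropping the old one).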
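
Since the question asks for a way around apply, the same steps can also be sketched with vectorized string methods and a grouped aggregation. This is only one possible variant, not the answer's own code: the names `parts`, `label`, `text`, `group_id` and `result` are made up for illustration, and it assumes the frame contains all twelve rows shown in the question's printed output.

```python
import pandas as pd

df = pd.DataFrame({"text_column": [
    'question: everybody is kongfu fighting', 'panda: of course',
    'question: Why is the world so great ?', 'friend: Everybody is smart',
    'and everybody is cool', 'enemy: no that is just not true',
    'jordan: i want to add one thing: please', 'do not talk about this.',
    ' 2nd question : are you sure ?', 'messi: yeah sure',
    'question: you are sure about this ?', 'donald: youre questions are stupid!',
]})

# Split on the first colon only; rows without a colon get a missing value in column 1
parts = df["text_column"].str.split(":", n=1, expand=True)
has_label = parts[1].notna()

# Label: text before the colon (empty for continuation rows)
# Text: text after the colon, or the whole row when there is no colon
label = parts[0].where(has_label, "").str.strip()
text = parts[1].where(has_label, parts[0]).str.strip()

# A new group starts at every row whose label begins with "question"
group_id = label.str.startswith("question").cumsum()

# Aggregate labels (skipping empty ones) and texts per group
nonempty = label != ""
result = pd.DataFrame({
    "type_column": (label[nonempty] + ":").groupby(group_id[nonempty]).agg(" ".join),
    "new_text_column": text.groupby(group_id).agg(" ".join),
}).reset_index(drop=True)

print(result)
```

For the last topic this produces both the question text and the reply, whereas the expected output in the question lists only the reply; it is not clear from the post which of the two is intended.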

Answer 1: (score: 0)

It is hard to tell from the example what the splitting criterion is.

My guess is that it splits on the colon, so you could try a list comprehension:

# Assumes every row contains a colon; x.split(":")[1] raises IndexError otherwise
df["type_column"] = [x.split(":")[0] for x in df["text_column"]]
df["new_text_column"] = [x.split(":")[1] for x in df["text_column"]]
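
Some rows in the question have no colon, and one has two, so indexing the result of `split(":")` either fails or drops part of the text. A hedged sketch of a more tolerant split uses `str.partition`, which always returns three pieces. The three-row frame and the name `pieces` are assumptions chosen for illustration, and the sketch only covers the splitting, not the grouping by question.

```python
import pandas as pd

df = pd.DataFrame({"text_column": [
    'panda: of course',
    'and everybody is cool',
    'jordan: i want to add one thing: please',
]})

# partition returns (before, separator, after); separator is "" when no colon exists
pieces = [x.partition(":") for x in df["text_column"]]

# With a colon: label is the part before it. Without: leave the label empty.
df["type_column"] = [head.strip() if sep else "" for head, sep, _ in pieces]

# With a colon: text is everything after the first colon. Without: the whole row.
df["new_text_column"] = [
    tail.strip() if sep else head.strip() for head, sep, tail in pieces
]

print(df)
```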