Question

I have a DataFrame like this:

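import pandas as pd

# Rows match the printed frame below (12 rows, including the 'messi' and 'donald' lines)
df = pd.DataFrame({"text_column": [
    'question: everybody is kongfu fighting', 'panda: of course',
    'question: Why is the world so great ?', 'friend: Everybody is smart',
    'and everybody is cool', 'enemy: no that is just not true',
    'jordan: i want to add one thing: please', 'do not talk about this.',
    ' 2nd question : are you sure ?', 'messi: yeah sure',
    'question: you are sure about this ?', 'donald: youre questions are stupid!']})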

                                text_column
0   question: everybody is kongfu fighting
1   panda: of course
2   question: Why is the world so great ?
3   friend: Everybody is smart
4   and everybody is cool
5   enemy: no that is just not true
6   jordan: i want to add one thing: please
7   do not talk about this.
8    2nd question : are you sure ?
9   messi: yeah sure
10  question: you are sure about this ?
11  donald: youre questions are stupid!

I want the following output:

                 type_column                                     new_text_column
0  question: panda:                                        everybody is kongfu fighting of course
1  question: friend: enemy: jordan: 2nd question : messi:  Why is the world so great ? Everybody is smart and everybody is cool no that is just not true i want to add one thing: please do not talk about this. are you sure ? yeah sure
2  question: donald:                                       youre questions are stupid!

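Basically, each question and its answers (one topic) must end up in a single cell. I could write a function that works, but it would rely on apply, which is usually not the best solution. Does anyone know how to do this?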

Answer 1

Define the following functions:

A "special" split of the source text field into two parts:

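import re
import pandas as pd

def mySplit(txt):
    # Split only on the first colon (plus an optional space)
    tbl = re.split(': ?', txt, maxsplit=1)
    if len(tbl) == 1:
        # No colon: continuation row, so use an empty label
        tbl.insert(0, '')
    return pd.Series(tbl, index=['Qn', 'Ans'])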
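To see what this split does with the three kinds of rows in the question (labelled row, continuation row without a colon, row whose text itself contains a colon), here is a minimal sketch without pandas. The helper name `split_once` is hypothetical and only mirrors the logic of `mySplit`.

```python
import re

# Same pattern as in mySplit: a colon, optionally followed by one space.
PATTERN = r': ?'

def split_once(txt):
    """Return [label, text]; the label is '' when the row has no colon."""
    parts = re.split(PATTERN, txt, maxsplit=1)
    if len(parts) == 1:
        parts.insert(0, '')
    return parts

print(split_once('panda: of course'))
# ['panda', 'of course']
print(split_once('and everybody is cool'))
# ['', 'and everybody is cool']
print(split_once('jordan: i want to add one thing: please'))
# ['jordan', 'i want to add one thing: please']
```

Because the split stops after the first match, a colon inside the answer text stays in the text part.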

Reformat a group of rows:

def reformat(grp):
    t1 = ': '.join(grp.Qn.tolist()) + ':'
    t2 = ' '.join(grp.Ans.tolist())
    return pd.Series([t1, t2], index=['type_column', 'new_text_column'])

Then, to get the result, run:

# df2 holds the split columns; it is needed again to build the grouping key
df2 = df.text_column.apply(mySplit)
df2.groupby(df2.Qn.str.startswith('question').cumsum())\
    .apply(reformat).reset_index(drop=True)

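It does the following:

Splits text_column into two columns (Qn and Ans).
Groups the rows, starting a new group at each row whose Qn starts with 'question'.
Applies reformat to each group.
Resets the index (dropping the old one).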
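Since the question asks about avoiding apply, the same steps can probably also be written with pandas string methods and a grouped aggregation. The following is only a sketch and not part of the original answer: the regular expression, the `topic` key name and the `join_labels` helper are assumptions made for illustration, and unlike `reformat` it skips empty labels. It uses a shortened version of the question's data.

```python
import pandas as pd

df = pd.DataFrame({"text_column": [
    "question: everybody is kongfu fighting",
    "panda: of course",
    "question: Why is the world so great ?",
    "friend: Everybody is smart",
    "and everybody is cool",
    "enemy: no that is just not true",
]})

# Step 1: split each row into an optional label (Qn) and the remaining text (Ans).
# Rows without a colon get NaN for Qn, which is replaced by an empty string.
parts = df["text_column"].str.extract(r"^(?:(?P<Qn>[^:]*): ?)?(?P<Ans>.*)$")
parts["Qn"] = parts["Qn"].fillna("")

# Step 2: the running count of 'question' rows gives one key per topic.
key = parts["Qn"].str.startswith("question").cumsum().rename("topic")

def join_labels(labels):
    # Assumption: continuation rows (empty label) contribute no label.
    return " ".join(f"{lab}:" for lab in labels if lab)

# Step 3: join labels and texts within each topic, then drop the old index.
result = (
    parts.groupby(key)
    .agg(type_column=("Qn", join_labels), new_text_column=("Ans", " ".join))
    .reset_index(drop=True)
)
print(result)
```

The grouping key is the important part: every row whose label starts with 'question' bumps the counter, so all following rows share that number until the next question appears.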

Answer 2

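It is hard to tell from the example what the criteria for the separation are.

My guess is that it splits on the colon, so you could try a list comprehension: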

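# Split on the first colon only; rows without a colon get an empty type
parts = [x.split(":", 1) for x in df["text_column"]]
df["type_column"] = [p[0] if len(p) == 2 else "" for p in parts]
df["new_text_column"] = [p[-1] for p in parts]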

Combine pandas text rows based on a condition
