I have a DataFrame like this:
import pandas as pd
df = pd.DataFrame({"text_column": ['question: everybody is kongfu fighting', 'panda: of course', 'question: Why is the world so great ?', 'friend: Everybody is smart', 'and everybody is cool', 'enemy: no that is just not true', 'jordan: i want to add one thing: please', 'do not talk about this.', ' 2nd question : are you sure ?', 'messi: yeah sure', 'question: you are sure about this ?', 'donald: youre questions are stupid!']})
text_column
0 question: everybody is kongfu fighting
1 panda: of course
2 question: Why is the world so great ?
3 friend: Everybody is smart
4 and everybody is cool
5 enemy: no that is just not true
6 jordan: i want to add one thing: please
7 do not talk about this.
8 2nd question : are you sure ?
9 messi: yeah sure
10 question: you are sure about this ?
11 donald: youre questions are stupid!
I want the following output:
   type_column                                             new_text_column
0  question: panda:                                        everybody is kongfu fighting of course
1  question: friend: enemy: jordan: 2nd question : messi:  Why is the world so great ? Everybody is smart and everybody is cool no that is just not true i want to add one thing: please do not talk about this. are you sure ? yeah sure
2  question: donald:                                       youre questions are stupid!
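Basically, each question and its answers (one topic) must end up in a single cell. I can write a function that works, but it relies on apply, which is usually not the best solution. Does anyone know how to do this?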
Answer 0 (score: 1)
Define the following functions:
A "special" split of the source text field into two parts:
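import re

def mySplit(txt):
    # Split on the first colon only (an optional space after it is dropped)
    tbl = re.split(': ?', txt, maxsplit=1)
    if len(tbl) == 1:
        # No colon in the line: empty label, whole line is the text
        tbl.insert(0, '')
    return pd.Series(tbl, index=['Qn', 'Ans'])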
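To make the behaviour of the split concrete, here is a small check on three kinds of input taken from the question: a line with one colon, a line with no colon, and a line with two colons. The variable names `with_colon`, `no_colon` and `two_colons` are only illustrative.

```python
import re
import pandas as pd

def mySplit(txt):
    # Split on the first colon only; an optional space after it is consumed.
    tbl = re.split(': ?', txt, maxsplit=1)
    if len(tbl) == 1:
        # No colon: empty label, the whole line becomes the text.
        tbl.insert(0, '')
    return pd.Series(tbl, index=['Qn', 'Ans'])

with_colon = mySplit('panda: of course')
no_colon = mySplit('and everybody is cool')
two_colons = mySplit('jordan: i want to add one thing: please')

print(with_colon.tolist())   # ['panda', 'of course']
print(no_colon.tolist())     # ['', 'and everybody is cool']
print(two_colons.tolist())   # ['jordan', 'i want to add one thing: please']
```

Only the first colon separates the label from the text, so a colon inside the answer itself is left alone.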
Reformat a group of rows:
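def reformat(grp):
    # Join the non-empty labels; rows without a colon have an empty Qn
    t1 = ': '.join(q for q in grp.Qn if q) + ':'
    # Concatenate the text of every row in the group
    t2 = ' '.join(grp.Ans.tolist())
    return pd.Series([t1, t2], index=['type_column', 'new_text_column'])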
Then, to get the result, run:
df2 = df.text_column.apply(mySplit)
df2.groupby(df2.Qn.str.startswith('question').cumsum())\
    .apply(reformat).reset_index(drop=True)
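What it does: mySplit turns each row into a (Qn, Ans) pair; the rows are grouped by the running count of rows whose Qn starts with 'question', so every question opens a new group; reformat collapses each group into a single row; and reset_index(drop=True) renumbers the result from 0.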
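Since the question asks for something that avoids apply, here is one possible sketch built on `Series.str.extract` and `groupby(...).agg`. The regular expression, the helper column name `topic`, and the choice to keep the trailing colon inside each label are my assumptions for illustration, and the data is a shortened copy of the question's sample. Joining strings per group is still a Python-level call, so this is not guaranteed to be faster; it only removes the row-wise apply.

```python
import pandas as pd

df = pd.DataFrame({"text_column": [
    'question: everybody is kongfu fighting',
    'panda: of course',
    'question: Why is the world so great ?',
    'friend: Everybody is smart',
    'and everybody is cool',
    'enemy: no that is just not true',
]})

# Qn = everything up to and including the first colon (NaN when there is none),
# Ans = the rest of the line without leading whitespace.
parts = df["text_column"].str.extract(r'^(?P<Qn>[^:]*:)?\s*(?P<Ans>.*)$')

# Every row whose label starts with "question" opens a new topic.
parts["topic"] = parts["Qn"].fillna("").str.startswith("question").cumsum()

# Join the labels (skipping rows without one) and the texts per topic.
types = parts.dropna(subset=["Qn"]).groupby("topic")["Qn"].agg(" ".join)
texts = parts.groupby("topic")["Ans"].agg(" ".join)

result = pd.concat(
    [types.rename("type_column"), texts.rename("new_text_column")], axis=1
).reset_index(drop=True)

print(result)
```

If the first row does not start with 'question', the rows before the first question end up in a topic numbered 0, which may have no label at all.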
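Answer 1 (score: 0)
It is hard to tell from the example what the splitting criterion is.
I guess it splits on the colon, so you could try a list comprehension: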
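# Split on the first colon only; rows without a colon get an empty type
df["type_column"] = [x.split(":", 1)[0] if ":" in x else "" for x in df["text_column"]]
df["new_text_column"] = [x.split(":", 1)[-1].strip() for x in df["text_column"]]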
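The same colon split can also be sketched with `Series.str.partition`, which never raises on a row without a colon: such a row comes back with the whole text in the first part and empty strings in the other two. The names `s`, `head`, `sep`, `tail` and `has_colon` are illustrative only, and note that this only splits rows; it does not merge them per question.

```python
import pandas as pd

s = pd.Series([
    'panda: of course',
    'and everybody is cool',
    'jordan: i want to add one thing: please',
])

# partition splits at the first ":" into (before, separator, after).
parts = s.str.partition(":")
head, sep, tail = parts[0], parts[1], parts[2]

# The separator part is empty when the row contains no colon.
has_colon = sep.eq(":")

type_column = head.where(has_colon, "")
new_text_column = tail.where(has_colon, head).str.strip()

print(type_column.tolist())      # ['panda', '', 'jordan']
print(new_text_column.tolist())  # ['of course', 'and everybody is cool', 'i want to add one thing: please']
```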