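import pandas as pd

# Sample data; rows without a colon continue the previous speaker's line.
df = pd.DataFrame({"text_column": [
    'question: everybody is kongfu fighting', 'panda: of course',
    'question: Why is the world so great ?', 'friend: Everybody is smart',
    'and everybody is cool', 'enemy: no that is just not true',
    'jordan: i want to add one thing: please', 'do not talk about this.',
    ' 2nd question : are you sure ?', 'messi: yeah sure',
    'question: you are sure about this ?', 'donald: youre questions are stupid!']})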
    text_column
 0  question: everybody is kongfu fighting
 1  panda: of course
 2  question: Why is the world so great ?
 3  friend: Everybody is smart
 4  and everybody is cool
 5  enemy: no that is just not true
 6  jordan: i want to add one thing: please
 7  do not talk about this.
 8  2nd question : are you sure ?
 9  messi: yeah sure
10  question: you are sure about this ?
11  donald: youre questions are stupid!
   type_column                                             new_text_column
0  question: panda:                                        everybody is kongfu fighting of course
1  question: friend: enemy: jordan: 2nd question : messi:  Why is the world so great ? Everybody is smart and everybody is cool no that is just not true i want to add one thing: please do not talk about this. are you sure ? yeah sure
2  question: donald:                                       youre questions are stupid!
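Basically, each question and its answers (one topic) must end up in a single cell. I can write a function that works, but it relies on apply, which is usually not the best solution. Does anyone know how to do this?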
Answer 0 (score: 1)
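import re
import pandas as pd

def mySplit(txt):
    # Split on the first colon only; strip leading whitespace so labels join cleanly.
    tbl = re.split(': ?', txt.strip(), maxsplit=1)
    if len(tbl) == 1:
        # No colon: the row continues the previous speaker, so the label is empty.
        tbl.insert(0, '')
    return pd.Series(tbl, index=['Qn', 'Ans'])

def reformat(grp):
    # Join the non-empty labels and all text fragments of one question group.
    t1 = ': '.join(q for q in grp.Qn if q) + ':'
    t2 = ' '.join(grp.Ans.tolist())
    return pd.Series([t1, t2], index=['type_column', 'new_text_column'])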
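The answer as captured here defines the two helpers but does not show how they are called. One plausible way to wire them together is sketched below. It assumes that a new group starts at every row labelled exactly "question"; the group key (`group_id`) and the small tweaks to the helpers (stripping leading whitespace, skipping empty labels) are assumptions for illustration, not the answerer's confirmed code.

```python
import re
import pandas as pd

df = pd.DataFrame({"text_column": [
    "question: everybody is kongfu fighting", "panda: of course",
    "question: Why is the world so great ?", "friend: Everybody is smart",
    "and everybody is cool", "enemy: no that is just not true",
    "jordan: i want to add one thing: please", "do not talk about this.",
    " 2nd question : are you sure ?", "messi: yeah sure",
    "question: you are sure about this ?", "donald: youre questions are stupid!",
]})

def mySplit(txt):
    # Split on the first colon only; rows without a colon get an empty label.
    tbl = re.split(": ?", txt.strip(), maxsplit=1)
    if len(tbl) == 1:
        tbl.insert(0, "")
    return pd.Series(tbl, index=["Qn", "Ans"])

def reformat(grp):
    # Collapse one question group into a single row.
    t1 = ": ".join(q for q in grp.Qn if q) + ":"
    t2 = " ".join(grp.Ans.tolist())
    return pd.Series([t1, t2], index=["type_column", "new_text_column"])

# Step 1: one (Qn, Ans) pair per input row.
parts = df["text_column"].apply(mySplit)

# Step 2 (assumption): every row labelled exactly "question" opens a new group.
group_id = parts["Qn"].eq("question").cumsum().rename("group")

# Step 3: merge each group into one row.
result = parts.groupby(group_id).apply(reformat).reset_index(drop=True)
print(result)
```

Note that the last group's text contains both of its rows ("you are sure about this ?" and "youre questions are stupid!"), whereas the sample output in the question lists only the second.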
Answer 1 (score: 0)
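# Split on the first colon only; rows without a colon get an empty type.
parts = [x.partition(":") if ":" in x else ("", "", x) for x in df["text_column"]]
df["type_column"] = [p[0].strip() for p in parts]
df["new_text_column"] = [p[2].strip() for p in parts]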
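Splitting each row on its colon does not yet merge rows into one cell per question. A minimal sketch of how that could be done without apply is given below. It assumes that a new group starts at every row whose label is exactly "question" and that rows without a colon continue the previous speaker. The regular expression and the intermediate names (`parts`, `labeled`, `group`) are illustrative choices, not part of either answer.

```python
import pandas as pd

df = pd.DataFrame({"text_column": [
    "question: everybody is kongfu fighting", "panda: of course",
    "question: Why is the world so great ?", "friend: Everybody is smart",
    "and everybody is cool", "enemy: no that is just not true",
    "jordan: i want to add one thing: please", "do not talk about this.",
    " 2nd question : are you sure ?", "messi: yeah sure",
    "question: you are sure about this ?", "donald: youre questions are stupid!",
]})

# Optional "label:" prefix (up to the first colon), then the remaining text.
# Rows without a colon get NaN as their label.
pattern = r"^\s*(?:(?P<label>[^:]*?)\s*:\s*)?(?P<text>.*)$"
parts = df["text_column"].str.extract(pattern)

# Assumption: each row labelled exactly "question" opens a new group.
parts["group"] = parts["label"].eq("question").cumsum()

# Only rows that carry a label contribute to type_column.
labeled = parts.dropna(subset=["label"])

result = pd.DataFrame({
    "type_column": (labeled["label"] + ":").groupby(labeled["group"]).agg(" ".join),
    "new_text_column": parts.groupby("group")["text"].agg(" ".join),
}).reset_index(drop=True)
print(result)
```

Unlike the sample output in the question, this sketch trims the whitespace around each label, so "2nd question :" becomes "2nd question:".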