My solution to the problem below works, but it is frustratingly slow, and I am looking for advice on how to speed it up (by vectorizing it or by other clever methods). I have a dataframe that looks like this:
import pandas as pd

toy = pd.DataFrame([[1, 'cv', 'c,d,e'], [2, 'search', 'a,b,c,d,e'], [3, 'cv', 'd']],
                   columns=['id', 'ch', 'kw'])
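Output is:
   id      ch         kw
0   1      cv      c,d,e
1   2  search  a,b,c,d,e
2   3      cv          d
The task is to break up the kw column into one (replicated) row per comma-separated entry in each string. Thus, what I wish to achieve is a frame with one row per keyword, with id and ch repeated: for example, the first row (1, cv, 'c,d,e') becomes the three rows (1, cv, c), (1, cv, d) and (1, cv, e).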
My initial solution is the following:
data = pd.DataFrame()
for row in toy.itertuples():
    keys = row.kw.split(",")
    # DataFrame.append copies the whole frame on every call (and was removed in pandas 2.0)
    data = data.append([[row.id, row.ch, key] for key in keys], ignore_index=True)
data.columns = ['id', 'ch', 'kw']
The problem is that it is slow for larger dataframes. My hope is that someone has encountered a similar problem before and knows how to optimize my solution. I'm using Python 3.4.x and pandas 0.19+, if that matters.
Thank you!
Answer 0 (score: 2)
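You can split kw into lists with str.split and get the length of each list with str.len. Then build the result with the DataFrame constructor, repeating id and ch by those lengths with np.repeat and flattening the lists with np.concatenate. Finally, reindex the columns to restore their original order.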
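To make the intermediate values concrete, here is a minimal sketch of the building blocks on the toy frame from the question. It is only an illustration, not part of the original answer: it works on plain NumPy arrays (via .values and .tolist()), so no index is carried along, and the names lengths, ids and keywords are chosen here for readability.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame([[1, 'cv', 'c,d,e'], [2, 'search', 'a,b,c,d,e'], [3, 'cv', 'd']],
                   columns=['id', 'ch', 'kw'])

# Each kw string becomes a Python list of keywords.
splitted = toy['kw'].str.split(',')

# Number of keywords per row; these are the repeat counts: [3, 5, 1]
lengths = splitted.str.len().values

# Repeat each id once per keyword: [1, 1, 1, 2, 2, 2, 2, 2, 3]
ids = np.repeat(toy['id'].values, lengths)

# Flatten the per-row lists into one long array of keywords.
keywords = np.concatenate(splitted.tolist())

print(lengths, ids, keywords)
```

Because ids and keywords have the same length and the same row order, they can be placed side by side as columns of the new frame.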
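import numpy as np

cols = toy.columns
# split each kw string into a list of keywords
splitted = toy['kw'].str.split(',')
# number of keywords per row, used as repeat counts
l = splitted.str.len()
toy = pd.DataFrame({'id': np.repeat(toy['id'], l),
                    'ch': np.repeat(toy['ch'], l),
                    'kw': np.concatenate(splitted)})
# restore the original column order (reindex_axis was removed in pandas 1.0)
toy = toy.reindex(columns=cols)
print(toy)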
   id      ch kw
0   1      cv  c
0   1      cv  d
0   1      cv  e
1   2  search  a
1   2  search  b
1   2  search  c
1   2  search  d
1   2  search  e
2   3      cv  d
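On newer pandas versions the same reshaping can probably be written more compactly. The sketch below assumes pandas 0.25 or later, where DataFrame.explode was introduced, so it would not run on the 0.19 mentioned in the question. The name result and the final reset_index step are choices made here, not part of the answer above.

```python
import pandas as pd

toy = pd.DataFrame([[1, 'cv', 'c,d,e'], [2, 'search', 'a,b,c,d,e'], [3, 'cv', 'd']],
                   columns=['id', 'ch', 'kw'])

# Assumption: pandas >= 0.25 (DataFrame.explode is not available before that).
result = (
    toy.assign(kw=toy['kw'].str.split(','))  # kw: string -> list of keywords
       .explode('kw')                        # one row per list element
       .reset_index(drop=True)               # replace the repeated index with 0..n-1
)
print(result)
```

Dropping the reset_index call keeps the repeated index (0, 0, 0, 1, ...) seen in the output above, which can be handy for tracing each row back to its source row.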