Question

I have a solution to a problem that, to my despair, is somewhat slow, and I am looking for advice on how to speed it up (through vectorization or other clever methods). I have a dataframe that looks like this:

import pandas as pd

toy = pd.DataFrame([[1,'cv','c,d,e'],[2,'search','a,b,c,d,e'],[3,'cv','d']],
                   columns=['id','ch','kw'])

Output is:

   id      ch         kw
0   1      cv      c,d,e
1   2  search  a,b,c,d,e
2   3      cv          d

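The task is to break up the kw column into one (replicated) row per comma-separated entry in each string. Thus, what I wish to achieve is:

   id      ch kw
0   1      cv  c
1   1      cv  d
2   1      cv  e
3   2  search  a
4   2  search  b
5   2  search  c
6   2  search  d
7   2  search  e
8   3      cv  d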

My initial solution is the following:

data = pd.DataFrame()
for row in toy.itertuples():
    keys = row.kw.split(",")
    # append one [id, ch, keyword] row per keyword; append returns a new frame on every call
    data = data.append([[row.id, row.ch, key] for key in keys], ignore_index=True)
data.columns = ['id','ch','kw']

The problem is that it is slow for larger dataframes. My hope is that someone has encountered a similar problem before and knows how to optimize my solution. I'm using Python 3.4.x and pandas 0.19+, if that matters.

Thank you!

Answer 1

You can split kw into lists with str.split and get each list's length with str.len, then build the new frame with the DataFrame constructor, repeating id and ch with numpy.repeat and flattening the lists with numpy.concatenate:

import numpy as np
import pandas as pd

cols = toy.columns
splitted = toy['kw'].str.split(',')
l = splitted.str.len()

toy = pd.DataFrame({'id': np.repeat(toy['id'], l),
                    'ch': np.repeat(toy['ch'], l),
                    'kw': np.concatenate(splitted.values)})
# restore the original column order (the original reindex_axis call was removed in pandas 1.0)
toy = toy.reindex(columns=cols)
print (toy)

   id      ch kw
0   1      cv  c
0   1      cv  d
0   1      cv  e
1   2  search  a
1   2  search  b
1   2  search  c
1   2  search  d
1   2  search  e
2   3      cv  d
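
As an alternative to repeating id and ch by hand, newer pandas versions offer DataFrame.explode, which does the row replication directly. The sketch below assumes pandas 0.25 or later, so it will not run on the 0.19 mentioned in the question; it reuses the toy frame from the question, and the name result is only illustrative.

```python
import pandas as pd

toy = pd.DataFrame([[1, 'cv', 'c,d,e'], [2, 'search', 'a,b,c,d,e'], [3, 'cv', 'd']],
                   columns=['id', 'ch', 'kw'])

# Split each kw string into a list, then emit one row per list element.
# Assumption: pandas >= 0.25, where DataFrame.explode is available.
result = (toy.assign(kw=toy['kw'].str.split(','))
             .explode('kw')
             .reset_index(drop=True))  # replace the repeated 0,0,0,1,... index with 0..8

print(result)
```

Dropping the reset_index call keeps the repeated original index, which matches the index shown in the output of the np.repeat version.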


Python: break up dataframe (one row per entry in column, instead of multiple entries in column)
