Python: break up dataframe (one row per entry in column, instead of multiple entries in column)

Asked: 2017-06-09 12:56:12

Tags: performance pandas python-3.4

I have a solution to a problem that, to my despair, is rather slow, and I am looking for advice on how to speed it up (through vectorization or other clever methods). I have a dataframe that looks like this:

import pandas as pd

toy = pd.DataFrame([[1,'cv','c,d,e'],[2,'search','a,b,c,d,e'],[3,'cv','d']],
                   columns=['id','ch','kw'])

Output is:

   id      ch         kw
0   1      cv      c,d,e
1   2  search  a,b,c,d,e
2   3      cv          d

The task is to break up the kw column into one (replicated) row per comma-separated entry in each string. What I want to achieve is:

   id      ch kw
0   1      cv  c
1   1      cv  d
2   1      cv  e
3   2  search  a
4   2  search  b
5   2  search  c
6   2  search  d
7   2  search  e
8   3      cv  d

My initial solution is the following:

data = pd.DataFrame()
for x in toy.itertuples():
    id = x.id; ch = x.ch; keys = x.kw.split(",")
    data = data.append([[id, ch, x] for x in keys], ignore_index=True)
data.columns = ['id','ch','kw']

The problem is that it is slow for larger dataframes. I hope someone has encountered a similar problem before and knows how to optimize my solution. I'm using Python 3.4.x and pandas 0.19+, in case that matters.

Thank you!

1 Answer:

Answer 0 (score: 2)

You can use str.split to turn each kw string into a list, then get the length of each list with str.len. Finally, create a new DataFrame with the constructor, repeating id and ch with numpy.repeat and flattening the lists with numpy.concatenate:

import numpy as np

cols = toy.columns
# split each kw string into a list of keywords
splitted = toy['kw'].str.split(',')
# number of keywords per row
l = splitted.str.len()

toy = pd.DataFrame({'id':np.repeat(toy['id'], l),
                    'ch':np.repeat(toy['ch'], l),
                    'kw':np.concatenate(splitted)})
# restore the original column order
toy = toy.reindex(columns=cols)
print (toy)
   id      ch kw
0   1      cv  c
0   1      cv  d
0   1      cv  e
1   2  search  a
1   2  search  b
1   2  search  c
1   2  search  d
1   2  search  e
2   3      cv  d
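For readers on a newer pandas than the asker's 0.19, a shorter variant may be possible. The sketch below is not part of the original answer: it assumes pandas 0.25 or later, where DataFrame.explode is available, and the name `result` is only illustrative. It splits kw into lists and then expands each list element into its own row:

```python
import pandas as pd

# Same toy frame as in the question.
toy = pd.DataFrame([[1, 'cv', 'c,d,e'],
                    [2, 'search', 'a,b,c,d,e'],
                    [3, 'cv', 'd']],
                   columns=['id', 'ch', 'kw'])

# Assumption: pandas >= 0.25, which provides DataFrame.explode.
# 1) replace each comma-separated string with a list of keywords
# 2) explode the lists so every keyword gets its own row
# 3) drop the repeated index labels in favour of a fresh 0..n-1 index
result = (toy.assign(kw=toy['kw'].str.split(','))
             .explode('kw')
             .reset_index(drop=True))

print(result)
```

This variant keeps the column order without a separate reindex step. Whether it is faster than the numpy.repeat approach on a given dataset is something to measure rather than assume.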
