Split a pandas DataFrame by string

Asked: 2015-04-19 08:07:09

Tags: python csv python-3.x pandas

I am new to working with pandas DataFrames. I have data in a .csv file that looks like this:

foo, 1234,
bar, 4567
stuff, 7894
New Entry,,
morestuff,1345

I am reading it into a DataFrame using
 df = pd.read_csv

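However, what I really want is a new DataFrame (or some way of splitting the current one) every time there is a "New Entry" line, with that line itself left out. How can this be done?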

2 Answers:

Answer 0: (score: 1)

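Using your example data, which I concatenated 3 times, I load it (naming the columns 'a', 'b', 'c' for convenience), find the index positions where you have 'New Entry', and build a list of tuples from those positions to mark the beginning and end of each range.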

We can then iterate over this list of tuples, slice the original df, and append each slice to a list:

In [22]:

import io
import pandas as pd

t="""foo,1234,
bar,4567
stuff,7894
New Entry,,
morestuff,1345"""
df = pd.read_csv(io.StringIO(t),header=None,names=['a','b','c'] )
df = pd.concat([df]*3, ignore_index=True)
df
Out[22]:
            a     b   c
0         foo  1234 NaN
1         bar  4567 NaN
2       stuff  7894 NaN
3   New Entry   NaN NaN
4   morestuff  1345 NaN
5         foo  1234 NaN
6         bar  4567 NaN
7       stuff  7894 NaN
8   New Entry   NaN NaN
9   morestuff  1345 NaN
10        foo  1234 NaN
11        bar  4567 NaN
12      stuff  7894 NaN
13  New Entry   NaN NaN
14  morestuff  1345 NaN
In [30]:

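# Index labels of the rows holding the 'New Entry' marker
idx = df[df['a'] == 'New Entry'].index
# The first range runs from the top of the frame to the first marker
idx_list = [(0,idx[0])]
# The remaining ranges run from each marker to the next one
idx_list = idx_list + list(zip(idx, idx[1:]))
idx_list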
Out[30]:
[(0, 3), (3, 8), (8, 13)]
In [31]:

df_list = []
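for i in idx_list:
    print(i)
    if i[0] == 0:
        # The first range starts at row 0, which is data, so keep it
        df_list.append(df[i[0]:i[1]])
    else:
        # Later ranges start on a 'New Entry' row, so skip it with +1
        df_list.append(df[i[0]+1:i[1]])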
df_list
(0, 3)
(3, 8)
(8, 13)
Out[31]:
[       a     b   c
 0    foo  1234 NaN
 1    bar  4567 NaN
 2  stuff  7894 NaN,            a     b   c
 4  morestuff  1345 NaN
 5        foo  1234 NaN
 6        bar  4567 NaN
 7      stuff  7894 NaN,             a     b   c
 9   morestuff  1345 NaN
 10        foo  1234 NaN
 11        bar  4567 NaN
 12      stuff  7894 NaN]
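
Note that idx_list above stops at the last 'New Entry' marker, so the row after it (row 14) does not end up in any slice. As a minimal sketch of one way to keep those trailing rows, the snippet below swaps in a different technique from the one in this answer: it numbers the segments with a running count of the marker rows and groups on that number. The names is_marker and segment_id are illustrative, not part of the original answer.

```python
import io

import pandas as pd

# Same sample data as above, concatenated three times.
t = """foo,1234,
bar,4567
stuff,7894
New Entry,,
morestuff,1345"""
df = pd.read_csv(io.StringIO(t), header=None, names=['a', 'b', 'c'])
df = pd.concat([df] * 3, ignore_index=True)

# True on every marker row (illustrative name).
is_marker = df['a'] == 'New Entry'

# Running count of markers seen so far: all rows between two markers
# share the same segment number (illustrative name).
segment_id = is_marker.cumsum()

# Drop the marker rows, then collect one DataFrame per segment number.
df_list = [part for _, part in df[~is_marker].groupby(segment_id[~is_marker])]
```

With the sample data this gives four pieces instead of three, the last one holding only row 14. Two markers in a row simply produce no piece between them, which may or may not be what you want.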

Answer 1: (score: 1)

1) One way is to do it on the fly: read the file line by line and start a new block whenever you hit a NewEntry line.

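2) The other way, if the DataFrame already exists, is to find the NewEntry rows and slice the DataFrame into several pieces stored in a dict, dff = {}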

df                                                                 
        col1  col2  
0        foo  1234    
1        bar  4567                
2      stuff  7894                                                        
3   NewEntry   NaN                       
4  morestuff  1345 

Find the NewEntry rows, adding [-1] and [len(df.index)] as boundary conditions

import numpy as np

rows = [-1] + np.where(df['col1']=='NewEntry')[0].tolist() + [len(df.index)]
[-1, 3, 5]

Create a dict of DataFrames

dff = {}                                                                            
for i, r in enumerate(rows[:-1]):                                                   
    dff[i] = df[r+1: rows[i+1]]                                                     

The resulting dict of DataFrames, {0: dataframe1, 1: dataframe2}

dff                           
{0:     col1  col2            
 0    foo  1234               
 1    bar  4567               
 2  stuff  7894, 1:         col1  col2  
 4  morestuff  1345}

Dataframe 1

dff[0]              
    col1  col2      
0    foo  1234      
1    bar  4567      
2  stuff  7894      

Dataframe 2

dff[1]              
        col1  col2  
4  morestuff  1345
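
Option 1 above (splitting while the file is being read) is not shown in code, so here is a rough sketch of how it might look. It makes several assumptions: the marker is spelled 'NewEntry' as in this answer, the columns are named col1 and col2, the second field is always an integer, and an io.StringIO object stands in for the real file.

```python
import csv
import io

import pandas as pd

# Stand-in for the .csv file (assumption); with a real file you would
# pass an object opened with open(path, newline='') instead.
raw = io.StringIO("""foo,1234
bar,4567
stuff,7894
NewEntry,
morestuff,1345
""")

frames = []    # finished DataFrames, one per block
current = []   # rows collected for the block currently being read

for row in csv.reader(raw):
    if not row:
        continue  # skip blank lines
    if row[0].strip() == 'NewEntry':
        # Marker line: close the current block and start a new one.
        if current:
            frames.append(pd.DataFrame(current, columns=['col1', 'col2']))
        current = []
    else:
        current.append([row[0].strip(), int(row[1])])

# Flush the rows that follow the last marker.
if current:
    frames.append(pd.DataFrame(current, columns=['col1', 'col2']))
```

Here frames[0] holds foo, bar and stuff, and frames[1] holds morestuff. Unlike the slicing approach, each piece gets a fresh index starting at 0.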