Question

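I have a large csv file that I would like to group into rows. It has about one million lines, which will be grouped into 10,000 rows.

Each line of the file is either a comment or starts with a number, followed by a colon, followed by a value that may itself contain more colons.

Every line starting with 0: marks the start of a new group, and the last line of the file is also 0:

Example: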

# comment line
# comment line
0:
1:HELLO
2:WORLD
3:1.0
4:5.0
5:TRUE
0:
2:HEY
6:1
7:12
# COMMENT LINE
0: 
1: FILE
3: 2.0
10: http://www.google.com
0:

I am reading the file into a DataFrame like this. (The separator is not perfect, but it works for the data I have.)

import pandas as pd

# Raw string so that \d reaches the regex engine unescaped
df = pd.read_csv(FILENAME,
                 sep=r'(?<=\d):',
                 comment='#',
                 names=['col', 'val'],
                 engine='python')

This results in

    col val
0   0   
1   1   HELLO
2   2   WORLD
3   3   1.0
4   4   5.0
5   5   TRUE
6   0   
7   2   HEY
8   6   1
9   7   12
10  0   
11  1    FILE
12  3    2.0
13  10   http://www.google.com
14  0

This should be converted into

pd.DataFrame([
    {1: "HELLO", 2: "WORLD", 3: 1.0, 4: 5.0, 5: "TRUE"},
    {2: "HEY", 6: 1, 7: 12},
    {1: "FILE", 3: 2.0, 10: "http://www.google.com"}
])

which looks like this

       1      2    3    4     5    6     7                     10
0  HELLO  WORLD  1.0  5.0  TRUE
1           HEY                  1.0  12.0
2   FILE         2.0                        http://www.google.com

Any hints on how to do the grouping?

Also, can I use the read_csv C engine to split each line on the first colon only, to speed things up?

Answer 1

After reading your csv data try the following to get the desired output:

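import numpy as np
import pandas as pd

# Split the row index wherever col == 0, turn each chunk into one row, then stack the rows
new = pd.concat([df.loc[i].set_index('col').T for i in np.split(df.index, np.where(df.col==0)[0])[1:]]).reset_index()
new.columns = new.columns.rename('')
del new['index']
print(new)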

Output:

    0   1       2       3     4     5    6      7    10
0   NaN HELLO   WORLD   1.0   5.0   TRUE NaN    NaN  NaN
1   NaN NaN     HEY     NaN   NaN   NaN  1      12   NaN
2   NaN FILE    NaN     2.0   NaN   NaN  NaN    NaN  http://www.google.com
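
To see what the np.where / np.split step is doing, here is a small sketch on a plain array. The col values are just copied from the sample in the question; nothing else is assumed.

```python
import numpy as np

# 'col' values copied from the question's sample (0 marks a group start)
col = np.array([0, 1, 2, 3, 4, 5, 0, 2, 6, 7, 0, 1, 3, 10, 0])

starts = np.where(col == 0)[0]                   # positions of the 0 rows
chunks = np.split(np.arange(len(col)), starts)   # cut the positions at every start
chunks = chunks[1:]                              # the chunk before the first 0 is empty

for chunk in chunks:
    print(chunk.tolist())
```

Each chunk begins at a 0 row and runs up to the next one. Note that if the file ends with a closing 0: line, the last chunk holds only that marker row, so you may need to drop it (or drop the all-NaN row it produces), depending on your data.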

Update: this might be marginally faster, since it removes the need for .loc:

pd.concat([i.T for i in np.split(df.set_index('col'), np.where(df.col == 0)[0])[1:]]).reset_index()
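
On the question of splitting on the first colon only: one option, which is a different technique from the read_csv approach above, is to skip read_csv and build the records in plain Python with str.partition. The sketch below is only that, a sketch. parse_groups is a hypothetical helper name, it assumes every non-comment line starts with an integer key, and it leaves all values as strings (no type inference). I have not benchmarked it against the regex separator, so whether it is actually faster on a million lines would need measuring.

```python
import io
import pandas as pd

def parse_groups(lines):
    """Build one record per 0:-delimited group (hypothetical helper)."""
    records, current = [], None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                          # skip blank and comment lines
        key, _, value = line.partition(":")   # split on the first colon only
        if key == "0":
            if current:                       # close the previous non-empty group
                records.append(current)
            current = {}
        elif current is not None:
            current[int(key)] = value.strip() # assumes the key is an integer
    if current:                               # file did not end with a 0: line
        records.append(current)
    return pd.DataFrame(records)

# Sample data copied from the question
sample = io.StringIO(
    "# comment line\n# comment line\n0:\n1:HELLO\n2:WORLD\n3:1.0\n4:5.0\n5:TRUE\n"
    "0:\n2:HEY\n6:1\n7:12\n# COMMENT LINE\n0: \n1: FILE\n3: 2.0\n"
    "10: http://www.google.com\n0:\n"
)
result = parse_groups(sample)
print(result)
```

For a real file you would pass the open file handle instead of the StringIO, e.g. with open(FILENAME) as fh: result = parse_groups(fh). Empty groups (two 0: lines in a row, or the closing 0:) are simply skipped here.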
