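I have a very large CSV file whose lines I want to group into rows. It has about a million lines, which should be grouped into 10,000 rows.
Each line of the file is either a comment or starts with a number, followed by a colon, followed by a value that may itself contain more colons.
Every line that starts with 0: marks the beginning of a new group, and the last line of the file is also 0:
Example: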
# comment line
# comment line
0:
1:HELLO
2:WORLD
3:1.0
4:5.0
5:TRUE
0:
2:HEY
6:1
7:12
# COMMENT LINE
0:
1: FILE
3: 2.0
10: http://www.google.com
0:
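I am reading the file into a DataFrame like this (the separator is not perfect, but it works for the data I have):
import pandas as pd

df = pd.read_csv(FILENAME,
                 sep=r'(?<=\d):',   # raw string: split on a colon that follows a digit
                 comment='#',
                 names=['col', 'val'],
                 engine='python')
This results in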
col val
0 0
1 1 HELLO
2 2 WORLD
3 3 1.0
4 4 5.0
5 5 TRUE
6 0
7 2 HEY
8 6 1
9 7 12
10 0
11 1 FILE
12 3 2.0
13 10 http://www.google.com
14 0
This should be converted to
pd.DataFrame([
{1: "HELLO", 2: "WORLD", 3: 1.0, 4: 5.0, 5: "TRUE"},
{2: "HEY", 6: 1, 7: 12},
{1: "FILE", 3: 2.0, 10: "http://www.google.com"}
])
which looks like this:
       1      2    3    4     5    6     7                     10
0  HELLO  WORLD  1.0  5.0  TRUE
1           HEY                  1.0  12.0
2   FILE         2.0                        http://www.google.com
Any hints on how to do the grouping?
Can I use the read_csv C engine to split each line on the first colon only, to speed things up?
Answer 0: (score: 1)
After reading your csv data try the following to get the desired output:
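import numpy as np

# Split the row index at every 0 marker, turn each chunk into a single row, then stack the rows.
new = pd.concat([df.loc[i].set_index('col').T for i in np.split(df.index, np.where(df.col==0)[0])[1:]]).reset_index()
new.columns = new.columns.rename('')
del new['index']
print(new)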
Output:
0 1 2 3 4 5 6 7 10
0 NaN HELLO WORLD 1.0 5.0 TRUE NaN NaN NaN
1 NaN NaN HEY NaN NaN NaN 1 12 NaN
2 NaN FILE NaN 2.0 NaN NaN NaN NaN http://www.google.com
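A possible alternative, offered only as a sketch: label each row with a running count of the 0 markers and pivot on that label. It assumes that no key repeats inside a group (pivot raises an error otherwise). The frame is rebuilt by hand here so the snippet runs on its own, and the names labelled and wide are made up for illustration.

```python
import pandas as pd

# Hand-built copy of the frame from the question (values kept as strings).
df = pd.DataFrame({
    'col': [0, 1, 2, 3, 4, 5, 0, 2, 6, 7, 0, 1, 3, 10, 0],
    'val': [None, 'HELLO', 'WORLD', '1.0', '5.0', 'TRUE',
            None, 'HEY', '1', '12',
            None, 'FILE', '2.0', 'http://www.google.com', None],
})

# Every 0 row opens a new group, so a running count of 0 rows labels each row.
labelled = df.assign(group=df['col'].eq(0).cumsum() - 1)

# Drop the marker rows, then spread the col/val pairs across columns,
# one output row per group. Assumes (group, col) pairs are unique.
wide = (labelled[labelled['col'] != 0]
        .pivot(index='group', columns='col', values='val'))
wide.columns.name = None
wide.index.name = None

print(wide)
```

Because the marker rows are dropped before pivoting, this version produces no 0 column, and the trailing 0: line does not create an extra row.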
Update: The concat one-liner above might be marginally faster without the .loc lookup:
pd.concat([i.T for i in np.split(df.set_index('col'), np.where(df.col == 0)[0])[1:]]).reset_index()
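On the question of splitting at the first colon only: one option that sidesteps read_csv altogether is to parse the lines in plain Python with str.partition, which splits on the first occurrence of the separator, and build the list of dicts shown in the question. This is a minimal sketch, not a benchmarked solution. The helper name parse_groups and the shortened SAMPLE text are made up for illustration, and it assumes every non-comment line starts with an integer key.

```python
import io
import pandas as pd

# Shortened stand-in for the real file.
SAMPLE = """\
# comment line
0:
1:HELLO
2:WORLD
3:1.0
0:
2:HEY
6:1
# COMMENT LINE
0:
1: FILE
10: http://www.google.com
0:
"""

def parse_groups(lines):
    """Return one dict per group; a '0:' line starts a new group."""
    records = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        # partition splits on the FIRST colon only, so URLs stay intact.
        key, _, value = line.partition(':')
        if int(key) == 0:
            if current:  # keep only non-empty groups
                records.append(current)
            current = {}
        elif current is not None:
            current[int(key)] = value.strip()
    if current:
        records.append(current)
    return records

records = parse_groups(io.StringIO(SAMPLE))
result = pd.DataFrame(records)
print(result)
```

For a real file, passing an open file handle instead of io.StringIO(SAMPLE) should work the same way. All values stay strings in this sketch, so numeric columns would still need converting afterwards, for example with pd.to_numeric.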