Using pandas and regex to extract a sequence of values from one column of a CSV file

Time: 2019-01-15 11:47:18

Tags: python regex pandas csv

I have a CSV file like this:

step,"agent, para1, para2 , para3 , para4, para5"
0,[[0 4 0 1.0645870290796624 7 0.23158113372309874]]
1,[[0 4 1 1.0645870290796624 7 0.23158113372309874]]
2,[[1 4 2 1.0645870290796624 7 0.23158113372309874] [0 4 2 1.0645870290796624 7 0.23158113372309874]]
3,[[0 4 3 1.0645870290796624 7 0.23158113372309874] [1 4 3 1.0645870290796624 7 0.23158113372309874]]
4,[[1 4 4 1.0645870290796624 7 0.23158113372309874] [0 4 4 1.0645870290796624 7 0.23158113372309874]]
5,[[1 4 5 1.0645870290796624 7 0.23158113372309874] [0 4 5 1.0645870290796624 7 0.23158113372309874]]
6,[[0 4 6 1.0645870290796624 7 0.23158113372309874] [1 4 6 1.0645870290796624 7 0.23158113372309874]]
7,[[0 4 7 1.0645870290796624 7 0.23158113372309874] [1 4 7 1.0645870290796624 7 0.23158113372309874]]
8,[[0 4 8 1.0645870290796624 7 0.23158113372309874] [1 4 8 1.0645870290796624 7 0.23158113372309874]]
9,[[0 4 9 1.0645870290796624 7 0.23158113372309874] [1 4 9 1.0645870290796624 7 0.23158113372309874]]
10,[[2 4 10 1.0645870290796624 7 0.23158113372309874] [3 4 10 1.0645870290796624 7 0.23158113372309874] [0 4 10 1.0645870290796624 7 0.23158113372309874] [1 4 10 1.0645870290796624 7 0.23158113372309874]]

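I want to extract a sequence of values from the "agent, para1, para2 , para3 , para4, para5" column and write the result to a new CSV file, so that the column contains only a single record (the 5 int and float parameters) that starts with a specific number. For example, this is the output I expect for records starting with 0: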

,step,"agent, para1, para2 , para3 , para4, para5"
0,0,0 4 0 1.0645870290796624 7 0.23158113372309874
1,1,0 4 1 1.0645870290796624 7 0.23158113372309874
2,2,0 4 2 1.0645870290796624 7 0.23158113372309874
3,3,0 4 3 1.0645870290796624 7 0.23158113372309874
4,4,0 4 4 1.0645870290796624 7 0.23158113372309874
5,5,0 4 5 1.0645870290796624 7 0.23158113372309874
6,6,0 4 6 1.0645870290796624 7 0.23158113372309874
7,7,0 4 7 1.0645870290796624 7 0.23158113372309874
8,8,0 4 8 1.0645870290796624 7 0.23158113372309874
9,9,0 4 9 1.0645870290796624 7 0.23158113372309874

This is the code I am using:

import pandas as pd
import numpy as np

df = pd.read_csv('input.csv')
col = 'agent, para1, para2 , para3 , para4, para5'
# Keep only the record that starts with agent 0 in each row
df[col] = df[col].str.extract(r'(0\s\d\s\d\s\d\.\d+\s\d\s\d\.\d+)', expand=False)

df.to_csv('input-modified.csv')

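The problem is that input-modified.csv, as shown above, contains only 10 rows of data, whereas input.csv is about 1 GB. How can I improve the regular expression so that it extracts the data from the whole file?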
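One likely cause, judging only from the sample rows above: each `\d` in the pattern matches exactly one digit, so the match fails as soon as the third field reaches 10. A minimal sketch of a more tolerant pattern follows. It uses the standard `re` module on two sample rows; the names `OLD`, `NEW`, and `extract_agent0` are hypothetical, and the sketch assumes every record has the bracketed "int int int float int float" layout shown in the question.

```python
import re

# Pattern from the question: every \d matches a single digit only.
OLD = re.compile(r'(0\s\d\s\d\s\d\.\d+\s\d\s\d\.\d+)')

# Assumed fix: allow multi-digit integers with \d+, and anchor the record on
# its surrounding brackets so the leading 0 cannot be the tail of "10".
NEW = re.compile(r'\[(0\s\d+\s\d+\s\d+\.\d+\s\d+\s\d+\.\d+)\]')


def extract_agent0(cell):
    """Return the record of agent 0 from one cell, or None if absent."""
    match = NEW.search(cell)
    return match.group(1) if match else None


# Two cells copied from the sample data (steps 9 and 10).
row9 = ('[[0 4 9 1.0645870290796624 7 0.23158113372309874] '
        '[1 4 9 1.0645870290796624 7 0.23158113372309874]]')
row10 = ('[[2 4 10 1.0645870290796624 7 0.23158113372309874] '
         '[3 4 10 1.0645870290796624 7 0.23158113372309874] '
         '[0 4 10 1.0645870290796624 7 0.23158113372309874] '
         '[1 4 10 1.0645870290796624 7 0.23158113372309874]]')

print(OLD.search(row10))      # None: the old pattern stops matching at step 10
print(extract_agent0(row9))
print(extract_agent0(row10))
```

Because the pattern has a single capture group, it should also be usable unchanged with `Series.str.extract(..., expand=False)`. For a file of about 1 GB, `pd.read_csv` accepts a `chunksize` argument, which lets the file be processed in pieces instead of being loaded at once.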

0 answers:

No answers