I have a CSV like this:
A B C D E F G
-- -- -- --------------------- --- -- --
G1 M1 C1 "2015-01-01 00:00:00" SR1 E1 N1
G1 M1 C1 "2015-01-01 00:00:00" SR1 E1 N2
G1 M1 C1 "2015-01-01 00:00:00" SR1 E1 N3
G2 M2 C1 "2015-01-01 00:00:00" SR1 E1 N1
G2 M2 C1 "1/1/2015 00:00:00" SR1 E1 N2
G2 M2 C1 "1/1/2015 00:00:00" SR1 E1 N3
I need to read it into a pandas DataFrame and ignore the quotes in column D so that I can parse it as a datetime column. This is what I tried:
df = pd.read_csv(
infile,
sep=r"\s*(?![0-9][0-9]:)",
skiprows=[1],
header=0,
quoting=csv.QUOTE_NONE
)
But the resulting df still contains the quotes:
>>> df
A B C D E F G
0 G1 M1 C1 "2015-01-01 00:00:00" SR1 E1 N1
1 G1 M1 C1 "2015-01-01 00:00:00" SR1 E1 N2
2 G1 M1 C1 "2015-01-01 00:00:00" SR1 E1 N3
3 G2 M2 C1 "2015-01-01 00:00:00" SR1 E1 N1
4 G2 M2 C1 "1/1/2015 00:00:00" SR1 E1 N2
5 G2 M2 C1 "1/1/2015 00:00:00" SR1 E1 N3
If I try to parse column D directly as a datetime column, pandas raises an error:
>>> pd.to_datetime(df.D)
...
ValueError: Unknown string format
How can I get column D into a form that pandas can parse as a date column?
pandas version: 0.19.2
Answer 0 (score: 3)
Demo:
In [44]: df = pd.read_csv(r'D:\download\1.csv', delim_whitespace=True, skiprows=[1],
parse_dates=['D'])
In [45]: df
Out[45]:
A B C D E F G
0 G1 M1 C1 2015-01-01 SR1 E1 N1
1 G1 M1 C1 2015-01-01 SR1 E1 N2
2 G1 M1 C1 2015-01-01 SR1 E1 N3
3 G2 M2 C1 2015-01-01 SR1 E1 N1
4 G2 M2 C1 2015-01-01 SR1 E1 N2
5 G2 M2 C1 2015-01-01 SR1 E1 N3
In [46]: df.dtypes
Out[46]:
A object
B object
C object
D datetime64[ns]
E object
F object
G object
dtype: object
where D:\download\1.csv contains:
A B C D E F G
-- -- -- --------------------- --- -- --
G1 M1 C1 "2015-01-01 00:00:00" SR1 E1 N1
G1 M1 C1 "2015-01-01 00:00:00" SR1 E1 N2
G1 M1 C1 "2015-01-01 00:00:00" SR1 E1 N3
G2 M2 C1 "2015-01-01 00:00:00" SR1 E1 N1
G2 M2 C1 "1/1/2015 00:00:00" SR1 E1 N2
G2 M2 C1 "1/1/2015 00:00:00" SR1 E1 N3
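The attempt in the question passes quoting=csv.QUOTE_NONE, which tells the parser to treat quote characters as ordinary text, so they stay in column D. With the default quoting, a whitespace-delimited read keeps each quoted timestamp together as one field and drops the quotes. Below is a minimal, self-contained sketch of the same idea that reads the sample from an in-memory string instead of a file. It makes a few assumptions that are not in the answer above: it uses sep=r"\s+" rather than delim_whitespace=True (which newer pandas releases deprecate), and it parses column D value by value because the two date layouts in the sample differ and newer pandas releases may not accept both in a single parse_dates pass. The names SAMPLE and load_sample are hypothetical.

```python
import io

import pandas as pd

# Inline copy of the sample file shown above.
SAMPLE = '''A B C D E F G
-- -- -- --------------------- --- -- --
G1 M1 C1 "2015-01-01 00:00:00" SR1 E1 N1
G1 M1 C1 "2015-01-01 00:00:00" SR1 E1 N2
G1 M1 C1 "2015-01-01 00:00:00" SR1 E1 N3
G2 M2 C1 "2015-01-01 00:00:00" SR1 E1 N1
G2 M2 C1 "1/1/2015 00:00:00" SR1 E1 N2
G2 M2 C1 "1/1/2015 00:00:00" SR1 E1 N3
'''


def load_sample(source):
    """Read the whitespace-delimited sample and parse column D as datetimes."""
    # Default quoting is left on, so each quoted timestamp stays one field
    # and the surrounding quotes are removed by the parser.
    # skiprows=[1] drops the row of dashes under the header.
    df = pd.read_csv(source, sep=r"\s+", skiprows=[1])

    # Assumption: parse each value on its own so that the ISO-style and
    # slash-style dates do not have to share one inferred format.
    parsed = [pd.Timestamp(text) for text in df["D"]]
    df["D"] = pd.to_datetime(parsed)
    return df


df = load_sample(io.StringIO(SAMPLE))
print(df.dtypes)
```

Slash-style dates such as 1/1/2015 are read month-first by default. That makes no difference for this sample, but if the real data is day-first, the parsing step would need to say so explicitly.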