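I wrote this loop to parse a 1-million-line .csv file. It works, but it only processes about 7k lines per minute. Is there a reasonable way to make it run faster?
The loop currently turns blocks of data into rows, strips the extra characters, and writes each row to a new .csv file.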
import csv
import re

pattern = re.compile(r",{2,}")  # two or more consecutive commas
with open("OceanData.csv") as infile, open("OceanParsed.csv", "w", newline="") as fout:
    outfile = csv.writer(fout)
    data = []
    for line in infile:
        # a dashed separator line marks the start of a new block
        if line.startswith("#--------------------------------------------------------------------------------"):
            outfile.writerow(data)
            continue
        for ch in ["[", "]", "'", " ", "\n"]:
            if ch in line:
                line = line.replace(ch, "")
        # repeats the same substitution once per character of the line
        for i in line:
            line = re.sub(pattern, ",", line)
            continue
        if not line: continue
        data.append(line)
Sample data: http://www.sharecsv.com/s/674dc42035c29eb4f250b5c2365c8dc6/OceanParseTest.csv
Answer 0 (score: 3)
Don't reinvent the wheel to read a csv file.
You can use pandas.
import pandas as pd
df = pd.read_csv('file.csv')
Or use the csv module from the standard library.
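As a minimal sketch of the standard-library route, assuming an ordinary comma-separated file (the helper name `read_rows` is made up here for illustration):

```python
import csv

def read_rows(path):
    """Yield each row of a CSV file as a list of strings, one row at a time."""
    with open(path, newline="") as f:
        # csv.reader consumes the file lazily, so rows are produced one by one
        for row in csv.reader(f):
            yield row
```

Because it is a generator, something like `for row in read_rows("file.csv"): ...` never needs the whole file in memory at once.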
If the approaches above don't work for a large csv file, you can split it into smaller files and start one process to read each of them.
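A rough sketch of that split-then-parallelize idea might look like the following. The names `split_file`, `count_lines` and `process_in_parallel` are hypothetical, `count_lines` is only a placeholder for the real per-file parsing, and splitting on a fixed line count is an assumption that ignores any record boundaries in the data.

```python
import os
from concurrent.futures import ProcessPoolExecutor

def split_file(path, out_dir, lines_per_file=100_000):
    """Write `path` out as numbered chunk files and return their paths."""
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    out = None
    with open(path) as src:
        for i, line in enumerate(src):
            if i % lines_per_file == 0:
                # start a new chunk file every `lines_per_file` lines
                if out is not None:
                    out.close()
                chunk_path = os.path.join(out_dir, f"chunk_{len(chunk_paths):04d}.txt")
                chunk_paths.append(chunk_path)
                out = open(chunk_path, "w")
            out.write(line)
    if out is not None:
        out.close()
    return chunk_paths

def count_lines(path):
    """Placeholder worker: swap in the real per-file parsing here."""
    with open(path) as f:
        return sum(1 for _ in f)

def process_in_parallel(chunk_paths):
    """Run the worker on every chunk file, spread across worker processes."""
    with ProcessPoolExecutor() as pool:
        return list(pool.map(count_lines, chunk_paths))
```

On platforms that spawn worker processes (Windows, macOS), call `process_in_parallel` from under an `if __name__ == "__main__":` guard.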
About your data sample:
I don't think your file is really in csv format. Suppose you have a section like this:
#--------------------------------------------------------------------------------,,,,,,
CAST ,,9285001,WOD Unique Cast Number,WOD code,,
NODC Cruise ID ,,US-10209 ,,,,
Originators Station ID ,,82,,,integer,
Originators Cruise ID ,, ,,,,
Latitude ,,-76.477,decimal degrees,,,
Longitude ,,166.3137,decimal degrees,,,
Year ,,1997,,,,
Month ,,1,,,,
Day ,,1,,,,
Time ,,3.9931,decimal hours (UT),,,
METADATA,,,,,,
Country ,, US,NODC code,UNITED STATES,,
Accession Number ,,520,NODC code,,,
Project ,,406,NODC code,RESEARCH ON OCEAN ATMOSPHERE VARIABILITY & ECOSYSTEM RESPONSE IN ROSS SEA,,
Platform ,,3596,OCL code,NATHANIEL B. PALMER (Icebr.;c.s.WBP3210;built 03.1992;old c.s.KUS1475;IMO900725,,
Institute ,,431,NODC code,US DOC NOAA NESDIS,,
Cast/Tow Number ,,1,,,,
High resolution CTD - Bottle,,9182488,,,,
probe_type ,,7,OCL_code,bottle/rossette/net,,
scale ,Temperature,103,WOD code,Temperature: ITS-90,,
Instrument ,Temperature,411,WOD code,CTD: SBE 911plus (Sea-Bird Electronics, Inc.),
VARIABLES ,Depth ,F,O,Temperatur ,F,O
UNITS ,m , , ,degrees C ,,
Prof-Flag , ,0, , ,0,
1,0,0, ,-1.591,0,
2,5,0, ,-1.668,0,
3,10,0, ,-1.702,0,
4,15,0, ,-1.733,0,
5,20,0, ,-1.746,0,
6,25,0, ,-1.76,0,
7,30,0, ,-1.773,0,
8,35,0, ,-1.785,0,
9,40,0, ,-1.796,0,
10,45,0, ,-1.805,0,
11,50,0, ,-1.813,0,
12,55,0, ,-1.823,0,
13,60,0, ,-1.832,0,
14,65,0, ,-1.84,0,
15,70,0, ,-1.848,0,
16,75,0, ,-1.855,0,
17,80,0, ,-1.861,0,
18,85,0, ,-1.867,0,
19,90,0, ,-1.873,0,
20,95,0, ,-1.878,0,
21,100,0, ,-1.882,0,
22,125,0, ,-1.892,0,
23,150,0, , ---0---,0,
24,175,0, , ---0---,0,
25,200,0, , ---0---,0,
26,225,0, , ---0---,0,
27,250,0, , ---0---,0,
28,275,0, , ---0---,0,
29,300,0, , ---0---,0,
30,325,0, , ---0---,0,
31,350,0, , ---0---,0,
32,375,0, , ---0---,0,
33,400,0, , ---0---,0,
34,425,0, , ---0---,0,
35,450,0, , ---0---,0,
36,475,0, , ---0---,0,
37,500,0, , ---0---,0,
38,550,0, ,-1.898,0,
END OF VARIABLES SECTION,,,,,,
Clean this section with the following script:
format.sh:
#!/usr/bin/env bash
# use : bash format.sh pathname
cat "$1" | \
grep -v '^#\|^END' | \
sed 's/,/ /g' | tr -s " " | sed 's/ /,/'
This gives:
CAST,9285001 WOD Unique Cast Number WOD code
NODC,Cruise ID US-10209
Originators,Station ID 82 integer
Originators,Cruise ID
Latitude,-76.477 decimal degrees
Longitude,166.3137 decimal degrees
Year,1997
Month,1
Day,1
Time,3.9931 decimal hours (UT)
METADATA,
Country,US NODC code UNITED STATES
Accession,Number 520 NODC code
Project,406 NODC code RESEARCH ON OCEAN ATMOSPHERE VARIABILITY & ECOSYSTEM RESPONSE IN ROSS SEA
Platform,3596 OCL code NATHANIEL B. PALMER (Icebr.;c.s.WBP3210;built 03.1992;old c.s.KUS1475;IMO900725
Institute,431 NODC code US DOC NOAA NESDIS
Cast/Tow,Number 1
High,resolution CTD - Bottle 9182488
probe_type,7 OCL_code bottle/rossette/net
scale,Temperature 103 WOD code Temperature: ITS-90
Instrument,Temperature 411 WOD code CTD: SBE 911plus (Sea-Bird Electronics Inc.)
VARIABLES,Depth F O Temperatur F O
UNITS,m degrees C
Prof-Flag,0 0
1,0 0 -1.591 0
2,5 0 -1.668 0
3,10 0 -1.702 0
4,15 0 -1.733 0
5,20 0 -1.746 0
6,25 0 -1.76 0
7,30 0 -1.773 0
8,35 0 -1.785 0
9,40 0 -1.796 0
10,45 0 -1.805 0
11,50 0 -1.813 0
12,55 0 -1.823 0
13,60 0 -1.832 0
14,65 0 -1.84 0
15,70 0 -1.848 0
16,75 0 -1.855 0
17,80 0 -1.861 0
18,85 0 -1.867 0
19,90 0 -1.873 0
20,95 0 -1.878 0
21,100 0 -1.882 0
22,125 0 -1.892 0
23,150 0 ---0--- 0
24,175 0 ---0--- 0
25,200 0 ---0--- 0
26,225 0 ---0--- 0
27,250 0 ---0--- 0
28,275 0 ---0--- 0
29,300 0 ---0--- 0
30,325 0 ---0--- 0
31,350 0 ---0--- 0
32,375 0 ---0--- 0
33,400 0 ---0--- 0
34,425 0 ---0--- 0
35,450 0 ---0--- 0
36,475 0 ---0--- 0
37,500 0 ---0--- 0
38,550 0 -1.898 0
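For anyone who would rather stay in Python, the same pipeline can be sketched roughly as below. This is only an approximation of format.sh under the assumption that each record sits on one line; `clean_line` and `clean_file` are names invented for this example.

```python
import re

def clean_line(line):
    """Return the cleaned line, or None for '#' separator and END lines."""
    line = line.rstrip("\n")
    if line.startswith("#") or line.startswith("END"):
        return None                         # grep -v '^#\|^END'
    line = line.replace(",", " ")           # sed 's/,/ /g'
    line = re.sub(r" {2,}", " ", line)      # tr -s " "
    # sed 's/ /,/': only the first space becomes a comma; trailing commas
    # in the input end up as a single trailing space
    return line.replace(" ", ",", 1)

def clean_file(src_path, dst_path):
    """Stream src_path through clean_line and write the kept lines."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            cleaned = clean_line(line)
            if cleaned is not None:
                dst.write(cleaned + "\n")
```

Each line gets one pass of plain string operations, with no loop over its characters.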
If you have 1M lines, I'd guess you have about 15,000 sections.
To simulate that, I built a test file by repeating one section:
for _ in `seq 1 15000`; do cat one_section.txt >> data.txt; done
Check:
grep -n ^# data.txt | cut -d : -f1 | wc -l
wc -l data.txt
ls -sh data.txt
This gives 15,000 sections, 960,000 lines and 34 MB.
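The section and line counts can also be double-checked from Python in a single pass. This is a small sketch that assumes, as in the sample, that every section starts with a line beginning with `#`; the function name is made up for illustration.

```python
def count_sections_and_lines(path):
    """Return (sections, lines), counting a section per line starting with '#'."""
    sections = 0
    lines = 0
    with open(path) as f:
        for line in f:
            lines += 1
            if line.startswith("#"):
                sections += 1
    return sections, lines
```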
....