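I am trying to read a number of files. Each one is a daily data file containing a record every 10 minutes, and the data in each file is "chunked" into blocks, one block per timestamp, as in the sample below.
Each file continues like this every 10 minutes for the whole day. The file name for this day is 151108.mnd. I want my code to read all of the November files, so 1511??.mnd. For each daily file in the month, I want it to grab every date-time line, so for the partial sample shown I want 2015-11-08 00:10:00, 2015-11-08 00:20:00, 2015-11-08 00:30:00, and so on, stored in a variable. It should then move on to the next day's file (151109.mnd), grab all of its date-time lines, and append them to the dates already stored, and so on for the whole month. Here is a partial sample of a data file: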
2015-11-08 00:10:00 00:10:00
# z speed dir W sigW bck error
30 3.32 111.9 0.15 0.12 1.50E+05 0
40 3.85 108.2 0.07 0.14 7.75E+04 0
50 4.20 107.9 0.06 0.15 4.73E+04 0
60 4.16 108.5 0.03 0.19 2.73E+04 0
70 4.06 93.6 0.03 0.23 9.07E+04 0
80 4.06 93.8 0.07 0.28 1.36E+05 0
2015-11-08 00:20:00 00:10:00
# z speed dir W sigW bck error
30 3.79 120.9 0.15 0.11 7.79E+05 0
40 4.36 115.6 0.04 0.13 2.42E+05 0
50 4.71 113.6 0.07 0.14 6.84E+04 0
60 5.00 113.3 0.13 0.17 1.16E+04 0
70 4.29 94.2 0.22 0.20 1.38E+05 0
80 4.54 94.1 0.11 0.25 1.76E+05 0
2015-11-08 00:30:00 00:10:00
# z speed dir W sigW bck error
30 3.86 113.6 0.13 0.10 2.68E+05 0
40 4.34 116.1 0.09 0.11 1.41E+05 0
50 5.02 112.8 0.04 0.12 7.28E+04 0
60 5.36 110.5 0.01 0.14 5.81E+04 0
70 4.67 95.4 0.14 0.16 7.69E+04 0
80 4.56 95.0 0.15 0.21 9.84E+04 0
...
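My current code has some problems: when I print the dates, it prints two copies of each date, and it only prints the first date of each file, i.e. 2015-11-08 00:10:00, 2015-11-09 00:10:00, and so on. What I want is for it to go line by line through each file, store all of the dates in that file, and then move on to the next file. Instead, it just grabs the first date in each file. Can anyone help with this code? Is there an easier way to do what I want? Thanks!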
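Answer 0 (score: 1)
Consider modifying the csv data line by line before reading it in as a dataframe. The script below opens each original file in the glob list and writes to a temporary file, moving the date into the last column and removing the repeated header lines and blank lines.
CSV data (assuming the text view of the csv file looks like the following; if yours differs, adjust the Python code accordingly)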
2015-11-0800:10:0000:10:00,,,,,,
z,speed,dir,W,sigW,bck,error
30,3.32,111.9,0.15,0.12,1.50E+05,0
40,3.85,108.2,0.07,0.14,7.75E+04,0
50,4.2,107.9,0.06,0.15,4.73E+04,0
60,4.16,108.5,0.03,0.19,2.73E+04,0
70,4.06,93.6,0.03,0.23,9.07E+04,0
80,4.06,93.8,0.07,0.28,1.36E+05,0
,,,,,,
2015-11-0800:10:0000:20:00,,,,,,
z,speed,dir,W,sigW,bck,error
30,3.79,120.9,0.15,0.11,7.79E+05,0
40,4.36,115.6,0.04,0.13,2.42E+05,0
50,4.71,113.6,0.07,0.14,6.84E+04,0
60,5,113.3,0.13,0.17,1.16E+04,0
70,4.29,94.2,0.22,0.2,1.38E+05,0
80,4.54,94.1,0.11,0.25,1.76E+05,0
,,,,,,
2015-11-0800:10:0000:30:00,,,,,,
z,speed,dir,W,sigW,bck,error
30,3.86,113.6,0.13,0.1,2.68E+05,0
40,4.34,116.1,0.09,0.11,1.41E+05,0
50,5.02,112.8,0.04,0.12,7.28E+04,0
60,5.36,110.5,0.01,0.14,5.81E+04,0
70,4.67,95.4,0.14,0.16,7.69E+04,0
80,4.56,95,0.15,0.21,9.84E+04,0
Python script
import glob
import os
import pandas as pd

filenames = sorted(glob.glob('1511??.mnd'))
temp = 'temp.csv'

# COLLECT ONE DATAFRAME PER FILE
frames = []

# ITERATE THROUGH EACH FILE IN GLOB LIST
for file in filenames:
    # DELETE PRIOR TEMP VERSION
    if os.path.exists(temp):
        os.remove(temp)

    header = 0
    # READ IN ORIGINAL CSV AND WRITE CLEANED ROWS TO TEMP CSV
    with open(file, 'r') as txt1, open(temp, 'a') as txt2:
        for rline in txt1:
            rline = rline.rstrip('\n')

            # SAVE DATE VALUE THEN SKIP ROW
            if "2015-11" in rline:
                date = rline.replace(',', '')
                continue

            # SKIP BLANK LINES (WITH OR WITHOUT COMMAS)
            if rline in ('', ',,,,,,'):
                continue

            # ADD NEW 'DATE' COLUMN ONCE AND SKIP REPEATED HEADER LINES
            if 'z,speed,dir,W,sigW,bck,error' in rline:
                if header == 0:
                    txt2.write(rline + ',date\n')
                    header = 1
                continue

            # APPEND LINE TO TEMP CSV WITH DATE VALUE
            txt2.write(rline + ',' + date + '\n')

    # READ TEMP FILE INTO A DATAFRAME
    frames.append(pd.read_csv(temp))

# COMBINE ALL FILES (DataFrame.append WAS REMOVED IN PANDAS 2.0)
data_nov15_hereford = pd.concat(frames, ignore_index=True)
Output
z speed dir W sigW bck error date
0 30 3.32 111.9 0.15 0.12 150000 0 2015-11-0800:10:0000:10:00
1 40 3.85 108.2 0.07 0.14 77500 0 2015-11-0800:10:0000:10:00
2 50 4.20 107.9 0.06 0.15 47300 0 2015-11-0800:10:0000:10:00
3 60 4.16 108.5 0.03 0.19 27300 0 2015-11-0800:10:0000:10:00
4 70 4.06 93.6 0.03 0.23 90700 0 2015-11-0800:10:0000:10:00
5 80 4.06 93.8 0.07 0.28 136000 0 2015-11-0800:10:0000:10:00
6 30 3.79 120.9 0.15 0.11 779000 0 2015-11-0800:10:0000:20:00
7 40 4.36 115.6 0.04 0.13 242000 0 2015-11-0800:10:0000:20:00
8 50 4.71 113.6 0.07 0.14 68400 0 2015-11-0800:10:0000:20:00
9 60 5.00 113.3 0.13 0.17 11600 0 2015-11-0800:10:0000:20:00
10 70 4.29 94.2 0.22 0.20 138000 0 2015-11-0800:10:0000:20:00
11 80 4.54 94.1 0.11 0.25 176000 0 2015-11-0800:10:0000:20:00
12 30 3.86 113.6 0.13 0.10 268000 0 2015-11-0800:10:0000:30:00
13 40 4.34 116.1 0.09 0.11 141000 0 2015-11-0800:10:0000:30:00
14 50 5.02 112.8 0.04 0.12 72800 0 2015-11-0800:10:0000:30:00
15 60 5.36 110.5 0.01 0.14 58100 0 2015-11-0800:10:0000:30:00
16 70 4.67 95.4 0.14 0.16 76900 0 2015-11-0800:10:0000:30:00
17 80 4.56 95.0 0.15 0.21 98400 0 2015-11-0800:10:0000:30:00
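If the .mnd files are whitespace-delimited, as in the question's sample, rather than comma-separated, the same idea (carry the most recent date onto each data row) could be sketched without a temporary file. This is only an illustration under assumptions: the function name parse_blocks is made up here, header lines are assumed to start with "#", and every data row is assumed to have exactly seven numeric fields.

```python
import io
import pandas as pd

COLUMNS = ['z', 'speed', 'dir', 'W', 'sigW', 'bck', 'error']

def parse_blocks(lines):
    """Build a DataFrame from date-headed blocks of whitespace-separated rows.

    Hypothetical helper: assumes '#' header lines and seven fields per row.
    """
    records = []
    date = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue                      # skip blank lines and repeated headers
        if line.startswith('2015-11'):
            date = line[:19]              # keep "YYYY-MM-DD HH:MM:SS"
            continue
        fields = line.split()
        if date is None or len(fields) != len(COLUMNS):
            continue                      # ignore preamble or malformed rows
        records.append([float(x) for x in fields] + [date])
    frame = pd.DataFrame(records, columns=COLUMNS + ['date'])
    frame['date'] = pd.to_datetime(frame['date'])
    return frame

# Small in-memory sample copied from the question's data format.
sample = io.StringIO(
    "2015-11-08 00:10:00 00:10:00\n"
    "# z speed dir W sigW bck error\n"
    "30 3.32 111.9 0.15 0.12 1.50E+05 0\n"
    "40 3.85 108.2 0.07 0.14 7.75E+04 0\n"
    "2015-11-08 00:20:00 00:10:00\n"
    "# z speed dir W sigW bck error\n"
    "30 3.79 120.9 0.15 0.11 7.79E+05 0\n"
)
df = parse_blocks(sample)
print(df)
```

For real files, the function could be called once per open file handle and the resulting frames combined with pd.concat.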
Answer 1 (score: 1)
Some observations:
First: why you only get the first date in each file:
f_nov15_hereford = pd.read_csv(i, skiprows = 32)
for line in f_nov15_hereford:
    if line.startswith("20"):
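The first line reads the file into a pandas dataframe. The second line iterates over the dataframe's columns, not its rows. As a result, the last line checks whether a column name starts with "20", which happens only once per file.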
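To see this behaviour in isolation, here is a minimal sketch using a small in-memory table. The column names and values are made up for illustration and are not taken from the question's files.

```python
import io
import pandas as pd

# Hypothetical two-row table whose header line happens to start with "20".
text = "2015-11-08 00:10:00,value\n30,3.32\n40,3.85\n"
df = pd.read_csv(io.StringIO(text))

# Iterating over a DataFrame yields its column labels, not its rows.
labels = [label for label in df]

# To visit rows, use an explicit row iterator such as itertuples().
rows = [tuple(row) for row in df.itertuples(index=False)]

print(labels)
print(rows)
```

Only the first label starts with "20", which is consistent with getting a single date per file.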
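Second: counter is initialized and its value is changed, but it is never used. I assume it was meant to be used to skip lines in the file.
Third: it may be simpler to collect all the dates into a Python list and then convert that into a pandas dataframe if needed.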
import pandas as pd
import glob
import datetime as dt

# number of lines to skip before the first date
offset = 32

# number of lines from one date to the next
recordlength = 9

pattern = '1511??.mnd'

# the date line starts with a 19-character timestamp, e.g. "2015-11-08 00:10:00"
fmt = '%Y-%m-%d %H:%M:%S'

dates = []

# sort the file names so the dates come out in chronological order
for filename in sorted(glob.glob(pattern)):
    with open(filename) as datafile:
        count = -offset
        for line in datafile:
            # count reaches 0 on each date line
            if count == 0:
                date_object = dt.datetime.strptime(line[:19], fmt)
                dates.append(date_object)
            count += 1
            if count == recordlength:
                count = 0

data_nov15_hereford = pd.DataFrame(dates, columns=['Dates'])
print(dates)
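The counting approach depends on offset and recordlength being the same in every file. If that is not guaranteed, one alternative is to recognise date lines by their shape instead of their position. The following is a rough sketch of that idea, not part of the approach above; the names DATE_RE, extract_dates and dates_from_files are hypothetical, and it assumes every date line begins with "YYYY-MM-DD HH:MM:SS" as in the question's sample.

```python
import glob
import re
import datetime as dt

# A date line is assumed to start with "YYYY-MM-DD HH:MM:SS".
DATE_RE = re.compile(r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}')

def extract_dates(lines):
    """Return a datetime for every line that begins with a timestamp."""
    found = []
    for line in lines:
        match = DATE_RE.match(line)
        if match:
            found.append(dt.datetime.strptime(match.group(0), '%Y-%m-%d %H:%M:%S'))
    return found

def dates_from_files(pattern):
    """Collect timestamps from every file matching the glob pattern."""
    all_dates = []
    for filename in sorted(glob.glob(pattern)):
        with open(filename) as datafile:
            all_dates.extend(extract_dates(datafile))
    return all_dates

# In-memory lines in the question's format, used here instead of a real file.
sample = [
    "2015-11-08 00:10:00 00:10:00\n",
    "# z speed dir W sigW bck error\n",
    "30 3.32 111.9 0.15 0.12 1.50E+05 0\n",
    "2015-11-08 00:20:00 00:10:00\n",
]
print(extract_dates(sample))
```

With real data, dates_from_files('1511??.mnd') would return the combined list for the month, which can then be passed to pd.DataFrame as before.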