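I am working with over 6MM rows of stock ticker data. I want to grab all of the data for a symbol, do the processing I need, and then output the results.
I have written code that tells me which line each ticker starts on (see the code below). I think it would be more efficient if I knew the starting position of each new symbol (rather than its line number), so I could use seek(#) to jump straight to the start of a ticker. I am also curious how to extend this logic to read the entire block of data for one ticker (start_position to end_position).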
import csv

data_line = 0  # holds the file line number for the symbol
ticker_start = 0
ticker_end = 0
cur_sec_ticker = ""
ticker_dl = []  # array for holding the line number in the source file for the start of each ticker

reader = csv.reader(open('C:\\temp\\sample_data.csv', 'rb'), delimiter=',')
for row in reader:
    if cur_sec_ticker != row[1]:  # only process a new ticker
        ticker_fr = str(data_line) + ',' + row[1]  # prep line for inserting into array
        # desired line for inserting into array, ticker_end would be the last
        # of the current ticker data block, which is the start of the next ticker
        # block (ticker_start - 1)
        #ticker_fr = str(ticker_start) + str(ticker_end) + str(data_line) + ',' + row[1]
        print ticker_fr
        ticker_dl.append(ticker_fr)
        cur_sec_ticker = row[1]
    data_line += 1
print ticker_dl
Below is a small sample of the data file:
seq,Symbol,Date,Open,High,Low,Close,Volume,MA200Close,MA50Close,PrimaryLast,filter_$
1,A,1/1/2008,36.74,36.74,36.74,36.74,0, , ,1,1
2,A,1/2/2008,36.67,36.8,36.12,36.3,1858900, , ,1,1
3,A,1/3/2008,36.3,36.35,35.87,35.94,1980100, , ,1,1
1003,AA,1/1/2008,36.55,36.55,36.55,36.55,0, , ,1,1
1004,AA,1/2/2008,36.46,36.78,36,36.13,7801600, , ,1,1
1005,AA,1/3/2008,36.18,36.67,35.74,36.19,7169000, , ,1,1
2005,AAN,4/20/2009,20,20.7,18.2067,18.68,808700, , ,1,1
2006,AAN,4/21/2009,18.7,19.06,18.6533,18.9933,530200, , ,1,1
2007,AAN,4/22/2009,19.2867,19.6267,18.54,19.1333,801100, , ,1,1
2668,AAP,1/1/2008,37.99,37.99,37.99,37.99,0, , ,1,1
2669,AAP,1/2/2008,37.99,38.15,37.17,37.59,1789200, , ,1,1
2670,AAP,1/3/2008,37.58,38.16,37.35,37.95,1584700, , ,1,1
3670,AAR,1/1/2008,22.94,22.94,22.94,22.94,0, , ,1,1
3671,AAR,1/2/2008,23.1,23.38,22.86,23.15,17100, , ,1,1
3672,AAR,1/3/2008,23,23,22,22.16,45600, , ,1,1
6886,ABB,1/1/2008,28.8,28.8,28.8,28.8,0, , ,1,1
6887,ABB,1/2/2008,29,29.11,28.23,28.64,4697700, , ,1,1
6888,ABB,1/3/2008,27.92,28.35,27.79,28.08,5240100, , ,1,1
Answer 0 (score: 1)
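In general, you can get the current position of a file object with its tell method. However, it is probably going to be difficult to make use of that with your current code, which delegates the file reading to the csv module. It is difficult even when reading line by line, since the underlying file object may read in larger chunks than a single line (the readline and readlines methods do some caching behind the scenes to hide this from you).
While I would skip the whole idea of reading from specific byte positions, if it is really worthwhile for your program you probably need to take care of reading the file yourself, so that you always know exactly where you are in the file. At that point there may be no need for tell at all.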
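If you do want to see what tell reports, here is a minimal sketch. It assumes Python 3 (not the Python 2 setup in the question) and a file opened in binary mode, where calling readline explicitly and asking tell before each call gives the byte offset of the line about to be read. The helper name line_offsets is made up for illustration.

```python
def line_offsets(path):
    """Yield (byte_offset, line_bytes) for each line of the file at `path`."""
    # Binary mode, so offsets are plain byte counts with no newline translation.
    with open(path, "rb") as f:
        while True:
            offset = f.tell()      # position of the line we are about to read
            line = f.readline()    # explicit readline(), not `for line in f`
            if not line:           # empty bytes object means end of file
                break
            yield offset, line
```

Each offset it yields can be passed to seek later to jump back to that line. I have not measured how this compares in speed with reading in larger chunks on a file of several million rows.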
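Something like this might work: it reads a large chunk of data, splits it into lines and values, and keeps track of how many bytes have been read so far: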
def generate_values(f):
    buf = "" # a buffer of data read from the file
    pos = 0 # the position of our buffer within the file
    while True: # loop until we return at the end of the file
        new_data = f.read(4096) # read up to 4k bytes at a time
        if not new_data: # quit if we got nothing
            if buf:
                yield pos, buf.split(",") # handle any data after last newline
            return
        buf += new_data
        line_start = 0 # index into buf
        try:
            while True: # loop until an exception is raised at end of buf
                line_end = buf.index("\n", line_start) # find end of line
                line = buf[line_start:line_end] # excludes the newline
                if line: # skips blank lines
                    yield pos+line_start, line.split(",") # yield pos,data tuple
                line_start = line_end+1
        except ValueError: # raised by `index()`
            pass
        # line_start is just past the last complete line (0 if this chunk had
        # no newline), so use it here; line_end may be unset or stale
        pos += line_start
        buf = buf[line_start:] # keep left over data from end of the buffer
If your file has line endings other than \n, you may need to tweak this a bit, but that shouldn't be too hard.
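To cover the second half of the question (reading one ticker's whole block from start_position to end_position), here is a rough sketch of how byte offsets could be turned into an index and then used with seek. It assumes Python 3, that all rows for a symbol are contiguous as in the sample, that the first line is a header, and that no field contains an embedded newline. The names build_ticker_index and read_block are hypothetical, not from any library.

```python
import csv
import io

def build_ticker_index(path, symbol_col=1):
    """Return {symbol: (start, end)} byte offsets for each symbol's block."""
    index = {}
    with open(path, "rb") as f:
        pos = len(f.readline())          # skip the header; pos = bytes consumed
        current, start = None, pos
        for line in f:
            if line.strip():             # ignore blank lines
                symbol = line.split(b",")[symbol_col].decode("ascii")
                if symbol != current:
                    if current is not None:
                        index[current] = (start, pos)  # previous block ends here
                    current, start = symbol, pos
            pos += len(line)             # len() includes the line ending bytes
        if current is not None:
            index[current] = (start, pos)  # close the final block
    return index

def read_block(path, start, end):
    """Seek to `start`, read up to `end`, and parse the bytes as CSV rows."""
    with open(path, "rb") as f:
        f.seek(start)
        chunk = f.read(end - start)
    text = chunk.decode("ascii")         # assumption: the data is plain ASCII
    return [row for row in csv.reader(io.StringIO(text, newline="")) if row]
```

Because the position is advanced by the full length of each raw line, the offsets should come out the same way whether the file uses \n or \r\n. Building the index still takes one full pass over the file, so it only pays off if you save the index and reuse it across runs.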