Using the following in Python 2.7:
dfile = 'new_data.txt' # Depth file no. 1
d_row = [line.strip() for line in open(dfile)]
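I have loaded the data file into a list with the newlines stripped. Now I want to find the indices of all elements of d_row whose string does not start with a digit and/or is empty. Next, I need to:
Sample data: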
Thu Mar 14 18:17:05 2013
Fri Mar 15 01:40:25 2013
FT
DepthChange: 0.000000,2895.336,0.000
1363285025.250000,9498.970
1363285025.300000,9498.970
1363285026.050000,9498.970
1363287840.450042,9458.010
1363287840.500042,9458.010
1363287840.850042,9458.010
1363287840.900042,9458.010
DepthChange: 0.000000,2882.810,9457.200
1363287840.950042,9458.010
DepthChange: 0.000000,2882.810,0.000
1363287841.000042,9457.170
1363287841.050042,9457.170
1363287841.100042,9457.170
1363287841.150042,9457.170
1363287841.200042,9457.170
1363287841.250042,9457.170
1363287841.300042,9457.170
1363291902.750102,9149.937
1363291902.800102,9149.822
1363291902.850102,9149.822
1363291902.900102,9149.822
1363291902.950102,9149.822
1363291903.000102,9149.822
1363291903.050102,9149.708
1363291903.100102,9149.708
1363291903.150102,9149.708
1363291903.200102,9149.708
1363291903.250102,9149.708
1363291903.300102,9149.592
1363291903.350102,9149.592
1363291903.400102,9149.592
1363291903.450102,9149.592
1363291903.500102,9149.592
DepthChange: 0.000000,2788.770,2788.709
1363291903.550102,9149.479
1363291903.600102,9149.379
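I have been doing the removal step by hand, which is very time-consuming because the file contains more than half a million lines. At the moment I cannot rewrite the file so that it keeps all the original elements with some modifications.
Any tips would be much appreciated.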
Answer 0 (score: 0)
dfile = 'new_data.txt'
with open(dfile) as infile:
    numericLines = set()  # line numbers of lines that start with digits
    emptyLines = set()    # line numbers of lines that are empty
    charLines = []        # stripped text of lines that start with a letter
    for lineno, line in enumerate(infile):
        if line[0].isalpha():
            charLines.append(line.strip())
        elif line[0].isdigit():
            numericLines.add(lineno)
        elif not line.strip():
            emptyLines.add(lineno)
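Answer 1 (score: 0)
The simplest approach is two passes: one collects the matching rows with their line numbers, the other collects the non-matching rows with theirs.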
d_rows = [line.strip() for line in open(dfile)]
good_rows = [(i, row) for i, row in enumerate(d_rows) if is_good_row(row)]
bad_rows = [(i, row) for i, row in enumerate(d_rows) if not is_good_row(row)]
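This does mean making two passes over the list, but who cares? If the list is small enough that you can read the whole thing into memory, as you are already doing, the extra cost is probably negligible.
Alternatively, if you need to avoid the cost of building two lists in two passes, you probably also need to avoid reading the whole file at once, so you have to do something a little smarter: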
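The answer never defines is_good_row. As a minimal sketch, assuming a "good" row is simply a non-empty row whose first character is a digit (an assumption drawn from the question, not from this answer), the two-pass version could be tried on a few sample rows like this:

```python
def is_good_row(row):
    """Hypothetical predicate: True for a non-empty row that starts with a digit."""
    return bool(row) and row[0].isdigit()

# A few rows taken from the sample data in the question, plus one empty row.
d_rows = [
    "Thu Mar 14 18:17:05 2013",
    "FT",
    "DepthChange: 0.000000,2895.336,0.000",
    "1363285025.250000,9498.970",
    "",
    "1363285025.300000,9498.970",
]

# Two passes over the same list, each keeping (index, row) pairs.
good_rows = [(i, row) for i, row in enumerate(d_rows) if is_good_row(row)]
bad_rows = [(i, row) for i, row in enumerate(d_rows) if not is_good_row(row)]

# The indices the question asks for: rows that are empty or do not start with a digit.
bad_indices = [i for i, _ in bad_rows]
good_indices = [i for i, _ in good_rows]
```

Checking bool(row) first keeps the predicate from raising IndexError on an empty string, which matters here because the rows have already been stripped.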
d_rows = (line.strip() for line in open(dfile)) # notice genexp, not list comp
good_rows, bad_rows = [], []
for i, row in enumerate(d_rows):
    if is_good_row(row):
        good_rows.append((i, row))
    else:
        bad_rows.append((i, row))
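If you can push things even further and do not need explicit good_rows and bad_rows lists at all, you can keep everything in iterators and waste neither memory nor up-front reading time: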
d_rows = (line.strip() for line in open(dfile)) # notice genexp, not list comp
with open(outfile, 'w') as f:
    for i, row in enumerate(d_rows):
        if is_good_row(row):
            f.write(row + '\n')
        else:
            whatever_you_wanted_to_do_with(i, row)
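To make the streaming variant concrete, here is a small self-contained sketch. The function name split_stream and the on_bad_row callback are hypothetical stand-ins for the placeholders above, the "starts with a digit" test is an assumption taken from the question, and io.StringIO objects stand in for the real input and output files:

```python
import io

def split_stream(infile, outfile, on_bad_row):
    """Copy rows that start with a digit to outfile; pass every other row to on_bad_row."""
    for i, line in enumerate(infile):
        row = line.strip()
        if row and row[0].isdigit():
            outfile.write(row + "\n")
        else:
            on_bad_row(i, row)

# In-memory stand-ins for the input and output files.
source = io.StringIO(
    u"FT\n"
    u"1363285025.250000,9498.970\n"
    u"\n"
    u"DepthChange: 0.000000,2882.810,0.000\n"
    u"1363285025.300000,9498.970\n"
)
sink = io.StringIO()

bad = []  # here the callback just records (index, row); it could do anything else
split_stream(source, sink, lambda i, row: bad.append((i, row)))
```

Only one line is held in memory at a time; whether the bad rows are stored, logged, or dropped is decided entirely by the callback.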
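Answer 2 (score: 0)
Thanks to everyone who answered my question. Using a part of each reply, I was able to get the result I wanted. What finally worked is the following: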
import numpy as np

goodrow_ind, badrow_ind, badrows = [], [], []
# ifile and ofile are the input and output file names
with open(ifile) as infile, open(ofile, 'w') as f:
    for i, row in enumerate(infile):
        if row[0].isdigit():
            f.write(row)
            goodrow_ind.append(i)
        else:
            badrow_ind.append(i)
            badrows.append(row)
data = np.loadtxt(ofile, delimiter=',')
The result is that the "good" and "bad" rows are separated, each with its index.
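If the intermediate file is not needed, the numeric rows can also be parsed while they are being separated. The sketch below uses only the standard library and is not part of the accepted approach; the function name parse_rows is hypothetical, and it assumes every row that starts with a digit has exactly two comma-separated fields (timestamp and depth), as in the sample data:

```python
def parse_rows(lines):
    """Return (samples, markers).

    samples: (timestamp, depth) float pairs from rows that start with a digit.
    markers: (line_index, text) pairs for every other row, including empty ones.
    """
    samples, markers = [], []
    for i, line in enumerate(lines):
        row = line.strip()
        if row and row[0].isdigit():
            # Assumption: exactly two comma-separated numeric fields per data row.
            timestamp, depth = row.split(",")
            samples.append((float(timestamp), float(depth)))
        else:
            markers.append((i, row))
    return samples, markers

# Three rows from the sample data in the question.
lines = [
    "DepthChange: 0.000000,2882.810,0.000\n",
    "1363287841.000042,9457.170\n",
    "1363287841.050042,9457.170\n",
]
samples, markers = parse_rows(lines)
```

Because parse_rows accepts any iterable of strings, an open file object can be passed in place of the list.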