Question

Using the following in Python 2.7:

dfile = 'new_data.txt'   #  Depth file no. 1
d_row = [line.strip() for line in open(dfile)]

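I have loaded the data file into a list with the newlines stripped. Now I want to index every element of d_row whose string is empty or does not start with a digit. After that, I need to: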

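remove all of the non-numeric entries described above, and
save the strings and their indices so they can be inserted into an updated file later.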

Sample data:

Thu Mar 14 18:17:05 2013                                                       
Fri Mar 15 01:40:25 2013

FT

DepthChange: 0.000000,2895.336,0.000
1363285025.250000,9498.970
1363285025.300000,9498.970
1363285026.050000,9498.970
1363287840.450042,9458.010
1363287840.500042,9458.010
1363287840.850042,9458.010
1363287840.900042,9458.010
DepthChange: 0.000000,2882.810,9457.200
1363287840.950042,9458.010
DepthChange: 0.000000,2882.810,0.000
1363287841.000042,9457.170
1363287841.050042,9457.170
1363287841.100042,9457.170
1363287841.150042,9457.170
1363287841.200042,9457.170
1363287841.250042,9457.170
1363287841.300042,9457.170
1363291902.750102,9149.937
1363291902.800102,9149.822
1363291902.850102,9149.822
1363291902.900102,9149.822
1363291902.950102,9149.822
1363291903.000102,9149.822
1363291903.050102,9149.708
1363291903.100102,9149.708
1363291903.150102,9149.708
1363291903.200102,9149.708
1363291903.250102,9149.708
1363291903.300102,9149.592
1363291903.350102,9149.592
1363291903.400102,9149.592
1363291903.450102,9149.592
1363291903.500102,9149.592
DepthChange: 0.000000,2788.770,2788.709
1363291903.550102,9149.479
1363291903.600102,9149.379

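I have been doing the removal step by hand, which is very time-consuming because the file contains more than half a million lines. At the moment I have no way to rewrite the file with all of the original elements plus a few modifications.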

Any tips would be much appreciated.

Answer 1

dfile = 'new_data.txt'
with open(dfile) as infile:
  numericLines = set() # line numbers of lines that start with a digit
  emptyLines = set() # line numbers of lines that are empty
  charLines = [] # (line number, text) pairs for lines that start with a letter
  for lineno, line in enumerate(infile):
    if line[0].isalpha(): # must be called; a bare line[0].isalpha is always truthy
      charLines.append((lineno, line.strip()))
    elif line[0].isdigit():
      numericLines.add(lineno)
    elif not line.strip():
      emptyLines.add(lineno)
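
In case it helps to try the idea on a few lines before running it on the whole file, here is a rough sketch of the same classification as a function. The name classify_lines is made up for illustration, and unlike the loop above it treats every non-empty line that does not start with a digit as text, not only lines that start with a letter.

```python
def classify_lines(lines):
    """Split lines into numeric, empty and text groups, keeping line numbers."""
    numeric, empty, text = set(), set(), []
    for lineno, line in enumerate(lines):
        stripped = line.strip()
        if not stripped:
            empty.add(lineno)                 # blank or whitespace-only line
        elif stripped[0].isdigit():
            numeric.add(lineno)               # data line
        else:
            text.append((lineno, stripped))   # keep index and text together
    return numeric, empty, text

# A few lines taken from the sample data in the question.
sample = ["Thu Mar 14 18:17:05 2013\n", "\n", "FT\n",
          "DepthChange: 0.000000,2895.336,0.000\n",
          "1363285025.250000,9498.970\n"]
numeric, empty, text = classify_lines(sample)
```

Since it accepts any iterable of strings, an open file object can be passed in directly.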

Answer 2

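The simplest way to do this is in two passes: one collects the non-matching lines together with their line numbers, the other collects the matching lines.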

d_rows = [line.strip() for line in open(dfile)]
good_rows = [(i, row) for i, row in enumerate(d_rows) if is_good_row(row)]
bad_rows = [(i, row) for i, row in enumerate(d_rows) if not is_good_row(row)]

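This does mean making two passes over the list, but who cares? If the list is small enough that you can read the whole thing into memory, as you are already doing, the extra cost is probably negligible.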

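Alternatively, if you need to avoid the cost of building the two lists in two passes, you probably also need to avoid reading the whole file at once, so you have to do something a bit cleverer: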

d_rows = (line.strip() for line in open(dfile)) # notice genexp, not list comp
good_rows, bad_rows = [], []
for i, row in enumerate(d_rows):
    if is_good_row(row):
        good_rows.append((i, row))
    else:
        bad_rows.append((i, row))

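If you can push things further still and do not need explicit good_rows and bad_rows lists at all, you can keep everything in iterators and waste no memory or up-front reading time: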

d_rows = (line.strip() for line in open(dfile)) # notice genexp, not list comp
with open(outfile, 'w') as f:
    for i, row in enumerate(d_rows):
        if is_good_row(row):
            f.write(row + '\n')
        else:
            whatever_you_wanted_to_do_with(i, row)
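
The snippets above call is_good_row without defining it. A minimal sketch, assuming a "good" row is simply one whose first character is a digit (that definition is my reading of the question, not something stated by the asker):

```python
def is_good_row(row):
    """Return True for data rows, i.e. rows whose first character is a digit."""
    # row[:1] is '' for an empty string and ''.isdigit() is False, so blank
    # rows count as "bad" instead of raising IndexError the way row[0] would.
    return row[:1].isdigit()
```

The slice matters here because the rows have already been stripped, so an empty line in the file arrives as an empty string.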

Answer 3

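Thanks to everyone who answered my question. Using parts of each reply, I was able to get the result I wanted. What finally worked is the following: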

import numpy as np

goodrow_ind, badrow_ind, badrows = [], [], []

# ifile and ofile are the input and output file names
with open(ifile) as f_in, open(ofile, 'w') as f_out:
    for i, row in enumerate(f_in):
        if row[0].isdigit():
            f_out.write(row)        # numeric rows go to the cleaned file
            goodrow_ind.append(i)
        else:
            badrow_ind.append(i)    # remember where the removed row was
            badrows.append(row)

data = np.loadtxt(ofile, delimiter=',')

The result is the "good" and "bad" rows separated, each with its own list of indices.
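
For the later re-insertion step that the question asks about, something along these lines might work. It is only a sketch: reinsert_rows is a hypothetical helper, and it assumes the saved indices are positions in the original file (as badrow_ind is) and that the kept rows are still in their original relative order.

```python
def reinsert_rows(good_rows, bad_indices, bad_rows):
    """Rebuild the original row order from the kept rows and the saved ones."""
    saved = dict(zip(bad_indices, bad_rows))  # original index -> removed row
    good_iter = iter(good_rows)
    merged = []
    for i in range(len(good_rows) + len(bad_rows)):
        if i in saved:
            merged.append(saved[i])           # this slot held a removed row
        else:
            merged.append(next(good_iter))    # otherwise take the next kept row
    return merged

# Tiny made-up example: two header-like rows removed from four rows.
original = ["FT\n", "1,2\n", "DepthChange: 0\n", "3,4\n"]
restored = reinsert_rows(["1,2\n", "3,4\n"], [0, 2], ["FT\n", "DepthChange: 0\n"])
```

The merged list can then be written out with writelines, since the rows still carry their newlines.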

Find and remove elements from a list while keeping their positions for later re-insertion
